Outcome data


Once instruments for the exposure trait have been specified, those variants need to be extracted from the outcome trait.

Available studies in IEU GWAS database

The IEU GWAS database (IGD) contains complete GWAS summary statistics from a large number of studies. You can browse them here:


To obtain details about the available GWASs programmatically do the following:

ao <- available_outcomes()
head(ao)
#>           id         trait ncase group_name year       author consortium
#> 1 ieu-b-5103 Schizophrenia  1234     public 2022 Trubetskoy V        PGC
#> 2 ieu-b-5102 Schizophrenia 52017     public 2022 Trubetskoy V        PGC
#> 3 ieu-b-5101 Schizophrenia 12305     public 2022 Trubetskoy V        PGC
#> 4 ieu-b-5100 Schizophrenia 64322     public 2022 Trubetskoy V        PGC
#> 5 ieu-b-5099 Schizophrenia 76755     public 2022 Trubetskoy V        PGC
#> 6 ieu-b-5098 Schizophrenia  5998     public 2022 Trubetskoy V        PGC
#>                 sex     pmid                         population  unit
#> 1 Males and Females 35396580         Hispanic or Latin American logOR
#> 2 Males and Females 35396580                           European logOR
#> 3 Males and Females 35396580                         East Asian logOR
#> 4 Males and Females 35396580                              Mixed logOR
#> 5 Males and Females 35396580                              Mixed logOR
#> 6 Males and Females 35396580 African American or Afro-Caribbean logOR
#>   sample_size       build ncontrol category subcategory      ontology
#> 1        4324 HG19/GRCh37     3090  Disease          NA MONDO:0005090
#> 2      127906 HG19/GRCh37    75889  Disease          NA MONDO:0005090
#> 3       27363 HG19/GRCh37    15058  Disease          NA MONDO:0005090
#> 4      155269 HG19/GRCh37    90947  Disease          NA MONDO:0005090
#> 5      320404 HG19/GRCh37   243649  Disease          NA MONDO:0005090
#> 6        9824 HG19/GRCh37     3826  Disease          NA MONDO:0005090
#>                                                                      note mr
#> 1                                                                    <NA> NA
#> 2                                                                    <NA> NA
#> 3                                                                    <NA> NA
#> 4                            Core - East Asian and European meta analysis NA
#> 5 Primary - meta analysis of Eur, East Asian, African American and Latino NA
#> 6                                                                    <NA> NA
#>   nsnp  doi coverage study_design priority sd
#> 1   NA <NA>     <NA>         <NA>       NA NA
#> 2   NA <NA>     <NA>         <NA>       NA NA
#> 3   NA <NA>     <NA>         <NA>       NA NA
#> 4   NA <NA>     <NA>         <NA>       NA NA
#> 5   NA <NA>     <NA>         <NA>       NA NA
#> 6   NA <NA>     <NA>         <NA>       NA NA

For information about authentication see https://mrcieu.github.io/ieugwasr/articles/guide.html#authentication.

The available_outcomes() function returns a table of all the studies available in the database. Each study has a unique ID, for example:

head(subset(ao, select = c(trait, id)))
#>           trait         id
#> 1 Schizophrenia ieu-b-5103
#> 2 Schizophrenia ieu-b-5102
#> 3 Schizophrenia ieu-b-5101
#> 4 Schizophrenia ieu-b-5100
#> 5 Schizophrenia ieu-b-5099
#> 6 Schizophrenia ieu-b-5098
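
When several studies cover the same trait, the metadata columns can help narrow the choice. The sketch below rebuilds a small stand-in for ao (here called ao_demo, a name used only for this example) from the rows printed above, so that it runs without querying the database, and picks the European-ancestry study with the largest sample size. With a live session the same subset() and order() calls could be applied to the real ao table.

```r
# Stand-in for `ao`, copied from the rows printed above so that the
# example runs offline. With a live session, use the real `ao` instead.
ao_demo <- data.frame(
    id = c("ieu-b-5103", "ieu-b-5102", "ieu-b-5101",
           "ieu-b-5100", "ieu-b-5099", "ieu-b-5098"),
    trait = "Schizophrenia",
    population = c("Hispanic or Latin American", "European", "East Asian",
                   "Mixed", "Mixed", "African American or Afro-Caribbean"),
    sample_size = c(4324, 127906, 27363, 155269, 320404, 9824),
    stringsAsFactors = FALSE
)

# Keep schizophrenia studies carried out in European-ancestry samples
eur <- subset(
    ao_demo,
    grepl("schizophrenia", trait, ignore.case = TRUE) & population == "European"
)

# Sort by sample size, largest first, and take the ID of the top study
eur <- eur[order(eur$sample_size, decreasing = TRUE), ]
chosen_id <- eur$id[1]
print(chosen_id)
```

Sample size is only one possible criterion; ancestry matching between the exposure and outcome studies is often just as relevant when choosing an ID.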

Extracting particular SNPs from particular studies

If we want to perform MR of BMI against coronary heart disease, we need to identify the SNPs that influence BMI, and then extract those SNPs from a GWAS on coronary heart disease.

Let’s get the Locke et al 2014 instruments for BMI as an example:

bmi_exp_dat <- extract_instruments(outcomes = 'ieu-a-2')
#>   pval.exposure samplesize.exposure chr.exposure se.exposure beta.exposure
#> 1   2.18198e-08              339152            1      0.0030       -0.0168
#> 2   4.56773e-11              339065            1      0.0031        0.0201
#> 3   5.05941e-14              313621            1      0.0087        0.0659
#> 4   5.45205e-10              338768            1      0.0029        0.0181
#> 5   1.88018e-28              338123            1      0.0030        0.0331
#> 6   2.28718e-40              339078            1      0.0037        0.0497
#>   pos.exposure id.exposure        SNP effect_allele.exposure
#> 1     47684677     ieu-a-2   rs977747                      G
#> 2     78048331     ieu-a-2 rs17381664                      C
#> 3    110082886     ieu-a-2  rs7550711                      T
#> 4    201784287     ieu-a-2  rs2820292                      C
#> 5     72837239     ieu-a-2  rs7531118                      C
#> 6    177889480     ieu-a-2   rs543874                      G
#>   other_allele.exposure eaf.exposure                      exposure
#> 1                     T       0.5333 Body mass index || id:ieu-a-2
#> 2                     T       0.4250 Body mass index || id:ieu-a-2
#> 3                     C       0.0339 Body mass index || id:ieu-a-2
#> 4                     A       0.5083 Body mass index || id:ieu-a-2
#> 5                     T       0.6083 Body mass index || id:ieu-a-2
#> 6                     A       0.2667 Body mass index || id:ieu-a-2
#>   mr_keep.exposure pval_origin.exposure data_source.exposure
#> 1             TRUE             reported                  igd
#> 2             TRUE             reported                  igd
#> 3             TRUE             reported                  igd
#> 4             TRUE             reported                  igd
#> 5             TRUE             reported                  igd
#> 6             TRUE             reported                  igd

We now need to find a suitable GWAS for coronary heart disease. We can search the available studies:

ao[grepl("heart disease", ao$trait), ]

The most recent CARDIOGRAM GWAS is ID number ieu-a-7. We can extract the BMI SNPs from this GWAS as follows:

chd_out_dat1 <- extract_outcome_data(
    snps = bmi_exp_dat$SNP,
    outcomes = 'ieu-a-7'
)

The extract_outcome_data() function is flexible. The snps argument only requires a vector of rsIDs, and the outcomes argument can be a vector of study IDs, e.g.

chd_out_dat2 <- extract_outcome_data(
    snps = c("rs234", "rs17097147"),
    outcomes = c('ieu-a-2', 'ieu-a-7')
)

will extract the two SNPs from each of the outcomes ieu-a-2 and ieu-a-7.
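
Not every requested SNP is guaranteed to come back from every study, so it can be useful to compare the request against the result. The following is a minimal sketch in base R. It uses a hypothetical stand-in (out_demo) for the data frame returned by extract_outcome_data() and assumes only the SNP and id.outcome columns; the pattern of missingness shown is invented for illustration.

```r
requested <- c("rs234", "rs17097147")

# Hypothetical result: rs234 found in both studies, rs17097147 in only one.
out_demo <- data.frame(
    SNP = c("rs234", "rs17097147", "rs234"),
    id.outcome = c("ieu-a-2", "ieu-a-2", "ieu-a-7"),
    stringsAsFactors = FALSE
)

# For each outcome, list the requested SNPs that were not returned
missing_by_outcome <- lapply(
    split(out_demo$SNP, out_demo$id.outcome),
    function(found) setdiff(requested, found)
)
print(missing_by_outcome)
```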

LD proxies

By default, if a requested SNP is not present in the outcome GWAS, then a SNP (proxy) that is in LD with the requested SNP (target) will be searched for instead. LD proxies are defined using 1000 Genomes European sample data. The effect of the proxy SNP on the outcome is returned, along with the proxy SNP, the effect allele of the proxy SNP, and the corresponding allele (in phase) for the target SNP.

The parameters for handling LD proxies are as follows:

  • proxies = TRUE or FALSE (TRUE by default)
  • rsq = numeric value of minimum rsq to find a proxy. Default is 0.8, minimum is 0.6
  • palindromes = Allow palindromic SNPs? Default is 1 (yes)
  • maf_threshold = If palindromes allowed then what is the maximum minor allele frequency of palindromes allowed? Default is 0.3.
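
To make the palindromes and maf_threshold parameters more concrete: a palindromic SNP is one whose two alleles are complementary (A/T or C/G), so the strand cannot be inferred from the alleles alone, and allele frequency only helps when the minor allele frequency is well below 0.5. The helper below, is_ambiguous_palindrome(), is a hypothetical illustration of that idea in base R. It is not the package's internal code, and the allele and frequency values are made up.

```r
# Hypothetical helper: flags palindromic SNPs whose minor allele
# frequency exceeds a threshold (the idea behind maf_threshold).
is_ambiguous_palindrome <- function(a1, a2, eaf, maf_threshold = 0.3) {
    pair <- paste0(toupper(a1), toupper(a2))
    palindromic <- pair %in% c("AT", "TA", "CG", "GC")
    maf <- pmin(eaf, 1 - eaf)  # minor allele frequency
    palindromic & maf > maf_threshold
}

flags <- is_ambiguous_palindrome(
    a1 = c("A", "A", "C", "G"),
    a2 = c("T", "T", "G", "A"),
    eaf = c(0.10, 0.45, 0.80, 0.45)
)
print(flags)
```

Only the second SNP is flagged: it is palindromic and its minor allele frequency (0.45) is above the threshold. The fourth SNP has the same frequency but G/A is not a palindromic pair.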

Using local GWAS summary data

If you have GWAS summary data that is not present in the IEU GWAS database, it can still be used to perform the analysis.

Supposing there is a GWAS summary file called “gwas_summary.csv” with e.g. 2 million rows and it looks like this:


To extract the exposure SNPs from this data, we would use the following command:

outcome_dat <- read_outcome_data(
    snps = bmi_exp_dat$SNP,
    filename = "gwas_summary.csv",
    sep = ",",
    snp_col = "rsid",
    beta_col = "effect",
    se_col = "SE",
    effect_allele_col = "a1",
    other_allele_col = "a2",
    eaf_col = "a1_freq",
    pval_col = "p-value",
    units_col = "Units",
    gene_col = "Gene",
    samplesize_col = "n"
)

This returns an outcome data frame with only the SNPs that were requested (if those SNPs were present in the “gwas_summary.csv” file).
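
As a rough illustration of the subsetting step, the sketch below writes a tiny made-up file with the column names used in the call above and pulls out the requested rows with base R. All numeric values, the gene labels, and the rsID rs0000001 are invented for the example. read_outcome_data() also standardises the column names, which is not reproduced here.

```r
# Made-up summary statistics with the column names assumed in the call above
gwas <- data.frame(
    rsid = c("rs977747", "rs17381664", "rs0000001"),  # last ID is invented
    effect = c(0.012, -0.020, 0.005),
    SE = c(0.010, 0.011, 0.009),
    a1 = c("G", "C", "A"),
    a2 = c("T", "T", "G"),
    a1_freq = c(0.53, 0.42, 0.10),
    `p-value` = c(0.23, 0.07, 0.58),
    Units = "log odds",
    Gene = c("GENE1", "GENE2", "GENE3"),
    n = c(180000, 180000, 175000),
    check.names = FALSE,
    stringsAsFactors = FALSE
)
csv_path <- tempfile(fileext = ".csv")
write.csv(gwas, csv_path, row.names = FALSE)

# Read the file back and keep only the requested SNPs.
# rs7550711 is requested but absent from the file, so it is not returned.
wanted <- c("rs977747", "rs17381664", "rs7550711")
full <- read.csv(csv_path, check.names = FALSE, stringsAsFactors = FALSE)
local_subset <- full[full$rsid %in% wanted, ]
print(local_subset$rsid)
```

Note the check.names = FALSE argument: without it, base R would rename the "p-value" column to "p.value", and the column name would no longer match the one passed as pval_col.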

Outcome data format

The extract_outcome_data function returns a table of SNP effects for the requested SNPs on the requested outcomes. The format of the data is similar to the exposure data format, except the main columns are as follows:

  • SNP
  • beta.outcome
  • se.outcome
  • samplesize.outcome
  • ncase.outcome
  • ncontrol.outcome
  • pval.outcome
  • eaf.outcome
  • effect_allele.outcome
  • other_allele.outcome
  • units.outcome
  • outcome
  • consortium.outcome
  • year.outcome
  • pmid.outcome
  • id.outcome
  • originalname.outcome
  • proxy.outcome
  • target_snp.outcome
  • proxy_snp.outcome
  • target_a1.outcome
  • target_a2.outcome
  • proxy_a1.outcome
  • proxy_a2.outcome
  • mr_keep.outcome
  • data_source.outcome
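
The proxy columns make it possible to see which rows came from an LD proxy rather than from the requested SNP. The sketch below uses a hypothetical stand-in for the returned data frame with invented SNP names. It assumes that proxy.outcome is a logical flag and that mr_keep.outcome marks rows retained for analysis, so both assumptions should be checked against your own output.

```r
# Hypothetical stand-in for a data frame returned by extract_outcome_data()
out_demo <- data.frame(
    SNP = c("rsTargetA", "rsTargetB", "rsTargetC"),
    proxy.outcome = c(FALSE, TRUE, FALSE),
    target_snp.outcome = c(NA, "rsTargetB", NA),
    proxy_snp.outcome = c(NA, "rsProxyB", NA),
    mr_keep.outcome = c(TRUE, TRUE, FALSE),
    stringsAsFactors = FALSE
)

# Rows where a proxy was substituted for the requested SNP.
# which() drops NA values, in case the flag is missing for some rows.
proxied <- out_demo[which(out_demo$proxy.outcome),
                    c("target_snp.outcome", "proxy_snp.outcome")]

# Rows flagged as usable for the MR analysis
usable <- out_demo[which(out_demo$mr_keep.outcome), ]
print(proxied)
print(nrow(usable))
```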

More advanced use of local data

We have developed a summary data format called “GWAS VCF”, which is designed to store GWAS results in a strict and performant way. It is possible to use this format with the TwoSampleMR package. Going down this avenue also allows you to use LD proxy functionality using your own LD reference files (or ones that we provide). For more details, see this package that explains the format and how to query it in R:


and this package for how to connect the data to other packages including TwoSampleMR
