---
title: "Exposure data"
author: "Gibran Hemani"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
bibliography: refs.bib
vignette: >
  %\VignetteIndexEntry{Exposure data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  cache = TRUE
)
```

```{r message=FALSE}
library(TwoSampleMR)
```

```{r, eval=FALSE, echo=FALSE}
# Get data for the vignette
ao <- available_outcomes()
bmi2014_exp_dat <- extract_instruments(outcomes = 'ieu-a-2')
bmi_exp_dat <- clump_data(bmi2014_exp_dat)
save(ao, bmi_exp_dat, bmi2014_exp_dat, file = file.path("inst", "extdata", "vig_exposure.RData"), compress = "xz")
```

```{r, echo=FALSE, eval=TRUE, warning=FALSE}
load(system.file("extdata", "vig_exposure.RData", package = "TwoSampleMR"))
```

## Introduction

A data frame of the instruments for an exposure is required. Each line has the information for one variant for one exposure. The minimum information required for MR analysis is the following:

- `SNP` - rs ID 
- `beta` - The effect size. If the trait is binary then log(OR) should be used
- `se` - The standard error of the effect size
- `effect_allele` - The allele of the SNP which has the effect marked in `beta`

Other information that is useful for MR can also be provided:

- `other_allele` - The non-effect allele
- `eaf` - The effect allele frequency
- `Phenotype` - The name of the phenotype for which the SNP has an effect

You can also provide the following extra information:

- `chr` - Physical position of variant (chromosome)
- `position` - Physical position of variant (position)
- `samplesize` - Sample size for estimating the effect size
- `ncase` - Number of cases
- `ncontrol` - Number of controls
- `pval` - The P-value for the SNP's association with the exposure
- `units` - The units in which the effects are presented
- `gene` - The gene or other annotation for the the SNP

## Reading in from a file

The data can be read in from a text file using the `read_exposure_data` function. The file must have a header with column names corresponding to the columns described above.

### Example 1: The default column names are used

An example of a text file with the default column names is provided as part of the package, the first few rows look like this:

```
Phenotype SNP beta se effect_allele other_allele eaf pval units gene samplesize
BMI rs10767664 0.19 0.0306122448979592 A T 0.78 5e-26 kg/m2 BDNF 225238
BMI rs13078807 0.1 0.0204081632653061 G A 0.2 4e-11 kg/m2 CADM2 221431
BMI rs1514175 0.07 0.0204081632653061 A G 0.43 8e-14 kg/m2 TNNI3K 207641
BMI rs1558902 0.39 0.0204081632653061 A T 0.42 5e-120 kg/m2 FTO 222476
BMI rs10968576 0.11 0.0204081632653061 G A 0.31 3e-13 kg/m2 LRRN6C 247166
BMI rs2241423 0.13 0.0204081632653061 G A 0.78 1e-18 kg/m2 LBXCOR1 227886
```

The exact path to the file will be different on everyone's computer, but it can be located like this:

```{r eval=TRUE}
bmi_file <- system.file("extdata", "bmi.txt", package = "TwoSampleMR")
```

You can read the data in like this:

```{r eval=TRUE}
bmi_exp_dat <- read_exposure_data(bmi_file)
head(bmi_exp_dat)
```

The output from this function is a new data frame with standardised column names:

- `SNP`
- `exposure`
- `beta.exposure`
- `se.exposure`
- `effect_allele.exposure`
- `other_allele.exposure`
- `eaf.exposure`
- `mr_keep.exposure`
- `pval.exposure`
- `pval_origin.exposure`
- `id.exposure`
- `data_source.exposure`
- `units.exposure`
- `gene.exposure`
- `samplesize.exposure`

The function attempts to match the columns to the ones it expects. It also checks that the data type is as expected.

If the required data for MR to be performed is not present (SNP name, effect size, standard error, effect allele) for a particular SNP, then the column `mr_keep.exposure` will be `FALSE`. 

### Example 2: The text file has non-default column names

If the text file does not have default column names, this can still be read in as follows. Here are the first few rows of an example:

```
rsid,effect,SE,a1,a2,a1_freq,p-value,Units,Gene,n
rs10767664,0.19,0.030612245,A,T,0.78,5.00E-26,kg/m2,BDNF,225238
rs13078807,0.1,0.020408163,G,A,0.2,4.00E-11,kg/m2,CADM2,221431
rs1514175,0.07,0.020408163,A,G,0.43,8.00E-14,kg/m2,TNNI3K,207641
rs1558902,0.39,0.020408163,A,T,0.42,5.00E-120,kg/m2,FTO,222476
```

Note that this is a CSV file, with commas separating fields. The file is located here:

```{r}
bmi2_file <- system.file("extdata/bmi.csv", package = "TwoSampleMR")
```

To read in this data:

```{r}
bmi_exp_dat <- read_exposure_data(
	filename = bmi2_file,
	sep = ",",
	snp_col = "rsid",
	beta_col = "effect",
	se_col = "SE",
	effect_allele_col = "a1",
	other_allele_col = "a2",
	eaf_col = "a1_freq",
	pval_col = "p-value",
	units_col = "Units",
	gene_col = "Gene",
	samplesize_col = "n"
)
head(bmi_exp_dat)
```

If the `Phenotype` column is not provided (as is the case in this example) then it will assume that the phenotype's name is simply "exposure". This is entered in the `exposure` column. It can be renamed manually:

```{r}
bmi_exp_dat$exposure <- "BMI"
```

## Using an existing data frame

If the data already exists as a data frame in R then it can be converted into the correct format using the `format_data()` function. For example, here is some randomly created data:

```{r}
random_df <- data.frame(
  SNP = c("rs1", "rs2"),
  beta = c(1, 2),
  se = c(1, 2),
  effect_allele = c("A", "T")
)
random_df
```

This can be formatted like so:

```{r}
random_exp_dat <- format_data(random_df, type = "exposure")
random_exp_dat
```

## Obtaining instruments from existing catalogues

A number of sources of instruments have already been curated and are available for use. They are provided as data objects in the `MRInstruments` package. To install:

```{r eval=FALSE}
remotes::install_github("MRCIEU/MRInstruments")
```

This package contains a number of data.frames, each of which is a repository of SNP-trait associations. How to access the data frames is detailed below:

### GWAS catalog

The NHGRI-EBI GWAS catalog contains a catalog of significant associations obtained from GWASs. This version of the data is filtered and harmonised to contain associations that have the required data to perform MR, to ensure that the units used to report effect sizes from a particular study are all the same, and other data cleaning operations.

To use the GWAS catalog:

```{r}
library(MRInstruments)
data(gwas_catalog)
head(gwas_catalog)
```

For example, to obtain instruments for body mass index using the Speliotes et al 2010 study:

```{r eval=FALSE}
bmi_gwas <-
  subset(gwas_catalog,
         grepl("Speliotes", Author) &
           Phenotype == "Body mass index")
bmi_exp_dat <- format_data(bmi_gwas)
```

### Metabolites

Independent top hits from GWASs on `r length(unique(MRInstruments::metab_qtls$phenotype))` metabolites in whole blood are stored in the `metab_qtls` data object. Use `?metab_qtls` to get more information.

```{r}
data(metab_qtls)
head(metab_qtls)
```

For example, to obtain instruments for Alanine:

```{r}
ala_exp_dat <- format_metab_qtls(subset(metab_qtls, phenotype == "Ala"))
```

### Proteins

Independent top hits from GWASs on `r length(unique(MRInstruments::proteomic_qtls$analyte))` protein levels in whole blood are stored in the `proteomic_qtls` data object. Use `?proteomic_qtls` to get more information.

```{r}
data(proteomic_qtls)
head(proteomic_qtls)
```

For example, to obtain instruments for the ApoH protein:

```{r warning=FALSE}
apoh_exp_dat <-
  format_proteomic_qtls(subset(proteomic_qtls, analyte == "ApoH"))
```

### Gene expression levels

Independent top hits from GWASs on `r length(unique(MRInstruments::gtex_eqtl$gene_name))` gene identifiers and in `r length(unique(MRInstruments::gtex_eqtl$tissue))` tissues are available from the GTEX study in `gtex_eqtl`. Use `?gtex_eqtl` to get more information.

```{r}
data(gtex_eqtl)
head(gtex_eqtl)
```

For example, to obtain instruments for the IRAK1BP1 gene expression levels in subcutaneous adipose tissue:

```{r}
irak1bp1_exp_dat <-
  format_gtex_eqtl(subset(
    gtex_eqtl,
    gene_name == "IRAK1BP1" & tissue == "Adipose Subcutaneous"
  ))
```

### DNA methylation levels

Independent top hits from GWASs on `r length(unique(MRInstruments::aries_mqtl$gene_name))` DNA methylation levels in whole blood across `r length(unique(MRInstruments::aries_mqtl$timepoint))` time points are available from the ARIES study in `aries_mqtl`. Use `?aries_mqtl` to get more information.

```{r}
data(aries_mqtl)
head(aries_mqtl)
```

For example, to obtain instruments for cg25212131 CpG DNA methylation levels in at birth:

```{r}
cg25212131_exp_dat <-
  format_aries_mqtl(subset(aries_mqtl, cpg == "cg25212131" &
                             age == "Birth"))
```

### IEU GWAS database

The IEU GWAS database contains the entire summary statistics for thousands of GWASs. You can browse them here: https://gwas.mrcieu.ac.uk/

You can use this database to define the instruments for a particular exposure. You can also use this database to obtain the effects for constructing polygenic risk scores using different p-value thresholds.

You can check the status of the API:

```{r, eval=FALSE}
ieugwasr::api_status()
```

To obtain a list and details about the available GWASs do the following:

```{r, eval=FALSE}
ao <- available_outcomes()
head(ao)
```

For information about authentication see <https://mrcieu.github.io/ieugwasr/articles/guide.html#authentication>.

The `available_outcomes()` function returns a table of all the available studies in the database. Each study has a unique ID. e.g., You might obtain

```r
head(subset(ao, select = c(trait, id)))
#>           trait         id
#> 1 Schizophrenia ieu-b-5103
#> 2 Schizophrenia ieu-b-5102
#> 3 Schizophrenia ieu-b-5101
#> 4 Schizophrenia ieu-b-5100
#> 5 Schizophrenia ieu-b-5099
#> 6 Schizophrenia ieu-b-5098
```

To extract instruments for a particular trait using a particular study, for example to obtain SNPs for body mass index using the Locke et al. 2015 GIANT study, you specify the study ID as follows:

```{r, eval=FALSE}
bmi2014_exp_dat <- extract_instruments(outcomes = 'ieu-a-2')
```

```r
str(bmi2014_exp_dat)
#> 'data.frame':    79 obs. of  15 variables:
#>  $ pval.exposure         : num  2.18e-08 4.57e-11 5.06e-14 5.45e-10 1.88e-28 ...
#>  $ samplesize.exposure   : num  339152 339065 313621 338768 338123 ...
#>  $ chr.exposure          : chr  "1" "1" "1" "1" ...
#>  $ se.exposure           : num  0.003 0.0031 0.0087 0.0029 0.003 0.0037 0.0031 0.003 0.0038 0.003 ...
#>  $ beta.exposure         : num  -0.0168 0.0201 0.0659 0.0181 0.0331 0.0497 -0.0227 0.0221 0.0209 0.0175 ...
#>  $ pos.exposure          : int  47684677 78048331 110082886 201784287 72837239 177889480 49589847 96924097 164567689 181550962 ...
#>  $ id.exposure           : chr  "ieu-a-2" "ieu-a-2" "ieu-a-2" "ieu-a-2" ...
#>  $ SNP                   : chr  "rs977747" "rs17381664" "rs7550711" "rs2820292" ...
#>  $ effect_allele.exposure: chr  "G" "C" "T" "C" ...
#>  $ other_allele.exposure : chr  "T" "T" "C" "A" ...
#>  $ eaf.exposure          : num  0.5333 0.425 0.0339 0.5083 0.6083 ...
#>  $ exposure              : chr  "Body mass index || id:ieu-a-2" "Body mass index || id:ieu-a-2" "Body mass index || id:ieu-a-2" "Body mass index || id:ieu-a-2" ...
#>  $ mr_keep.exposure      : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
#>  $ pval_origin.exposure  : chr  "reported" "reported" "reported" "reported" ...
#>  $ data_source.exposure  : chr  "igd" "igd" "igd" "igd" ...
```

This returns a set of LD clumped SNPs that are GWAS significant for BMI. You can specify various parameters for this function:

- `p1` = P-value threshold for keeping a SNP
- `clump` = Whether or not to return independent SNPs only (default is `TRUE`)
- `r2` = The maximum LD R-square allowed between returned SNPs
- `kb` = The distance in which to search for LD R-square values

By changing changing the `p1` parameter it is possible to obtain SNP effects for constructing polygenic risk scores.

## Clumping

For standard two sample MR it is important to ensure that the instruments for the exposure are independent. Once instruments have been identified for an exposure variable, the IEU GWAS database can be used to perform clumping.

You can provide a list of SNP IDs, the SNPs will be extracted from 1000 genomes data, LD calculated between them, and amongst those SNPs that have LD R-square above the specified threshold only the SNP with the lowest P-value will be retained. To do this, use the following command:

```{r, eval=FALSE}
bmi_exp_dat <- clump_data(bmi2014_exp_dat)
```

```r
str(bmi_exp_dat)
#> 'data.frame':    30 obs. of  16 variables:
#>  $ SNP                   : chr  "rs10767664" "rs13078807" "rs1514175" "rs1558902" ...
#>  $ beta.exposure         : num  0.19 0.1 0.07 0.39 0.11 0.13 0.06 0.09 0.13 0.06 ...
#>  $ se.exposure           : num  0.0306 0.0204 0.0204 0.0204 0.0204 ...
#>  $ effect_allele.exposure: chr  "A" "G" "A" "A" ...
#>  $ other_allele.exposure : chr  "T" "A" "G" "T" ...
#>  $ eaf.exposure          : num  0.78 0.2 0.43 0.42 0.31 0.78 0.41 0.24 0.21 0.21 ...
#>  $ pval.exposure         : num  5e-26 4e-11 8e-14 5e-120 3e-13 ...
#>  $ units.exposure        : chr  "kg/m2" "kg/m2" "kg/m2" "kg/m2" ...
#>  $ gene.exposure         : chr  "BDNF" "CADM2" "TNNI3K" "FTO" ...
#>  $ samplesize.exposure   : int  225238 221431 207641 222476 247166 227886 209051 218439 209849 220081 ...
#>  $ exposure              : chr  "BMI" "BMI" "BMI" "BMI" ...
#>  $ mr_keep.exposure      : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
#>  $ pval_origin.exposure  : chr  "reported" "reported" "reported" "reported" ...
#>  $ units.exposure_dat    : chr  "kg/m2" "kg/m2" "kg/m2" "kg/m2" ...
#>  $ id.exposure           : chr  "FXhiAH" "FXhiAH" "FXhiAH" "FXhiAH" ...
#>  $ data_source.exposure  : chr  "textfile" "textfile" "textfile" "textfile" ...
```

The `clump_data()` function takes any data frame that has been formatted to be an exposure data type of data frame. Note that for the instruments in the MRInstruments package the SNPs are already LD clumped.

**Note:** The LD reference panel only includes SNPs (no INDELs). There are five super-populations from which LD can be calculated, by default European samples are used. Only SNPs with MAF > 0.01 within-population are available.

**NOTE:** If a variant is dropped from your unclumped data it could be because it is absent from the reference panel. For more flexibility, including using your own LD reference data, see here: https://mrcieu.github.io/ieugwasr/