---
title: "Nonparametric bounds for the average causal effect: bpbounds examples"
author:
- name: "Tom Palmer"
  affiliation: "MRC IEU and Population Health Sciences, Bristol Medical School, University of Bristol"
  email: tom.palmer@bristol.ac.uk
- name: "Roland Ramsahai"
- name: "Vanessa Didelez"
  affiliation: "Leibniz BIPS and University of Bremen"
- name: "Nuala Sheehan"
  affiliation: "University of Leicester"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    keep_md: true
    fig_caption: yes
bibliography: bibliography.bib
vignette: >
  %\VignetteIndexEntry{Nonparametric bounds for the average causal effect: bpbounds examples}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr", repos = "https://cloud.r-project.org/")
}
if (!requireNamespace("tidyr", quietly = TRUE)) {
  install.packages("tidyr", repos = "https://cloud.r-project.org/")
}
```

## Introduction
This short vignette demonstrates the use of the **bpbounds** package. This is a R implementation of the of the nonparametric bounds for the average causal effect of @balke-jasa-1997 and some extensions. Currently this R package is a port of our **bpbounds** Stata package (@palmer-sj-2011). The code implements the approach of calculating the bounds outlined by @ramsahai-uai-2007, @ramsahai-thesis, and @ramsahai-jmlr-2012.

We start by loading our package and the others needed for the code in this vignette.
```{r}
library(bpbounds)
library(dplyr)
library(tidyr)
```

## Features of the bpbounds package
Currently the package has one function, `bpbounds()`, which can accommodate the four scenarios implemented in our Stata command. These are to calculate the bounds for a binary outcome, a binary treatment/phenotype, and:

* a binary instrumental variable;
* a 3 category instrumental variable, e.g. a genotype in Mendelian randomization;

Bounds for these scenarios can be calculated with either

* trivariate data (in which all three variables are measured in one dataset);
* or bivariate data (in which the treatment/phenotype and instrument are measured in one dataset and the outcome and instrument are measured in another sample).


## Vitamin A supplementation example 
This example is taken from Table 1 of @balke-jasa-1997. It is from a study of Vitamin A supplementation consisting of children in 450 villages. Of these, 221 villages were assigned to control and 229 to the treatment. The outcome was mortality.

First we setup a `data.frame` of cell counts and convert this into a `table`. Here we follow the notation that `x` is the received treatment, `y` is the outcome, and `z` is the randomized treatment (the instrumental variable).

```{r}
tab1dat <- data.frame(
  z = c(0, 0, 1, 1, 1, 1, 0, 0),
  x = c(0, 0, 0, 0, 1, 1, 1, 1),
  y = c(0, 1, 0, 1, 0, 1, 0, 1),
  freq = c(74, 11514, 34, 2385, 12, 9663, 0, 0)
)

tab1inddat <- uncount(tab1dat, freq)
xt         <- xtabs(~ x + y + z, data = tab1inddat)
xt
```
Next we use `prop.table()` to calculate the conditional probabilities, $P(Y,X|Z)$, and then run `bpbounds()` assuming the data is trivariate. Although we could call `bpbounds()` using the table of cell counts `xt` directly, i.e. `bpbounds(xt)`.
```{r}
p <- prop.table(xt, margin = 3)
p
```

```{r}
bpres <- bpbounds(p)
sbp   <- summary(bpres)
print(sbp)
```
Therefore, the average causal effect is bounded between -0.1946 and 0.0054. The estimate of the ACE is found as 
$$\text{ACE} = \frac{\text{cov}(Y,Z)}{\text{cov}(X,Z)}$$
as follows.
```{r}
covyz <- cov(tab1inddat$y, tab1inddat$z)
covxz <- cov(tab1inddat$x, tab1inddat$z)
ace   <- covyz / covxz
ace
```

### Entering the data as conditional probabilities
If you already know the conditional probabilities you could pass them to `bpbounds()` as follows.
```{r}
condprob <- c(.0064, 0, .9936, 0, .0028, .001, .1972, .799)
tabp <- array(condprob,
              dim = c(2, 2, 2),
              dimnames = list(
                x = c(0, 1),
                y = c(0, 1),
                z = c(0, 1)
              )) %>%
  as.table()
bpbounds(tabp)
```

### Treating the data as bivariate
To demonstrate the features of the command we can treat this data as bivariate.
```{r}
gtab <- xtabs(~ y + z, data = tab1inddat)
gp   <- prop.table(gtab, margin = 2)
gp

ttab <- xtabs(~ x + z, data = tab1inddat)
tp   <- prop.table(ttab, margin = 2)
tp

bpres2 <- bpbounds(p = gp, t = tp, fmt = "bivariate")
sbp2   <- summary(bpres2)
print(sbp2)
```

## Mendelian randomization example
Mendelian randomization is an approach in epidemiology due to @daveysmith-ije-2003 in which genotypes established to be robustly associated with phenotypes are used as instrumental variables in order to better estimate the causal effect of the phenotype on a disease outcome.

This example uses data from @meleady-ajcn-2003. It is trivariate data with 3 category instrument and binary phenotype and outcomes. The instrument is the 677CT polymorphism (rs1801133) in the Methylenetetrahydrofolate Reductase gene, involved in folate metabolism, as an instrumental variable to investigate the effect of homocysteine on cardiovascular
disease. 

The data are presented to us as conditional probabilities, so we take care to enter them in the correct position in the vectors.
```{r}
mt3 <- c(.83, .05, .11, .01, .88, .06, .05, .01, .72, .05, .20, .03)
p3 <- array(mt3,
            dim = c(2, 2, 3),
            dimnames = list(
              x = c(0, 1),
              y = c(0, 1),
              z = c(0, 1, 2)
            ))
p3     <- as.table(p3)
bpres3 <- bpbounds(p3)
sbp3   <- summary(bpres3)
print(sbp3)
```
We see that in this example the monotonicity inequality is not satisfied.

## Simulated example that does not satisy the IV conditions 
This example recreates that found in section 8.3 of our Stata Journal paper which uses simulated data to show that the IV inequality does not always detect violations to the underlying model assumptions. We simulate two outcomes, `y1` and `y2`, for which the assumptions are violated because there is a direct effect of the instrument, `z`, on the outcome. However, the direct effect on `y2` is much smaller in magnitude than the direct effect on `y1`.

```{r}
set.seed(2232011)
n      <- 10000
z      <- rbinom(n, 1, .5)
u      <- rbinom(n, 1, .5)
px     <- .05 + .1 * z + .1 * u
x      <- rbinom(n, 1, px)
p1     <- .1 + .2 * z + .05 * x + .1 * u
y1     <- rbinom(n, 1, p1)
p2     <- .1 + .05 * z + .05 * x + .1 * u
y2     <- rbinom(n, 1, p2)

tab1   <- xtabs(~ x + y1 + z)
p1     <- prop.table(tab1, margin = 3)

bpres1 <- bpbounds(p1)
sbp1   <- summary(bpres1)
print(sbp1)
```
Due to the strong direct effect of the instrument on the outcome, both the instrumental variable and monotonicity inequalities are not satisifed.

```{r}
tab2   <- xtabs(~ x + y2 + z)
p2     <- prop.table(tab2, margin = 3)

bpres2 <- bpbounds(p2)
sbp2   <- summary(bpres2)
print(sbp2)
```
However, with the weaker direct effect we see that both are satisfied when the true underlying data generating model does not satisfy the instrumental variable assumptions.

## Conclusion
In conclusion, we have demonstrated the use of our **bpbounds** package implementing nonparametric bounds for the average causal effect for a range of data scenarios.

## References