---
title: Illustrative examples for evaluating transmission linkage
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Illustrative examples for evaluating transmission linkage}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

In this vignette, we apply the functions provided to estimate sensitivity and specificity and apply them to an Ebola-like pathogen. We compare the results to those of a pathogen with a slower mutation rate and higher reproductive number. Then, we will use these results to calculate the false discovery rate of our linkage criteria (genetic distance) given a sample size of 200 and a relevant population of 1000 infected individuals.

In the first part of this vignette, we use the function `gendist_roc_format()`, which formats the results of `gendist_sensspec_cutoff()` for plotting ROC curves.

```{r setup, message=FALSE}
library(phylosamp)
library(ggplot2)
library(RColorBrewer)
library(cowplot)
```

### Calculating sensitivity and specificity for an Ebola-like pathogen

```{r gen_dist_compare_ebov, fig.width=6, fig.height=4, warning=FALSE}

# ebola-like pathogen
R <- 1.5
mut_rate <- 1

# use simulated generation distributions
data("genDistSim")
mean_gens_pdf <- as.numeric(genDistSim[genDistSim$R == R, -(1:2)])

# get theoretical genetic distance dist based on mutation rate and generation parameters
dists <- as.data.frame(gendist_distribution(mut_rate = mut_rate, 
                                            mean_gens_pdf = mean_gens_pdf, 
                                            max_link_gens = 1))

# reshape dataframe for plotting
dists <- reshape2::melt(dists, 
                        id.vars = "dist", 
                        variable.name = "status", 
                        value.name = "prob")

# get sensitivity and specificity using the same paramters
roc_calc <- gendist_roc_format(cutoff = 1:(max(dists$dist)-1), 
                               mut_rate = mut_rate,
                               mean_gens_pdf = mean_gens_pdf)
  
# get the optimal value for the ROC plot
roc_optim <- optim_roc_threshold(roc_calc)
threshold <- roc_optim$cutoff
optim_point <- c(roc_optim$specificity, roc_optim$sensitivity)

# plot the distributions of genetic distance
pal <- RColorBrewer::brewer.pal(5, "PuOr")
  
p1 <- ggplot(dists, aes(x=dist, y=prob, fill=status)) +
    geom_bar(alpha=0.5, stat="identity", position="identity") +
    scale_fill_manual(name="", values=c(pal[5], pal[2])) +
    scale_x_continuous(limits = c(-1,35), expand = c(0,0)) +
    scale_y_continuous(limits = c(0,0.4), expand = c(0,0)) +
    xlab('Genetic distance') + ylab('Density') +
    theme_classic() + theme(legend.position='none') +
    geom_vline(xintercept=threshold, linetype=2, size=0.5, color = "black")
  
# add ROC curve on top
r1 <- ggplot(data=roc_calc, aes(x=specificity, y=sensitivity)) +
    geom_line(size=1.5) + xlim(0,1) + ylim(0,1) +
    geom_point(x=optim_point[1], y=optim_point[2],
               size = 3, stroke=1, shape=21, fill = "chartreuse") +
    geom_abline(slope=1, intercept = 0, linetype=2, alpha=0.5) +
    xlab("1-specificity") +
    theme_bw()
  
cowplot::ggdraw(p1) + cowplot::draw_plot(r1, hjust=-0.16, vjust=-0.16, scale=0.6)

```

### Optimal values for an Ebola-like pathogen

Using the closest-to-corner method, the optimal genetic distance threshold for this pathogen is:
```{r optim_value}
threshold
```

And the sensitivity and specificity at this threshold is:
```{r sens_spec_values}
roc_optim$sensitivity ## sensitivity
1 - roc_optim$specificity ## specificity
```

### Calculating sensitivity and specificity for a faster-spreading, slower mutating pathogen

We repeat the above plot for a pathogen with a higher reproductive number and lower mutation rate:

```{r gen_dist_compare_mev}

# faster-spreading pathogen
R <- 3
mut_rate <- 0.2

mean_gens_pdf <- genDistSim[genDistSim["R"]==R][3:dim(genDistSim)[2]]

```

```{r gen_dist_mev_plot, fig.width=6, fig.height=4, echo=FALSE, warning=FALSE}

# get theoretical genetic distance dist based on mutation rate and generation parameters
dists <- as.data.frame(gendist_distribution(mut_rate = mut_rate, 
                                            mean_gens_pdf = mean_gens_pdf, max_link_gens = 1))
# reshape dataframe for plotting
dists <- reshape2::melt(dists, id.vars = "dist", variable.name = "status", value.name = "prob")

# get sensitivity and specificity using the same paramters
roc_calc <- gendist_roc_format(cutoff = 1:(max(dists$dist)-1), mut_rate = mut_rate,
                          mean_gens_pdf = mean_gens_pdf)
  
# get the optimal value for the ROC plot using closest-to-corner metric
roc_optim <- roc_calc
roc_optim$dist <- sqrt((1-roc_optim$sensitivity)^2 + (roc_optim$specificity)^2)
roc_optim <- roc_optim[2:(dim(roc_optim)[1]),] # remove first row with zero threshold
optim_value <- min(roc_optim$dist)
threshold <- roc_optim[roc_optim$dist==optim_value,]$cutoff
optim_point <- c(roc_optim[roc_optim$cutoff==threshold,]$specificity,
                 roc_optim[roc_optim$cutoff==threshold,]$sensitivity)
  
# plot the distributions of genetic distance
pal <- RColorBrewer::brewer.pal(5, "PuOr")
  
p1 <- ggplot(dists, aes(x=dist, y=prob, fill=status)) +
    geom_bar(alpha=0.5, stat="identity", position="identity") +
    scale_fill_manual(name="", values=c(pal[5], pal[2])) +
    scale_x_continuous(limits = c(-1,35), expand = c(0,0)) +
    scale_y_continuous(limits = c(0,0.4), expand = c(0,0)) +
    xlab('Genetic distance') + ylab('Density') +
    theme_classic() + theme(legend.position='none') +
    geom_vline(xintercept=threshold, linetype=2, size=0.5, color = "black")
  
# add ROC curve on top
r1 <- ggplot(data=roc_calc, aes(x=specificity, y=sensitivity)) +
    geom_line(size=1.5) + xlim(0,1) + ylim(0,1) +
    geom_point(x=optim_point[1], y=optim_point[2],
               size = 3, stroke=1, shape=21, fill = "chartreuse") +
    geom_abline(slope=1, intercept = 0, linetype=2, alpha=0.5) +
    xlab("1-specificity") +
    theme_bw()
  
cowplot::ggdraw(p1) + cowplot::draw_plot(r1, hjust=-0.16, vjust=-0.16, scale=0.6)

```

### Optimal values for a faster-spreading, slower mutating pathogen

For this pathogen, the optimal genetic distance threshold is:
```{r optim_value_slow}
threshold
```

And the sensitivity and specificity at this threshold is:
```{r sens_spec_values_slow}
roc_optim[roc_optim$cutoff==threshold,]$sensitivity ## sensitivity
1 - roc_optim[roc_optim$cutoff==threshold,]$specificity ## specificity
```

It is well understood that genetic distance is a much more accurate predictor of transmission linkage for pathogens with higher mutation rates and slower spread, as shown clearly in this example (lower sensitivity and specificity in the second example). Genetic distance is clearly not an appropriate metric for all pathogens, but more sophisticated linkage criteria -- especially those that incorporate epidemiological information as well -- may produce better results.

### Calculating the false discovery rate of genetic distance as a linkage criteria

This is reflected in our confidence in identifying linked infection pairs. Assuming a sample size of 200, a relevant population of 1000, and the genetic distance threshold and associated sensitivity and specificity for each pathogen we get:

```{r fdr_final}
translink_tdr(sensitivity=0.9963402, specificity=0.9801562, 
              rho=0.20, M=200, R=1) # ebola-like pathogen
translink_tdr(sensitivity=0.8187308, specificity=0.8539157, 
              rho=0.20, M=200, R=1) # slower-mutating pathogen
```

As a final note, we remind the reader that the specificity of the linkage criteria has a significant bearing on the true/false discovery rate, and that high true discovery rates require a specificity upwards of 0.999. Linkage criteria more complex than genetic distance may be needed to achieve this value.