As part of a presentation on bias modeling for RNA-seq data, I put together a Shiny app that allows one to visually explore the differences between Salmon and kallisto on all samples for the RNA-seq datasets GEUVADIS and SEQC, the former a large RNA-seq project with annotated batches and sequencing centers and the latter a large benchmarking RNA-seq project. Running the two programs on these real, large datasets reveals many differences in transcript abundance quantification, in particular stability and reliability of abundance estimates across sequencing center.

For all datasets, we see that Salmon has more consistent estimation across the labs performing the experiment, as described in detail in the Salmon paper. You can click on the overview plots to explore the estimated counts by different methods for individual genes, where often the inconsistencies are driven by mis-estimation of which isoform is expressed within a gene.

The Salmon kallisto diffs GitHub repository where you can find the Shiny app comparing Salmon and kallisto abundance estimates on these benchmarking datasets.

But then I was asked to elaborate more so here goes:

TL/DR summary: I think DESeq2 and edgeR are both good methods for gene-level DE, as is limma-voom.

Aside: The bootstrapping idea within the sleuth method seems especially promising for transcript-level DE, but I have only kept up with the ideas and haven’t done any testing or benchmarking with sleuth yet, so I won’t include it in this post.

History: When we started working on DESeq2 in Fall 2012, one of the main differences we were focusing on was a methodological detail* of the dispersion shrinkage steps. Aside from this difference, we wanted to update DESeq to use the GLM framework and to shrink dispersion estimates toward a central value as in edgeR, as opposed to the maximum rule that was previously implemented in DESeq (which tends to overestimate dispersion).

I would say that the difference in dispersion shrinkage didn’t make a huge difference in performance compared to edgeR, as can be seen in real and simulated** data benchmarks in our DESeq2 paper published in 2014. From what I’ve seen in my own testing and numerous third-party benchmarks, the two methods typically report overlapping sets of genes, and have similar performance for gene-level DE testing.

Our paper was just published describing a new method for modeling and correcting fragment sequence bias for estimation of transcript abundances from RNA-seq:

GC content bias – affecting the amplification of fragments – is widespread in sequencing datasets and well-known, yet there were no existing transcript abundance estimation methods that properly corrected for this sample-specific bias. We identified hundreds of cases where top methods mis-identified the dominant isoform due to fragment-level GC bias being left uncorrected. While there are existing methods for post-hoc correction of gene abundance estimates using gene-level GC content, the transcript-level bias correction task is much more difficult. Sample-specific technical variation in coverage on small regions of transcripts leads to dramatic shifts in abundance estimates.

Q: What do you mean by “systematic errors”?

You can split any errors into systematic and stochastic components, with the former a fixed quantity and the latter varying but with zero mean. With the measurement of transcript expression, the stochastic part can be minimized by, for example, increasing sequencing depth and increasing the sample size: the number of biological units measured to infer a population mean expression value. We used the term “systematic” in the title to underscore that higher sequencing depth and more samples will not help to remove the bias we describe.

Q: Is there a video where you explain the gist of this?

Yes. Below is a 5 minute video where I cover the basic ideas presented in the paper.

Q: When should I correct for fragment GC bias? Are all my results wrong?

There are two situations where it’s critical to correct for GC bias for transcript-level analysis:

(1) The most problematic situation is when one is comparing abundances across groups of samples, and the groups have variable GC content dependence. This often happens when samples are processed in different labs or at different times (ideally experiments should not be designed this way, but it is nevertheless common in genomics research). We demonstrate that this can lead to thousands of false positives results for differential expression of transcripts, and these differences will often rise to the very top of a ranked list. The solutions are to use methods which produce GC-bias-corrected transcript abundance estimates, to use block experimental design, to include experimental batch as a covariate in statistical analysis, and to examine GC content dependence across samples using software like FASTQC and MultiQC.

(2) The second point of warning is when any of the samples in the dataset contain strong GC dependence, that is, the library preparation was not able to adequately amplify fragments with GC content < 35% or > 65%*. Such samples were present in a subset of the batches of all the datasets we examined in the paper. Even if all of the samples come from a single batch, or the experiment has a block design with multiple batches, simple descriptive analysis of transcript abundance (e.g. how many isoforms are expressed per gene, which isoforms are expressed) will be inaccurate for hundreds-to-thousands of genes when ignoring fragment GC bias. Furthermore, differential expression, if present, could be attributed to the wrong isoform or isoforms within genes.

* These guidelines come from the Discussion section of an excellent paper on technical reproducibility of RNA-seq by the GEUVADIS project: ‘t Hoen et al (2013).

Q: Where are the details of the alpine statistical method?

The statistical method is detailed in the Supplementary Note. This allowed us to have more fine grained control of the presentation of LaTeX equations. The methods in the Supplementary Note were originally in the main text and were refined during peer review.

Q: Is alpine the only method implementing fragment GC bias correction?

No, the latest version of Salmon (v0.7, with methods described in the bioRxiv preprint) also implements a fragment GC bias correction similar to alpine, which runs at a fraction of the time. Salmon with the gcBias flag similarly reduces the mis-estimation of isoforms caused by fragment GC bias (see examples in Supp. Figure 5 of the latest Salmon preprint).

Q: Is the Salmon implementation of GC bias correction better than alpine?

The GC bias correction model is probably about equal, but other aspects of the Salmon method are superior to alpine. The alpine method, when estimating transcript abundances, focuses on one gene or locus at a time, and does not account for multi-mapping fragments across genes. The Salmon implementation is a full, transcriptome-wide estimation method which can simultaneously estimate and correct for sample-specific fragment GC bias (and other biases as well).

Our focus in writing alpine was to make a super-extensible method for modeling RNA-seq biases, in order to make rigorous comparisons of various methods for bias correction, and then to show that fragment GC bias correction in particular leads to more accurate estimates of transcript abundance. By super-extensible, I mean that one can easily modify details of the bias correction specification: any combination of bias effects (fragment length distribution, positional, random hexamer sequence bias, GC bias), interactions between these, or modifications to the spline parameters used to fit positional and fragment GC bias. It is also possible to take empirical fragment GC bias curves from alpine and directly incorporate these into the Polyester RNA-seq simulator. This results in simulated RNA-seq data with variable coverage that looks more similar to real RNA-seq coverage.

Links to papers mentioned in video:

Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias (2011) https://www.ncbi.nlm.nih.gov/pubmed/21410973

Working on our PH525x online course material, Rafa and I wanted to base all lecture material in Rmd files, as these are easy for students to load into RStudio to walk through the code. Additionally, the rendered markdown files can be displayed nicely on GitHub Pages (which uses Jekyll to turn markdown in HTML). All we needed to do is copy the markdown files into the /pages directory and they already look pretty great when viewed on github.com (also we borrowed the theme from Karl Broman’s simple site template). By adding MathJax to the header of the GitHub Page template, we also render latex math formula from the original Rmd files.

Below I go over the issues we encountered with including math, but you can jump to the bottom if you just want our solution.

I just read Andrew Gelman’s post about an article with his name on it starting with an inaccurate definition of p-value. I sympathize with all parties. Journalists and editors are just trying to reduce technical terms by presenting layperson definitions. Earlier this year I caught a similar inaccurate definition on a site defining statistical terms for journalists. So admittedly, this wrong definition must be incredibly attractive to our minds for some reason.

A simple rule for testing if your definition is wrong:

If knowing the truth could make the p-value (as you have defined it) go to zero, then it is wrongly defined.

Let’s test:

wrong definition: p-value is the probability that a result is due to random chance

Suppose we know the truth, that the result is not due to random chance. Then the p-value as defined here should be zero. So this definition is wrong. The Wikipedia definition is too technical. I prefer the one from Gelman’s post:

right definition: “the probability of seeing something at least as extreme as the data, if the model […] were true”

where “model” refers to the boring model (e.g. the coin is balanced, the card guesser is not clairvoyant, the medicine is a sugar pill, etc.). This definition does not fail my test. We can calculate a probability under a model even when we know that model is not true.

Here is a bit of code for making a heatmap, which orders the rows of a matrix such that the first column (as ordered by in the dendrogram) has all 0s then all 1s, then the 2nd column is similarly ordered in two groups conditioning on the 1st column and so on. Hard to explain but easy to see in the picture below. I came up with raising to the power of 2 quickly, but then it took me a few minutes to realize I have to multiply the columns by the order of the order.

x <- replicate(10,sample(0:1,100,TRUE))
library(gplots)
hc <- hclust(dist(t(x)))
y <- sweep(x,2,2^(ncol(x)-order(hc$order)),"*")
z <- x[order(rowSums(y)),]
heatmap.2(z, trace="none", key=FALSE,
Rowv=FALSE,labRow=FALSE,
Colv=as.dendrogram(hc),
dendrogram="column",
scale="none",col=c("grey","blue"),
lwid=c(2,10))