HT Sequencing

October 6, 2009

Here’s a good reference manual for using R and Bioconductor to work with high throughput sequencing data.  The slides are also a good introduction to the sequencing methods.

High throughput sequencing (HT-Seq or HTS), also known as next generation sequencing (NGS), presents a wide spectrum of opportunities for genome research. Unfortunately, many existing bioinformatic tools do not scale well to large datasets consisting of tens of millions of sequences generated by technologies like Illumina/Solexa, Roche/454, ABI/SOLiD and Helicos. The Bioconductor project fills this gap by providing a rapidly growing suite of well designed R packages for analyzing traditional and HT-Seq datasets.


Measure of structure in residual matrix

September 14, 2009

David Skillicorn’s book Understanding Complex Datasets presents a method for picking k: the number of meaningful components from a principal components analysis / SVD.  I tried implementing the algorithm for a simulated dataset.  I generate n observations from k components of differing size in p dimensions, then add noise.

The scree plot is a plot of the variance explained by each next principal component / reduced rank approximation of the original data.  This is by definition a decreasing plot, and it is suggested to look for a flattening of the curve to indicate the point at which the remaining components are noise.

The method mentioned in Skillicorn is to look at the residual matrix after subtracting away a reduced rank approximation of the data.  This matrix either still has some structure from remaining components or is just noise.  If it is just noise, then swapping the signs of the entries should not change the 2-norm (the largest singular value) very much (I swapped the signs 10 times and took the average of all the 2-norms).  The proposed measure of residual structure is the 2-norm of the residual matrix minus the 2-norm of the altered matrix over the Frobenius norm of the residual matrix.  I then take the log of this, so the minimum is easier to see in a plot.

Here is the R code.

Here’s an easy one:

svd.structure2

Here’s a plot of one instance, where the true number of components is 15, but with lots of noise added which overpowers the smaller components.  One nice property is that it is fairly convex, because the Frobenius norm of the residual matrix in the denominator goes to zero.

svd.structure


Kolmogorov-Smirnov test

September 10, 2009

The Kolmogorov-Smirnov test has an interesting derivation: to test if a sample of independent observations comes from a certain distribution, you line up the empirical cumulative distribution function with the theoretical CDF and find the greatest vertical distance between the two. If the sample comes from the specified distribution, then the max distance times the square root of the sample size has the same distribution as the maximum of the absolute value of the Brownian bridge, the Kolmogorov distribution. The Brownian bridge looks like Brownian motion/Weiner process except it is tethered at two points.  The test makes intuitive sense: under the null, the empirical and theoretical CDF’s are tethered at 0 and 1, and have some random variation in between.

If test statistic (distance * sqrt(n)) is above a certain quantile of the Kolmogorov distribution, then you reject the null hypothesis, and assume the sample is from a different distribution.  There is a simple alternative KS-test to see if two observed samples are from the same distribution (like the Mann-Whitney-Wilcoxon test).

# get 10 random normal observations
x <- rnorm(10)
plot(ecdf(x),verticals=TRUE,do.points=FALSE,col.hor="red",col.vert="red",xlab="",xlim=c(-3,3),lwd=2,main="empirical cdf")
lines(-30:30/10,pnorm(-30:30/10))

ecdf1

ks.test(x,"pnorm",exact=FALSE)
test <- ks.test(x,"pnorm",exact=FALSE)
D <- test$statistic

One-sample Kolmogorov-Smirnov test
D = 0.2553, p-value = 0.5322
alternative hypothesis: two-sided

# take a look at the Brownian bridge
library(sde)
plot(BBridge(N=5000))
abline(h=0,lty=2)

ecdf2

# try to approximate the Kolmogorov distribution
# with 1000 Brownian bridges
K <- replicate(1000,max(abs(BBridge(N=10000))))
plot(ecdf(K),verticals=TRUE,do.points=FALSE,main="distribution of K")
f <- function(x,i) { exp(-(2*i-1)^2*pi^2/(8*x^2)) }
kolm <- function(x) { sqrt(2*pi)/x*(f(x,1)+f(x,2)+f(x,3)+f(x,4)) }
lines(1:200/100,kolm(1:200/100),col="red")

ecdf3
# monte carlo approximation
1 - sum(K < D*sqrt(10))/1000
0.531
# numerical approximation
1 - kolm(D*sqrt(10))
0.5321873


Colors for scientific visualizations

September 2, 2009

An interesting article from IBM Research about color schemes for science visualizations.  The obvious recommendations are: if the zero-crossing (or crossing of any level) is important in the data, the color gradient should highlight this; low saturation colors can be better; and the rainbow color gradient is always the wrong choice.

The color article made me rethink one of the standard displays for microarray data: a heatmap with a green to red color gradient, green as low expression and red as high expression, rows as genes and columns as samples.

colors1

This is a problem for the ~5% people that are color blind, especially since red-green is one of the more common types of color blindness.  I’ve seen purple to yellow color gradients, but I’m trying cyan to yellow now, since it has comparable lightness on both ends:

colors2

R code:


cyan.yellow.cols <- rgb(c(rep(0,11),1:10/10),c(10:1/10,0,1:10/10),c(10:1/10,rep(0,11)),1)
green.red.cols <- rgb(c(rep(0,11),1:10/10),c(10:1/10,rep(0,11)),0,1)
exprs <- matrix(rnorm(100),ncol=10)
library(Heatplus)
heatmap_2(exprs,col=green.red.cols,scale="none",legend=1)
heatmap_2(exprs,col=cyan.yellow.cols,scale="none",legend=1)


The Obama slide

September 2, 2009

In a recent NYTimes op-ed, David Brooks argues that people are losing faith in the president:

The result is the Obama slide, the most important feature of the current moment. The number of Americans who trust President Obama to make the right decisions has fallen by roughly 17 percentage points. Obama’s job approval is down to about 50 percent. All presidents fall from their honeymoon highs, but in the history of polling, no newly elected American president has fallen this far this fast.

Anxiety is now pervasive. Trust in government rose when Obama took office. It has fallen back to historic lows. Fifty-nine percent of Americans now think the country is headed in the wrong direction.

This is just why I love Pollster and FiveThirtyEight: you can see how one person’s interpretation of a series of polls fits with the actual trend lines.  The upturn in “wrong track” responses is slight compared to the 80-90% of recent history:

polling

Other interesting polling charts: Nate Silver at FiveThirtyEight in a recent post uses historical data and hypothetical 2012 polls to argue that crossing the 50% barrier doesn’t doom a president in terms of midterm elections or re-election.


Regulatory networks

August 27, 2009

I’ve been reading a few papers on computational methods for finding regulatory networks from gene expression data. From “Module networks revisited”:

Following Hartwell et al. (1999) a ‘module’ is to be viewed as a discrete entity composed of many types of molecules and whose function is separable from that of other modules. Understanding the general principles that determine the structure and function of modules and the parts they are composed of can be considered one of the main problems of contemporary systems biology (Hartwell et al., 1999). The module network method of Segal et al. (2003) addresses this problem using gene expression data as its input. It has yielded novel biological insights in a number of complex eukaryotic systems (…) and has been the source of inspiration for numerous computational approaches to network inference as evidenced by its high number of citations.

The JMLR paper applies the module/regulator framework to stocks as well as genes.  The Module Networks algorithm is an EM algorithm that iterates between 1) building regression trees using known regulator genes to predict the expression of a module of other genes and 2) reassigning genes to modules which have a better likelihood of explaining their variance.  There are many subtleties to the implementation including preventing cyclical graphs between regulators and modules, merging and splitting of modules, and smaller updates which refit the distributions at the leaves of the regression tree without rebuilding the tree.

ng1165-F1

Papers:

  • Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data (2003)
    http://www.nature.com/ng/journal/v34/n2/abs/ng1165.html
  • Learning Module Networks (2005)
    http://jmlr.csail.mit.edu/papers/v6/segal05a.html
  • Computational methods for discovering gene networks from expression data (2009)
    http://bib.oxfordjournals.org/cgi/content/full/10/4/408
  • Module networks revisited: computational assessment and prioritization of model predictions (2009)
    http://bioinformatics.oxfordjournals.org/cgi/content/abstract/25/4/490?ijkey=86c3db23239abe8fe69fe4650e8da36738ae72de&keytype2=tf_ipsecsha

Multiple lattice plots

August 11, 2009

Lattice is a package for R that makes Trellis graphics (originally developed for S by William S. Cleveland and colleagues at Bell Labs). Quick-R has an overview and examples of the different lattice plotting functions.  One quirk of the software is that lattice wants all data presented in the same format.  Quick-R: “~x|A means display numeric variable x for each level of factor A. y~x | A*B means display the relationship between numeric variables y and x separately for every combination of factor A and B levels.”

Lattice makes good looking plots, but you can’t use standard graphics parameters like mfrow=c(x,y) to show multiple plots at once. My lab mate Owen showed me a trick to get lattice to draw multiple plots for a number of numeric variables using melt() from the reshape package.  melt() expands a data frame into variable/value columns for one or more id columns.  Here’s an example of a data frame d, with 10 rows of numerical variables a,b,c and a condition factor equal to 1 or 2.

library(lattice)

library(reshape)

d <- data.frame(a=rnorm(10), b=rnorm(10), c=rnorm(10), cond=factor(sample(1:2,10,replace=TRUE)))

stripplot(value~cond|variable,melt(d, id=”cond”),scales=list(relation=”free”))

Setting the scales parameter allows the panels to have different scales for their axes; setting the parameter as.table=TRUE would have plotted from top left to bottom right.

lattice_example

You can also get the same plots as above using featurePlot in the caret package: featurePlot(x, y).

To get density plots of the predictor variables a,b,c:

densityplot(~value|variable,melt(d),scales=list(relation=”free”))

densityexample