Archive for the 'statistics' Category

HT Sequencing

October 6, 2009

Here’s a good reference manual for using R and Bioconductor to work with high throughput sequencing data.  The slides are also a good introduction to the sequencing methods.

High throughput sequencing (HT-Seq or HTS), also known as next generation sequencing (NGS), presents a wide spectrum of opportunities for genome research. Unfortunately, many existing bioinformatic tools do not scale well to large datasets consisting of tens of millions of sequences generated by technologies like Illumina/Solexa, Roche/454, ABI/SOLiD and Helicos. The Bioconductor project fills this gap by providing a rapidly growing suite of well designed R packages for analyzing traditional and HT-Seq datasets.

Measure of structure in residual matrix

September 14, 2009

David Skillicorn’s book Understanding Complex Datasets presents a method for picking k: the number of meaningful components from a principal components analysis / SVD.  I tried implementing the algorithm for a simulated dataset.  I generate n observations from k components of differing size in p dimensions, then add noise.

The scree plot is a plot of the variance explained by each next principal component / reduced rank approximation of the original data.  This is by definition a decreasing plot, and it is suggested to look for a flattening of the curve to indicate the point at which the remaining components are noise.

The method mentioned in Skillicorn is to look at the residual matrix after subtracting away a reduced rank approximation of the data.  This matrix either still has some structure from remaining components or is just noise.  If it is just noise, then swapping the signs of the entries should not change the 2-norm (the largest singular value) very much (I swapped the signs 10 times and took the average of all the 2-norms).  The proposed measure of residual structure is the 2-norm of the residual matrix minus the 2-norm of the altered matrix over the Frobenius norm of the residual matrix.  I then take the log of this, so the minimum is easier to see in a plot.

Here is the R code.

Here’s an easy one:

svd.structure2

Here’s a plot of one instance, where the true number of components is 15, but with lots of noise added which overpowers the smaller components.  One nice property is that it is fairly convex, because the Frobenius norm of the residual matrix in the denominator goes to zero.

svd.structure

Kolmogorov-Smirnov test

September 10, 2009

The Kolmogorov-Smirnov test has an interesting derivation: to test if a sample of independent observations comes from a certain distribution, you line up the empirical cumulative distribution function with the theoretical CDF and find the greatest vertical distance between the two. If the sample comes from the specified distribution, then the max distance times the square root of the sample size has the same distribution as the maximum of the absolute value of the Brownian bridge, the Kolmogorov distribution. The Brownian bridge looks like Brownian motion/Weiner process except it is tethered at two points.  The test makes intuitive sense: under the null, the empirical and theoretical CDF’s are tethered at 0 and 1, and have some random variation in between.

If test statistic (distance * sqrt(n)) is above a certain quantile of the Kolmogorov distribution, then you reject the null hypothesis, and assume the sample is from a different distribution.  There is a simple alternative KS-test to see if two observed samples are from the same distribution (like the Mann-Whitney-Wilcoxon test).

# get 10 random normal observations
x <- rnorm(10)
plot(ecdf(x),verticals=TRUE,do.points=FALSE,col.hor="red",col.vert="red",xlab="",xlim=c(-3,3),lwd=2,main="empirical cdf")
lines(-30:30/10,pnorm(-30:30/10))

ecdf1

ks.test(x,"pnorm",exact=FALSE)
test <- ks.test(x,"pnorm",exact=FALSE)
D <- test$statistic

One-sample Kolmogorov-Smirnov test
D = 0.2553, p-value = 0.5322
alternative hypothesis: two-sided

# take a look at the Brownian bridge
library(sde)
plot(BBridge(N=5000))
abline(h=0,lty=2)

ecdf2

# try to approximate the Kolmogorov distribution
# with 1000 Brownian bridges
K <- replicate(1000,max(abs(BBridge(N=10000))))
plot(ecdf(K),verticals=TRUE,do.points=FALSE,main="distribution of K")
f <- function(x,i) { exp(-(2*i-1)^2*pi^2/(8*x^2)) }
kolm <- function(x) { sqrt(2*pi)/x*(f(x,1)+f(x,2)+f(x,3)+f(x,4)) }
lines(1:200/100,kolm(1:200/100),col="red")

ecdf3
# monte carlo approximation
1 - sum(K < D*sqrt(10))/1000
0.531
# numerical approximation
1 - kolm(D*sqrt(10))
0.5321873

Regulatory networks

August 27, 2009

I’ve been reading a few papers on computational methods for finding regulatory networks from gene expression data. From “Module networks revisited”:

Following Hartwell et al. (1999) a ‘module’ is to be viewed as a discrete entity composed of many types of molecules and whose function is separable from that of other modules. Understanding the general principles that determine the structure and function of modules and the parts they are composed of can be considered one of the main problems of contemporary systems biology (Hartwell et al., 1999). The module network method of Segal et al. (2003) addresses this problem using gene expression data as its input. It has yielded novel biological insights in a number of complex eukaryotic systems (…) and has been the source of inspiration for numerous computational approaches to network inference as evidenced by its high number of citations.

The JMLR paper applies the module/regulator framework to stocks as well as genes.  The Module Networks algorithm is an EM algorithm that iterates between 1) building regression trees using known regulator genes to predict the expression of a module of other genes and 2) reassigning genes to modules which have a better likelihood of explaining their variance.  There are many subtleties to the implementation including preventing cyclical graphs between regulators and modules, merging and splitting of modules, and smaller updates which refit the distributions at the leaves of the regression tree without rebuilding the tree.

ng1165-F1

Papers:

  • Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data (2003)
    http://www.nature.com/ng/journal/v34/n2/abs/ng1165.html
  • Learning Module Networks (2005)
    http://jmlr.csail.mit.edu/papers/v6/segal05a.html
  • Computational methods for discovering gene networks from expression data (2009)
    http://bib.oxfordjournals.org/cgi/content/full/10/4/408
  • Module networks revisited: computational assessment and prioritization of model predictions (2009)
    http://bioinformatics.oxfordjournals.org/cgi/content/abstract/25/4/490?ijkey=86c3db23239abe8fe69fe4650e8da36738ae72de&keytype2=tf_ipsecsha

Multiple lattice plots

August 11, 2009

Lattice is a package for R that makes Trellis graphics (originally developed for S by William S. Cleveland and colleagues at Bell Labs). Quick-R has an overview and examples of the different lattice plotting functions.  One quirk of the software is that lattice wants all data presented in the same format.  Quick-R: “~x|A means display numeric variable x for each level of factor A. y~x | A*B means display the relationship between numeric variables y and x separately for every combination of factor A and B levels.”

Lattice makes good looking plots, but you can’t use standard graphics parameters like mfrow=c(x,y) to show multiple plots at once. My lab mate Owen showed me a trick to get lattice to draw multiple plots for a number of numeric variables using melt() from the reshape package.  melt() expands a data frame into variable/value columns for one or more id columns.  Here’s an example of a data frame d, with 10 rows of numerical variables a,b,c and a condition factor equal to 1 or 2.

library(lattice)

library(reshape)

d <- data.frame(a=rnorm(10), b=rnorm(10), c=rnorm(10), cond=factor(sample(1:2,10,replace=TRUE)))

stripplot(value~cond|variable,melt(d, id=”cond”),scales=list(relation=”free”))

Setting the scales parameter allows the panels to have different scales for their axes; setting the parameter as.table=TRUE would have plotted from top left to bottom right.

lattice_example

You can also get the same plots as above using featurePlot in the caret package: featurePlot(x, y).

To get density plots of the predictor variables a,b,c:

densityplot(~value|variable,melt(d),scales=list(relation=”free”))

densityexample

Multiple testing and false discovery rate

May 21, 2009

Here is a quick explanation of False Discovery Rates from Brad Efron’s Large-scale simultaneous hypothesis testing: The choice of a null hypothesis:

We begin with a simple Bayes model. Suppose that the N z-values fall into two classes, “Uninteresting” or “Interesting”, corresponding to whether or not z_i is generated according to the null hypothesis, with prior probabilities p0 and p1 = 1 − p0 , for the classes; and that z_i has density either f_0(z) or f_1(z) depending on its class,

p0 = Prob{Uninteresting}, f_0(z) density if Uninteresting (Null)

p1 = Prob{Interesting}, f_1(z) density if Interesting (Non-Null) .

The smooth curve in Figure 1 estimates the mixture density f(z),

f(z) = p0 * f_0(z) + p1 * f_1(z) .

According to Bayes theorem the a posteriori probability of being in the Uninteresting class given z is

Prob{Uninteresting|z} = p0 * f_0(z)/f(z) .

Here we define the local false discovery rate to be

fdr(z) ≡ f_0(z)/f_(z) ,

ignoring the factor p0, so fdr(z) is an upper bound on Prob{Uninteresting|z}. In fact p0 can be roughly estimated, but we are assuming that p0 is near 1, say p0 ≥ 0.90, so fdr(z) is not a flagrant overestimator.

Dirichlet

May 7, 2009

Wikipedia articles on statistics are great.  I didn’t know the Dirichlet had a “balls in an urn” explanation.  But I’m not too surprised as everything can be explained by balls in urns.

http://en.wikipedia.org/wiki/Dirichlet_distribution

Pólya urn

Consider an urn containing balls of K different colors. Initially, the urn contains α1 balls of color 1, α2 balls of color 2, and so on. Now perform N draws from the urn, where after each draw, the ball is placed back into the urn with another ball of the same color. In the limit as N approaches infinity, the proportions of different colored balls in the urn will be distributed as Dir1,...,αk ).

Note that each draw from the urn modifies the probability of drawing a ball of any one color from the urn in the future. This modification diminishes with the number of draws, since the relative effect of adding a new ball to the urn diminishes as the urn accumulates increasing numbers of balls. This “diminishing returns” effect can also help explain how large α values yield Dirichlet distributions with most of the probability mass concentrated around a single point on the simplex.