The biomedical animations from the Walter + Eliza Hall Institute are available at Youtube. Except for some unnatural electronic sound effects and glowing auras, they seem more detailed, less cartoonish than other videos I’ve seen.
Forensic statistics
January 11, 2010In The Metaphysical Club, there’s a story about one of the first high visibility cases in which statistics were employed to make a point about the unlikeliness of some event happening, the Howland will forgery trial. From Wikipedia:
In the ensuing case of Robinson v. Mandell, Charles Sanders Peirce testified that he had made pairwise comparisons of 42 examples of Howland’s signature, overlaying them and counting the number of downstrokes that overlapped. Each signature featured 30 downstrokes and he concluded that, on average, 6 of the 30 overlapped, 1 in 5. Benjamin Peirce showed that the number of overlapping downstrokes between two signatures also closely followed the binomial distribution, the expected distribution if each downstroke was an independent event. When the admittedly genuine signature on the first page of the contested will was compared with that on the second, all 30 downstrokes coincided, suggesting that the second signature was a tracing of the first.
Benjamin Peirce, Charles’ father, then took the stand and asserted that, given the independence of each downstroke, the probability that all 30 downstrokes should coincide in two genuine signatures was 1/(2.666 × 1021) . He went on to observe:
“So vast improbability is practically an impossibility. Such evanescent shadows of probability cannot belong to actual life. They are unimaginably less than those least things which the law cares not for. … The coincidence which has occurred here must have had its origin in an intention to produce it. It is utterly repugnant to sound reason to attribute this coincidence to any cause but design.”
Stats
January 4, 2010I just finished a Statistics M.S. at Stanford. I’m actually surprised that I could have gone through high school and college without having taken statistics. I would place statistics far before calculus in terms of importance for daily life.
Some of the good books I’ve come across are:
- Mathematical Statistics and Data Analysis, by John Rice. This book is probably the best all purpose statistics reference, covering from basic probability theory to limit theorems, parameter estimation, tests and some Bayesian inference.
- The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. This is a good book for doing more advanced computational statistics. It is available as a pdf here: http://www-stat.stanford.edu/~tibs/ElemStatLearn/.
- Monte Carlo Strategies for Scientific Computing, by Jun Liu.
- Pattern Recognition and Machine Learning, by Christopher M. Bishop.
Classes with lecture notes or readings online:
- STAT 345, Computational Algorithms for Statistical Genetics, taught by Hua Tang and Nancy Zhang
http://www-stat.stanford.edu/~nzhang/Gen245/ - CS 229, Machine Learning, taught by Andrew Ng
http://www.stanford.edu/class/cs229/ - BMI 214, Representations and Algorithms for Computational Molecular Biology, taught by Russ Altman
http://www-helix.stanford.edu/bmi214/ - STAT 315c, Learning from Matrix Valued Data, taught by Art Owen
http://www-stat.stanford.edu/~owen/courses/315c/
Cancer genomes
December 23, 2009There was a BBC article recently which highlighted lung cancer and melanoma sequencing experiments by teams led by the Wellcome Trust Sanger Institute. The summaries at Sanger are good introductions on how mutations accumulate in tumor cells. They include a graphic of the two cancer genomes (lung at top, melanoma on bottom) and the different kinds of mutations involved:
From outside to inside: insertions or deletions (green); coding substitutions as numbers per 10 million (orange charts); coding substitutions (grey, purple, red and black marks); copy number (blue); deletions (red); rearrangements (green or purple)
“Nearly ten years on, we are still reaping the benefit from the first human genome sequence and we have much still to do to get to grips with these new disrupted landscapes of cancer genomes,” explains Dr Peter Campbell from the Wellcome Trust Sanger Institute. ”But the knowledge we extract over the next few years will have major implications for treatment. By identifying all the cancer genes we will be able to develop new drugs that target the specific mutated genes and work out which patients will benefit from these novel treatments.”
HT Sequencing
October 6, 2009Here’s a good reference manual for using R and Bioconductor to work with high throughput sequencing data. The slides are also a good introduction to the sequencing methods.
High throughput sequencing (HT-Seq or HTS), also known as next generation sequencing (NGS), presents a wide spectrum of opportunities for genome research. Unfortunately, many existing bioinformatic tools do not scale well to large datasets consisting of tens of millions of sequences generated by technologies like Illumina/Solexa, Roche/454, ABI/SOLiD and Helicos. The Bioconductor project fills this gap by providing a rapidly growing suite of well designed R packages for analyzing traditional and HT-Seq datasets.
Measure of structure in residual matrix
September 14, 2009David Skillicorn’s book Understanding Complex Datasets presents a method for picking k: the number of meaningful components from a principal components analysis / SVD. I tried implementing the algorithm for a simulated dataset. I generate n observations from k components of differing size in p dimensions, then add noise.
The scree plot is a plot of the variance explained by each next principal component / reduced rank approximation of the original data. This is by definition a decreasing plot, and it is suggested to look for a flattening of the curve to indicate the point at which the remaining components are noise.
The method mentioned in Skillicorn is to look at the residual matrix after subtracting away a reduced rank approximation of the data. This matrix either still has some structure from remaining components or is just noise. If it is just noise, then swapping the signs of the entries should not change the 2-norm (the largest singular value) very much (I swapped the signs 10 times and took the average of all the 2-norms). The proposed measure of residual structure is the 2-norm of the residual matrix minus the 2-norm of the altered matrix over the Frobenius norm of the residual matrix. I then take the log of this, so the minimum is easier to see in a plot.
Here’s an easy one:

Here’s a plot of one instance, where the true number of components is 15, but with lots of noise added which overpowers the smaller components. One nice property is that it is fairly convex, because the Frobenius norm of the residual matrix in the denominator goes to zero.

Kolmogorov-Smirnov test
September 10, 2009The Kolmogorov-Smirnov test has an interesting derivation: to test if a sample of independent observations comes from a certain distribution, you line up the empirical cumulative distribution function with the theoretical CDF and find the greatest vertical distance between the two. If the sample comes from the specified distribution, then the max distance times the square root of the sample size has the same distribution as the maximum of the absolute value of the Brownian bridge, the Kolmogorov distribution. The Brownian bridge looks like Brownian motion/Weiner process except it is tethered at two points. The test makes intuitive sense: under the null, the empirical and theoretical CDF’s are tethered at 0 and 1, and have some random variation in between.
If test statistic (distance * sqrt(n)) is above a certain quantile of the Kolmogorov distribution, then you reject the null hypothesis, and assume the sample is from a different distribution. There is a simple alternative KS-test to see if two observed samples are from the same distribution (like the Mann-Whitney-Wilcoxon test).
# get 10 random normal observations
x <- rnorm(10)
plot(ecdf(x),verticals=TRUE,do.points=FALSE,col.hor="red",col.vert="red",xlab="",xlim=c(-3,3),lwd=2,main="empirical cdf")
lines(-30:30/10,pnorm(-30:30/10))

ks.test(x,"pnorm",exact=FALSE)
test <- ks.test(x,"pnorm",exact=FALSE)
D <- test$statistic
One-sample Kolmogorov-Smirnov test
D = 0.2553, p-value = 0.5322
alternative hypothesis: two-sided
# take a look at the Brownian bridge
library(sde)
plot(BBridge(N=5000))
abline(h=0,lty=2)

# try to approximate the Kolmogorov distribution
# with 1000 Brownian bridges
K <- replicate(1000,max(abs(BBridge(N=10000))))
plot(ecdf(K),verticals=TRUE,do.points=FALSE,main="distribution of K")
f <- function(x,i) { exp(-(2*i-1)^2*pi^2/(8*x^2)) }
kolm <- function(x) { sqrt(2*pi)/x*(f(x,1)+f(x,2)+f(x,3)+f(x,4)) }
lines(1:200/100,kolm(1:200/100),col="red")

# monte carlo approximation
1 - sum(K < D*sqrt(10))/1000
0.531
# numerical approximation
1 - kolm(D*sqrt(10))
0.5321873

