Mike Love’s blog

Hist

Posted in statistics, visualization by mikelove on January 4, 2011

I find myself always rewriting this code, so instead I will post it.

It is code for a solid histogram with a thick outline, which is useful for overlaying two empirical distributions or, for example, for switching the x axis to a log scale.

x = rnorm(1000)
# compute the histogram without drawing it
h = hist(x,breaks=20,plot=FALSE)
# duplicate each break and count so the bar outline can be traced as one path
brks = rep(h$breaks,each=2)
counts = c(0,rep(h$counts,each=2),0)
# empty plot, then a thick outline and a translucent fill
plot(brks,counts,type="n",xlab="Value",ylab="Frequency",main="Histogram")
lines(brks,counts,lwd=2)
polygon(brks,counts,col=rgb(1,1,0,.5))
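
For the overlay use mentioned above, a second sample can be added the same way. Here is a minimal sketch building on the code above; the second sample y, its shifted mean, and the blue fill are my additions for illustration, not part of the original post:

# second (hypothetical) sample, prepared exactly like x above
y = rnorm(1000, mean=1)
h2 = hist(y, breaks=20, plot=FALSE)
brks2 = rep(h2$breaks, each=2)
counts2 = c(0, rep(h2$counts, each=2), 0)
# redraw the axes wide enough for both histograms, then overlay
plot(range(brks, brks2), range(counts, counts2), type="n",
     xlab="Value", ylab="Frequency", main="Two histograms")
lines(brks, counts, lwd=2)
polygon(brks, counts, col=rgb(1,1,0,.5))
lines(brks2, counts2, lwd=2)
polygon(brks2, counts2, col=rgb(0,0,1,.5))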

Area vs length

Posted in visualization by mikelove on April 22, 2010

I was listening to the This American Life/ProPublica report on the hedge fund Magnetar and went to their website, which shows bubbles sized by the CDOs Magnetar invested in versus the market total of similar CDOs.

Long ago, Cleveland and McGill showed that people are better at comparing lengths than at comparing areas, but bubble charts still turn up (perhaps to save space).  Below you can compare two presentations of the same data.

Here is one row from the original chart:

And the same data as a stacked barchart:

Also, it seems the black circle represents the total size of the CDO that Magnetar had a part in creating, not the actual amount Magnetar invested, yet the legend shows a black circle with just the word ‘Magnetar’ printed next to it.
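
As a rough illustration of the length-versus-area point (toy numbers, not the actual Magnetar figures), the same two values can be drawn as bubbles scaled by area and as a single stacked bar:

# toy values only, for comparing the two encodings
vals <- c(Magnetar = 2, Market = 9)
par(mfrow = c(1, 2))
# area encoding: circle areas proportional to the values
symbols(c(1, 2), c(1, 1), circles = sqrt(vals / pi), inches = 0.5,
        xlim = c(0, 3), xlab = "", ylab = "", main = "Bubbles (area)")
# length encoding: one stacked bar
barplot(as.matrix(vals), horiz = TRUE, main = "Stacked bar (length)")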

Biomedical animations

Posted in visualization by mikelove on January 25, 2010

The biomedical animations from the Walter + Eliza Hall Institute are available on YouTube.  Except for some unnatural electronic sound effects and glowing auras, they seem more detailed and less cartoonish than other videos I’ve seen.

Cancer genomes

Posted in genetics, statistics, visualization by mikelove on December 23, 2009

There was a BBC article recently highlighting lung cancer and melanoma sequencing experiments by teams led by the Wellcome Trust Sanger Institute. The summaries at Sanger are good introductions to how mutations accumulate in tumor cells.  They include a graphic of the two cancer genomes (lung cancer on top, melanoma on the bottom) and the different kinds of mutations involved:

From outside to inside: insertions or deletions (green); coding substitutions as numbers per 10 million (orange charts); coding substitutions (grey, purple, red and black marks); copy number (blue); deletions (red); rearrangements (green or purple)

“Nearly ten years on, we are still reaping the benefit from the first human genome sequence and we have much still to do to get to grips with these new disrupted landscapes of cancer genomes,” explains Dr Peter Campbell from the Wellcome Trust Sanger Institute. “But the knowledge we extract over the next few years will have major implications for treatment. By identifying all the cancer genes we will be able to develop new drugs that target the specific mutated genes and work out which patients will benefit from these novel treatments.”

Measure of structure in residual matrix

Posted in math, statistics, visualization by mikelove on September 14, 2009

David Skillicorn’s book Understanding Complex Datasets presents a method for picking k, the number of meaningful components from a principal components analysis / SVD.  I tried implementing the algorithm on a simulated dataset: I generate n observations from k components of differing size in p dimensions, then add noise.

The scree plot shows the variance explained by each successive principal component / reduced rank approximation of the original data.  It is by definition decreasing, and the usual advice is to look for a flattening of the curve, indicating the point at which the remaining components are just noise.
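
As a small, self-contained sketch of a scree plot (the matrix X here is just simulated noise for illustration and is not part of the original post):

# proportion of variance explained by each singular vector of a centered matrix
X <- matrix(rnorm(100 * 20), nrow = 100)
s <- svd(scale(X, center = TRUE, scale = FALSE))
ve <- s$d^2 / sum(s$d^2)
plot(ve, type = "b", xlab = "Component",
     ylab = "Proportion of variance explained", main = "Scree plot")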

The method mentioned in Skillicorn is to look at the residual matrix after subtracting a reduced rank approximation from the data.  This matrix either still has some structure from the remaining components or is just noise.  If it is just noise, then randomly swapping the signs of its entries should not change the 2-norm (the largest singular value) very much (I swapped the signs 10 times and took the average of the resulting 2-norms).  The proposed measure of residual structure is the 2-norm of the residual matrix minus the average 2-norm of the sign-swapped matrices, divided by the Frobenius norm of the residual matrix.  I then take the log of this, so the minimum is easier to see in a plot.

Here is the R code.
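
The code itself is not reproduced here; what follows is my own sketch of the measure as described above (the simulation setup, function names, and parameters are mine):

# largest singular value (2-norm) and Frobenius norm
norm2 <- function(M) svd(M, nu = 0, nv = 0)$d[1]
frob  <- function(M) sqrt(sum(M^2))

# log of (2-norm of residual minus average 2-norm after sign swaps) / Frobenius norm
residual.structure <- function(A, k, nswaps = 10) {
  s  <- svd(A)
  Ak <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k, k) %*% t(s$v[, 1:k, drop = FALSE])
  R  <- A - Ak
  swapped <- replicate(nswaps, {
    signs <- matrix(sample(c(-1, 1), length(R), replace = TRUE), nrow = nrow(R))
    norm2(R * signs)
  })
  # can be NaN for large k, when the residual is pure noise and the difference goes negative
  log((norm2(R) - mean(swapped)) / frob(R))
}

# simulated data roughly as described: components of differing size plus noise
n <- 200; p <- 30; true.k <- 5
scores   <- matrix(rnorm(n * true.k), n, true.k)
loadings <- matrix(rnorm(true.k * p), true.k, p)
A <- scores %*% diag(seq(5, 1, length.out = true.k)) %*% loadings +
  matrix(rnorm(n * p), n, p)
plot(sapply(1:15, function(k) residual.structure(A, k)), type = "b",
     xlab = "k", ylab = "log structure measure")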

Here’s an easy one:

[Plot: the structure measure against k for an easy case]

Here’s a plot of one instance where the true number of components is 15, but with lots of noise added, which overpowers the smaller components.  One nice property is that the curve is fairly convex, because the Frobenius norm of the residual matrix in the denominator goes to zero.

[Plot: the structure measure against k, with true k = 15 and heavy noise]

Kolmogorov-Smirnov test

Posted in statistics, visualization by mikelove on September 10, 2009

The Kolmogorov-Smirnov test has an interesting derivation: to test whether a sample of independent observations comes from a certain distribution, you line up the empirical cumulative distribution function with the theoretical CDF and find the greatest vertical distance between the two. If the sample comes from the specified distribution, then the maximum distance times the square root of the sample size has (asymptotically) the same distribution as the maximum of the absolute value of the Brownian bridge, the Kolmogorov distribution. The Brownian bridge looks like Brownian motion/the Wiener process except that it is tethered at two points.  The test makes intuitive sense: under the null, the empirical and theoretical CDFs are tethered at 0 and 1, with some random variation in between.

If the test statistic (distance * sqrt(n)) is above a certain quantile of the Kolmogorov distribution, then you reject the null hypothesis and conclude the sample comes from a different distribution.  There is also a simple two-sample version of the KS test for checking whether two observed samples come from the same distribution (similar in spirit to the Mann-Whitney-Wilcoxon test); an example is at the end of this post.

# get 10 random normal observations
x <- rnorm(10)
plot(ecdf(x),verticals=TRUE,do.points=FALSE,col.hor="red",col.vert="red",xlab="",xlim=c(-3,3),lwd=2,main="empirical cdf")
# overlay the theoretical normal CDF
lines(-30:30/10,pnorm(-30:30/10))

[Plot: empirical CDF of x with the theoretical normal CDF overlaid]

test <- ks.test(x,"pnorm",exact=FALSE)
test
D <- test$statistic

One-sample Kolmogorov-Smirnov test
D = 0.2553, p-value = 0.5322
alternative hypothesis: two-sided
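
As a check, the statistic can also be computed directly from its definition as the largest vertical gap between the ECDF and pnorm (this snippet is mine, not from the original post):

# the ECDF jumps at each sorted observation, so both sides of each step are checked
n <- length(x)
Fx <- pnorm(sort(x))
D.manual <- max(pmax((1:n)/n - Fx, Fx - (0:(n-1))/n))
all.equal(unname(D), D.manual)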

# take a look at the Brownian bridge
library(sde)
plot(BBridge(N=5000))
abline(h=0,lty=2)

[Plot: a sample path of the Brownian bridge]

# try to approximate the Kolmogorov distribution
# with 1000 Brownian bridges
K <- replicate(1000,max(abs(BBridge(N=10000))))
plot(ecdf(K),verticals=TRUE,do.points=FALSE,main="distribution of K")
# first four terms of the series expansion of the Kolmogorov CDF
f <- function(x,i) { exp(-(2*i-1)^2*pi^2/(8*x^2)) }
kolm <- function(x) { sqrt(2*pi)/x*(f(x,1)+f(x,2)+f(x,3)+f(x,4)) }
lines(1:200/100,kolm(1:200/100),col="red")

[Plot: empirical distribution of K with the Kolmogorov CDF overlaid in red]

# monte carlo approximation
1 - sum(K < D*sqrt(10))/1000
0.531
# numerical approximation
1 - kolm(D*sqrt(10))
0.5321873
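
And the two-sample version mentioned above takes the second sample directly (y here is an assumed second sample, just for illustration):

# two-sample KS test: are x and y drawn from the same distribution?
y <- rnorm(10, mean=1)
ks.test(x, y)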

Multiple lattice plots

Posted in statistics, visualization by mikelove on August 11, 2009

Lattice is a package for R that makes Trellis graphics (originally developed for S by William S. Cleveland and colleagues at Bell Labs). Quick-R has an overview and examples of the different lattice plotting functions.  One quirk of the software is that lattice wants all data presented in the same formula-plus-data-frame format.  Quick-R: “~x|A means display numeric variable x for each level of factor A. y~x|A*B means display the relationship between numeric variables y and x separately for every combination of factor A and B levels.”
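
A small illustration of those formula conventions using built-in datasets (this example is mine, not from the post):

library(lattice)
# ~x|A : distribution of a numeric variable for each level of a factor
histogram(~ Sepal.Length | Species, data = iris)
# y~x|A*B : y against x for every combination of two factors
xyplot(mpg ~ wt | factor(cyl) * factor(am), data = mtcars)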

Lattice makes good looking plots, but you can’t use standard graphics parameters like par(mfrow=c(x,y)) to show multiple plots at once. My lab mate Owen showed me a trick to get lattice to draw multiple plots for a number of numeric variables using melt() from the reshape package.  melt() reshapes a data frame into long format, with variable/value columns alongside one or more id columns.  Here’s an example with a data frame d containing 10 rows of numeric variables a, b, c and a condition factor equal to 1 or 2.

library(lattice)

library(reshape)

d <- data.frame(a=rnorm(10), b=rnorm(10), c=rnorm(10), cond=factor(sample(1:2,10,replace=TRUE)))

stripplot(value~cond|variable,melt(d, id="cond"),scales=list(relation="free"))

Setting the scales parameter allows the panels to have different scales for their axes; setting the parameter as.table=TRUE would have plotted from top left to bottom right.

[Plot: stripplot panels of a, b, and c split by cond]

You can also get the same plots as above using featurePlot in the caret package, with the predictors as x and the grouping factor as y: featurePlot(x, y).
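
A sketch of that caret call for the data frame d above (I have not checked that every option matches the lattice version exactly):

library(caret)
# predictors as x, the grouping factor as y; extra arguments pass through to lattice
featurePlot(x = d[, c("a", "b", "c")], y = d$cond, plot = "strip",
            scales = list(relation = "free"))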

To get density plots of the predictor variables a,b,c:

densityplot(~value|variable,melt(d),scales=list(relation="free"))

[Plot: density panels of a, b, and c]
