I find myself always rewriting this code, so instead I will post it.

It is code to draw a histogram as a thick outline rather than filled bars, which is useful for overlaying two empirical distributions, or, for example, if you want to switch the x axis to a log scale.

x <- rnorm(1000)
h <- hist(x, breaks=20, plot=FALSE)
# duplicate each break, and pad the counts with zeros,
# so the outline starts and ends on the x axis
brks <- rep(h$breaks, each=2)
counts <- c(0, rep(h$counts, each=2), 0)
plot(brks, counts, type="l", lwd=3, xlab="x", ylab="counts")
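As a sketch of the overlay use case mentioned above (the two samples here are made up for illustration):

```r
# draw one histogram outline; the trick is the same as above,
# wrapped in a function so it can be called repeatedly
outline <- function(x, ...) {
  h <- hist(x, breaks=20, plot=FALSE)
  lines(rep(h$breaks, each=2), c(0, rep(h$counts, each=2), 0), ...)
}

# empty plot, then overlay two empirical distributions
plot(NULL, xlim=c(-4, 6), ylim=c(0, 220), xlab="x", ylab="counts")
outline(rnorm(1000), lwd=3, col="blue")
outline(rnorm(1000, mean=2), lwd=3, col="red")
```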


Area vs length

I was listening to the TAL/ProPublica report on the hedge fund Magnetar and went to their website, which shows bubbles sized by the CDOs Magnetar invested in versus the market total of similar CDOs.

Long ago, Cleveland and McGill showed that people are better at comparing lengths than at comparing areas, but you still see bubbles everywhere (perhaps to save space).  Below you can compare two presentations of the same data.

Here is one row from the original chart:

And the same data as a stacked barchart:

Also, it seems the black circle represents the total size of the CDO which Magnetar had a part in creating, not the actual amount that Magnetar invested, yet the legend shows a black circle with just the word ‘Magnetar’ printed next to it.
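To see the Cleveland and McGill point concretely, here is a rough sketch of the same numbers encoded both ways; the dollar values are hypothetical, not ProPublica's data:

```r
# hypothetical deal sizes ($bn), purely for illustration
magnetar <- 1.5
market <- 9

par(mfrow=c(1, 2))

# area encoding: radius proportional to sqrt(value),
# so the circle's area is proportional to the value
symbols(c(1, 2), c(1, 1), circles=sqrt(c(magnetar, market)), inches=0.5,
        xlim=c(0, 3), xlab="", ylab="", main="area")

# length encoding: a stacked horizontal bar
barplot(matrix(c(magnetar, market - magnetar)), horiz=TRUE, main="length")
```

Judging which circle is how many times larger is much harder than reading the same ratio off the bar.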

Cancer genomes

There was a BBC article recently which highlighted lung cancer and melanoma sequencing experiments by teams led by the Wellcome Trust Sanger Institute. The summaries at Sanger are good introductions on how mutations accumulate in tumor cells.  They include a graphic of the two cancer genomes (lung at top, melanoma on bottom) and the different kinds of mutations involved:

From outside to inside: insertions or deletions (green); coding substitutions as numbers per 10 million (orange charts); coding substitutions (grey, purple, red and black marks); copy number (blue); deletions (red); rearrangements (green or purple)

“Nearly ten years on, we are still reaping the benefit from the first human genome sequence and we have much still to do to get to grips with these new disrupted landscapes of cancer genomes,” explains Dr Peter Campbell from the Wellcome Trust Sanger Institute. “But the knowledge we extract over the next few years will have major implications for treatment. By identifying all the cancer genes we will be able to develop new drugs that target the specific mutated genes and work out which patients will benefit from these novel treatments.”

Measure of structure in residual matrix

David Skillicorn’s book Understanding Complex Datasets presents a method for picking k, the number of meaningful components, from a principal components analysis / SVD.  I tried implementing the algorithm on a simulated dataset: I generate n observations from k components of differing size in p dimensions, then add noise.

The scree plot shows the variance explained by each successive principal component / reduced-rank approximation of the original data.  It is by definition a decreasing curve, and the usual advice is to look for a flattening of the curve to indicate the point at which the remaining components are noise.
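A minimal scree plot on simulated data (a toy example of my own, not from the book):

```r
# 100 observations in 20 dimensions of pure noise
X <- matrix(rnorm(100 * 20), 100, 20)

# squared singular values are proportional to variance explained
v <- svd(X)$d^2
plot(v / sum(v), type="b", xlab="component", ylab="proportion of variance")
```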

The method mentioned in Skillicorn is to look at the residual matrix after subtracting away a reduced-rank approximation of the data.  This matrix either still has some structure from the remaining components or is just noise.  If it is just noise, then randomly swapping the signs of its entries should not change the 2-norm (the largest singular value) very much; I swapped the signs 10 times and averaged the resulting 2-norms.  The proposed measure of residual structure is the 2-norm of the residual matrix minus the 2-norm of the sign-swapped matrix, divided by the Frobenius norm of the residual matrix.  I then take the log of this, so the minimum is easier to see in a plot.

Here is the R code.
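A sketch of the method as described above; the simulation settings (n, p, k, component strengths) are my own choices, not necessarily the original's:

```r
set.seed(1)
n <- 100; p <- 50; k <- 15
# k components of decreasing strength, plus noise
X <- matrix(rnorm(n * k), n, k) %*% diag(seq(10, 1, length.out = k)) %*%
  matrix(rnorm(k * p), k, p) + 5 * matrix(rnorm(n * p), n, p)

norm2 <- function(M) svd(M, nu = 0, nv = 0)$d[1]  # largest singular value

s <- svd(X)
measure <- sapply(1:30, function(r) {
  # residual after removing the rank-r approximation
  R <- X - s$u[, 1:r, drop = FALSE] %*% (s$d[1:r] * t(s$v[, 1:r, drop = FALSE]))
  # average 2-norm over 10 random sign flips of the entries
  flipped <- mean(replicate(10, norm2(R * sample(c(-1, 1), n * p, replace = TRUE))))
  # (for near-noise residuals the difference can dip negative, giving NaN under the log)
  log((norm2(R) - flipped) / norm(R, "F"))
})

plot(measure, type = "b", xlab = "rank", ylab = "log structure measure")
which.min(measure)  # estimated number of components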

Here’s an easy one:


Here’s a plot of one instance where the true number of components is 15, but lots of added noise overpowers the smaller components.  One nice property is that the curve is fairly convex: the Frobenius norm of the residual matrix in the denominator goes to zero as the rank grows, which pushes the measure back up past the true number of components.


Kolmogorov-Smirnov test

The Kolmogorov-Smirnov test has an interesting derivation: to test whether a sample of independent observations comes from a certain distribution, you line up the empirical cumulative distribution function with the theoretical CDF and find the greatest vertical distance between the two. If the sample comes from the specified distribution, then this maximum distance times the square root of the sample size has the same distribution as the maximum of the absolute value of the Brownian bridge, called the Kolmogorov distribution. The Brownian bridge looks like Brownian motion (a Wiener process) except that it is tethered at two points.  The test makes intuitive sense: under the null, the empirical and theoretical CDFs are tethered at 0 and 1, and have some random variation in between.

If the test statistic (distance * sqrt(n)) is above a chosen quantile of the Kolmogorov distribution, then you reject the null hypothesis and conclude the sample is from a different distribution.  There is also a simple two-sample version of the KS test to check whether two observed samples come from the same distribution (similar in spirit to the Mann-Whitney-Wilcoxon test).
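A quick illustration of the two-sample version on simulated data (the samples here are my own example):

```r
# two samples from slightly different distributions
a <- rnorm(100)
b <- rnorm(100, mean = 0.5)

# D is the largest vertical gap between the two empirical CDFs
ks.test(a, b)
```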

# get 10 random normal observations
x <- rnorm(10)
plot(ecdf(x), verticals=TRUE, do.points=FALSE, col.hor="red", col.vert="red",
     xlab="", xlim=c(-3,3), lwd=2, main="empirical cdf")


test <- ks.test(x,"pnorm",exact=FALSE)
D <- test$statistic

One-sample Kolmogorov-Smirnov test
D = 0.2553, p-value = 0.5322
alternative hypothesis: two-sided

# take a look at the Brownian bridge
library(sde)  # provides BBridge()
plot(BBridge(N=1000), main="Brownian bridge")

# try to approximate the Kolmogorov distribution
# with 1000 Brownian bridges
K <- replicate(1000, max(abs(BBridge(N=10000))))
plot(ecdf(K), verticals=TRUE, do.points=FALSE, main="distribution of K")
# CDF of the Kolmogorov distribution,
# using the first four terms of its series expansion
f <- function(x,i) { exp(-(2*i-1)^2*pi^2/(8*x^2)) }
kolm <- function(x) { sqrt(2*pi)/x*(f(x,1)+f(x,2)+f(x,3)+f(x,4)) }

# monte carlo approximation
1 - sum(K < D*sqrt(10))/1000
# numerical approximation
1 - kolm(D*sqrt(10))

Multiple lattice plots

Lattice is a package for R that makes Trellis graphics (originally developed for S by William S. Cleveland and colleagues at Bell Labs). Quick-R has an overview and examples of the different lattice plotting functions.  One quirk of the software is that lattice wants all data presented in the same format.  Quick-R: “~x|A means display numeric variable x for each level of factor A. y~x | A*B means display the relationship between numeric variables y and x separately for every combination of factor A and B levels.”

Lattice makes good looking plots, but you can’t use standard graphics parameters like mfrow=c(x,y) to show multiple plots at once. My lab mate Owen showed me a trick to get lattice to draw multiple plots for a number of numeric variables using melt() from the reshape package.  melt() expands a data frame into variable/value columns for one or more id columns.  Here’s an example of a data frame d, with 10 rows of numerical variables a,b,c and a condition factor equal to 1 or 2.



library(lattice)
library(reshape)

d <- data.frame(a=rnorm(10), b=rnorm(10), c=rnorm(10),
                cond=factor(sample(1:2, 10, replace=TRUE)))

stripplot(value~cond|variable, melt(d, id="cond"), scales=list(relation="free"))

Setting the scales parameter allows the panels to have different scales for their axes; setting the parameter as.table=TRUE would have plotted from top left to bottom right.


You can also get the same plots as above using featurePlot in the caret package: featurePlot(x = d[, c("a","b","c")], y = d$cond).

To get density plots of the predictor variables a,b,c:
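One way, reusing the same melt() trick with lattice's densityplot() (caret's featurePlot with plot="density" should give something similar); the data frame d is rebuilt here so the snippet stands alone:

```r
library(lattice)
library(reshape)

# same shape of data frame as in the stripplot example above
d <- data.frame(a=rnorm(10), b=rnorm(10), c=rnorm(10),
                cond=factor(sample(1:2, 10, replace=TRUE)))

# one density panel per variable, grouped by condition
densityplot(~value | variable, melt(d, id="cond"), groups=cond,
            scales=list(relation="free"), auto.key=TRUE)
```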