Archive for the 'visualization' Category

Measure of structure in residual matrix

September 14, 2009

David Skillicorn’s book Understanding Complex Datasets presents a method for picking k: the number of meaningful components from a principal components analysis / SVD.  I tried implementing the algorithm for a simulated dataset.  I generate n observations from k components of differing size in p dimensions, then add noise.

The scree plot is a plot of the variance explained by each next principal component / reduced rank approximation of the original data.  This is by definition a decreasing plot, and it is suggested to look for a flattening of the curve to indicate the point at which the remaining components are noise.

The method mentioned in Skillicorn is to look at the residual matrix after subtracting away a reduced rank approximation of the data.  This matrix either still has some structure from remaining components or is just noise.  If it is just noise, then swapping the signs of the entries should not change the 2-norm (the largest singular value) very much (I swapped the signs 10 times and took the average of all the 2-norms).  The proposed measure of residual structure is the 2-norm of the residual matrix minus the 2-norm of the altered matrix over the Frobenius norm of the residual matrix.  I then take the log of this, so the minimum is easier to see in a plot.

Here is the R code.

Here’s an easy one:

svd.structure2

Here’s a plot of one instance, where the true number of components is 15, but with lots of noise added which overpowers the smaller components.  One nice property is that it is fairly convex, because the Frobenius norm of the residual matrix in the denominator goes to zero.

svd.structure

Kolmogorov-Smirnov test

September 10, 2009

The Kolmogorov-Smirnov test has an interesting derivation: to test if a sample of independent observations comes from a certain distribution, you line up the empirical cumulative distribution function with the theoretical CDF and find the greatest vertical distance between the two. If the sample comes from the specified distribution, then the max distance times the square root of the sample size has the same distribution as the maximum of the absolute value of the Brownian bridge, the Kolmogorov distribution. The Brownian bridge looks like Brownian motion/Weiner process except it is tethered at two points.  The test makes intuitive sense: under the null, the empirical and theoretical CDF’s are tethered at 0 and 1, and have some random variation in between.

If test statistic (distance * sqrt(n)) is above a certain quantile of the Kolmogorov distribution, then you reject the null hypothesis, and assume the sample is from a different distribution.  There is a simple alternative KS-test to see if two observed samples are from the same distribution (like the Mann-Whitney-Wilcoxon test).

# get 10 random normal observations
x <- rnorm(10)
plot(ecdf(x),verticals=TRUE,do.points=FALSE,col.hor="red",col.vert="red",xlab="",xlim=c(-3,3),lwd=2,main="empirical cdf")
lines(-30:30/10,pnorm(-30:30/10))

ecdf1

ks.test(x,"pnorm",exact=FALSE)
test <- ks.test(x,"pnorm",exact=FALSE)
D <- test$statistic

One-sample Kolmogorov-Smirnov test
D = 0.2553, p-value = 0.5322
alternative hypothesis: two-sided

# take a look at the Brownian bridge
library(sde)
plot(BBridge(N=5000))
abline(h=0,lty=2)

ecdf2

# try to approximate the Kolmogorov distribution
# with 1000 Brownian bridges
K <- replicate(1000,max(abs(BBridge(N=10000))))
plot(ecdf(K),verticals=TRUE,do.points=FALSE,main="distribution of K")
f <- function(x,i) { exp(-(2*i-1)^2*pi^2/(8*x^2)) }
kolm <- function(x) { sqrt(2*pi)/x*(f(x,1)+f(x,2)+f(x,3)+f(x,4)) }
lines(1:200/100,kolm(1:200/100),col="red")

ecdf3
# monte carlo approximation
1 - sum(K < D*sqrt(10))/1000
0.531
# numerical approximation
1 - kolm(D*sqrt(10))
0.5321873

Colors for scientific visualizations

September 2, 2009

An interesting article from IBM Research about color schemes for science visualizations.  The obvious recommendations are: if the zero-crossing (or crossing of any level) is important in the data, the color gradient should highlight this; low saturation colors can be better; and the rainbow color gradient is always the wrong choice.

The color article made me rethink one of the standard displays for microarray data: a heatmap with a green to red color gradient, green as low expression and red as high expression, rows as genes and columns as samples.

colors1

This is a problem for the ~5% people that are color blind, especially since red-green is one of the more common types of color blindness.  I’ve seen purple to yellow color gradients, but I’m trying cyan to yellow now, since it has comparable lightness on both ends:

colors2

R code:


cyan.yellow.cols <- rgb(c(rep(0,11),1:10/10),c(10:1/10,0,1:10/10),c(10:1/10,rep(0,11)),1)
green.red.cols <- rgb(c(rep(0,11),1:10/10),c(10:1/10,rep(0,11)),0,1)
exprs <- matrix(rnorm(100),ncol=10)
library(Heatplus)
heatmap_2(exprs,col=green.red.cols,scale="none",legend=1)
heatmap_2(exprs,col=cyan.yellow.cols,scale="none",legend=1)

The Obama slide

September 2, 2009

In a recent NYTimes op-ed, David Brooks argues that people are losing faith in the president:

The result is the Obama slide, the most important feature of the current moment. The number of Americans who trust President Obama to make the right decisions has fallen by roughly 17 percentage points. Obama’s job approval is down to about 50 percent. All presidents fall from their honeymoon highs, but in the history of polling, no newly elected American president has fallen this far this fast.

Anxiety is now pervasive. Trust in government rose when Obama took office. It has fallen back to historic lows. Fifty-nine percent of Americans now think the country is headed in the wrong direction.

This is just why I love Pollster and FiveThirtyEight: you can see how one person’s interpretation of a series of polls fits with the actual trend lines.  The upturn in “wrong track” responses is slight compared to the 80-90% of recent history:

polling

Other interesting polling charts: Nate Silver at FiveThirtyEight in a recent post uses historical data and hypothetical 2012 polls to argue that crossing the 50% barrier doesn’t doom a president in terms of midterm elections or re-election.

Multiple lattice plots

August 11, 2009

Lattice is a package for R that makes Trellis graphics (originally developed for S by William S. Cleveland and colleagues at Bell Labs). Quick-R has an overview and examples of the different lattice plotting functions.  One quirk of the software is that lattice wants all data presented in the same format.  Quick-R: “~x|A means display numeric variable x for each level of factor A. y~x | A*B means display the relationship between numeric variables y and x separately for every combination of factor A and B levels.”

Lattice makes good looking plots, but you can’t use standard graphics parameters like mfrow=c(x,y) to show multiple plots at once. My lab mate Owen showed me a trick to get lattice to draw multiple plots for a number of numeric variables using melt() from the reshape package.  melt() expands a data frame into variable/value columns for one or more id columns.  Here’s an example of a data frame d, with 10 rows of numerical variables a,b,c and a condition factor equal to 1 or 2.

library(lattice)

library(reshape)

d <- data.frame(a=rnorm(10), b=rnorm(10), c=rnorm(10), cond=factor(sample(1:2,10,replace=TRUE)))

stripplot(value~cond|variable,melt(d, id=”cond”),scales=list(relation=”free”))

Setting the scales parameter allows the panels to have different scales for their axes; setting the parameter as.table=TRUE would have plotted from top left to bottom right.

lattice_example

You can also get the same plots as above using featurePlot in the caret package: featurePlot(x, y).

To get density plots of the predictor variables a,b,c:

densityplot(~value|variable,melt(d),scales=list(relation=”free”))

densityexample

Centroids

April 5, 2009

I’ve been experimenting with the NumPy Python package recently, which has fast and intuitive operations on arrays. I tried implementing the “nearest shrunken centroid method” of Tibshirani, Hastie, Narasimhan and Chu (2002) from the Predictive Analysis for Microarrays R package. The shrunken centroids classifier is a method for dealing with large numbers of noisy features. It is similar in some respects to penalized regression, in winnowing down to a subset of useful features.

In the case of gene expression data, the algorithm calculates class centroids, then shrinks each gene of the class centroids towards the overall centroid by a certain threshold. This step helps identify the smallest subset of genes that still gives predictive accuracy (using cross-validation).  The link above has a good description and the original paper. Some graphs of the output using Matplotlib:

shrunken4

shrunken5

I posted the shrunken centroids python script on github and some sample data to run it on: (khan data, delete the .doc ending).

interactiveSVG

March 17, 2009

interactiveSVG is an R package I made for adding ECMAScript interactions to SVG plots.  Flash is the main choice for data visualizations on webpages, but I like the idea of open formats and I think it’s great that you can read the source code of an SVG file.

The package uses the RSVGTipsDevice package that I posted here earlier, and does some scraping and rearranging of the SVG file.  ECMAScript onclick events do the work of rearranging SVG elements or updating a text box with a value.  I tried animations like easing at first, but noticed that FireFox, Safari and Opera all were pretty slow in rendering these.

Here’s an example SVG file of the 2010 U.S. Budget.  You can click the bars to get an updated total of the selection.  I got the budget numbers from this Wikipedia article.

interactivesvg

Also, try this pairs plot of the states dataset, with tooltips and zoomable subplots.

Tar file and example SVG files at http://mike-love.net/interactiveSVG/