False discovery rate

May 21, 2009

When doing thousands of simultaneous tests, such as in a gene association study, the distribution of the resulting statistics does not always follow the theoretical distribution.  This could be because the test statistics are not independent samples from an identical t distribution, for instance.  There are methods for dealing with this that fit just the center of the observed values to estimate the distribution under the null hypothesis.  The tails of the fitted distribution can then be used to estimate Prob{reject null | null}.  Here is a nice explanation of false discovery rates from Brad Efron’s Large-scale simultaneous hypothesis testing: The choice of a null hypothesis:

We begin with a simple Bayes model. Suppose that the N z-values fall into two classes, “Uninteresting” or “Interesting”, corresponding to whether or not z_i is generated according to the null hypothesis, with prior probabilities p0 and p1 = 1 − p0 , for the classes; and that z_i has density either f_0(z) or f_1(z) depending on its class,

p0 = Prob{Uninteresting}, f_0(z) density if Uninteresting (Null)

p1 = Prob{Interesting}, f_1(z) density if Interesting (Non-Null) .

The smooth curve in Figure 1 estimates the mixture density f(z),

f(z) = p0 * f_0(z) + p1 * f_1(z) .

According to Bayes theorem the a posteriori probability of being in the Uninteresting class given z is

Prob{Uninteresting|z} = p0 * f_0(z)/f(z) .

Here we define the local false discovery rate to be

fdr(z) ≡ f_0(z)/f(z) ,

ignoring the factor p0, so fdr(z) is an upper bound on Prob{Uninteresting|z}. In fact p0 can be roughly estimated, but we are assuming that p0 is near 1, say p0 ≥ 0.90, so fdr(z) is not a flagrant overestimator.
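
To make the formulas in the quote concrete, here is a minimal Python sketch. It is not Efron's estimation procedure: the two densities and p0 are simply assumed to be known (null N(0, 1), non-null N(3, 1), p0 = 0.95, all made up for illustration), so it only shows how fdr(z) relates to Prob{Uninteresting|z}.

```python
import numpy as np

def normal_pdf(z, mean=0.0, sd=1.0):
    """Normal density, written out to avoid extra dependencies."""
    z = np.asarray(z, dtype=float)
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# Assumed values for illustration only (not taken from the paper).
P0 = 0.95          # prior probability of "Uninteresting"
P1 = 1.0 - P0      # prior probability of "Interesting"

def f0(z):
    return normal_pdf(z, 0.0, 1.0)   # null density f_0(z)

def f1(z):
    return normal_pdf(z, 3.0, 1.0)   # non-null density f_1(z), assumed

def mixture(z):
    # f(z) = p0 * f_0(z) + p1 * f_1(z)
    return P0 * f0(z) + P1 * f1(z)

def posterior_uninteresting(z):
    # Prob{Uninteresting|z} = p0 * f_0(z) / f(z)
    return P0 * f0(z) / mixture(z)

def local_fdr(z):
    # fdr(z) = f_0(z) / f(z), i.e. the posterior with the factor p0 dropped
    return f0(z) / mixture(z)

z_grid = np.linspace(-4.0, 6.0, 11)
for z, fdr, post in zip(z_grid, local_fdr(z_grid), posterior_uninteresting(z_grid)):
    print("z = %5.1f   fdr = %.4f   Prob{Uninteresting|z} = %.4f" % (z, fdr, post))
```

Because fdr(z) is the posterior divided by p0, it is never more than a factor 1/p0 above Prob{Uninteresting|z}, which is why the overestimate is mild when p0 is close to 1. Near the center of the null it can come out slightly above 1.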


Dirichlet

May 7, 2009

Wikipedia articles on statistics are great.  I didn’t know the Dirichlet had a “balls in an urn” explanation.  But I’m not too surprised as everything can be explained by balls in urns.

http://en.wikipedia.org/wiki/Dirichlet_distribution

Pólya urn

Consider an urn containing balls of K different colors. Initially, the urn contains α1 balls of color 1, α2 balls of color 2, and so on. Now perform N draws from the urn, where after each draw, the ball is placed back into the urn with another ball of the same color. In the limit as N approaches infinity, the proportions of different colored balls in the urn will be distributed as Dir(α1,...,αK).

Note that each draw from the urn modifies the probability of drawing a ball of any one color from the urn in the future. This modification diminishes with the number of draws, since the relative effect of adding a new ball to the urn diminishes as the urn accumulates increasing numbers of balls. This “diminishing returns” effect can also help explain how large α values yield Dirichlet distributions with most of the probability mass concentrated around a single point on the simplex.
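
The urn scheme is easy to check by simulation. The sketch below is my own illustration, not from the Wikipedia article: it runs many independent urns for a finite number of draws (the starting counts, number of draws, and number of urns are arbitrary choices) and compares the resulting color proportions with samples from NumPy's Dirichlet generator.

```python
import numpy as np

def polya_urn_proportions(alpha, n_draws, n_urns, rng):
    """Run n_urns independent Polya urns and return the final color proportions.

    alpha   : starting number of balls of each color
    n_draws : draws per urn; each drawn ball goes back with one extra ball of its color
    """
    counts = np.tile(np.asarray(alpha, dtype=float), (n_urns, 1))
    n_colors = counts.shape[1]
    rows = np.arange(n_urns)
    for _ in range(n_draws):
        # Pick a ball uniformly at random in every urn at once:
        # u falls somewhere along the cumulative ball counts of that urn.
        cum = np.cumsum(counts, axis=1)
        u = rng.random(n_urns) * cum[:, -1]
        colors = (u[:, None] >= cum).sum(axis=1)
        colors = np.minimum(colors, n_colors - 1)   # guard against float rounding
        counts[rows, colors] += 1.0                 # add one more ball of that color
    return counts / counts.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0])                   # arbitrary example values
urn_props = polya_urn_proportions(alpha, n_draws=500, n_urns=2000, rng=rng)
dirichlet_draws = rng.dirichlet(alpha, size=2000)

print("urn mean:       ", urn_props.mean(axis=0))
print("Dirichlet mean: ", dirichlet_draws.mean(axis=0))
print("urn variance:       ", urn_props.var(axis=0))
print("Dirichlet variance: ", dirichlet_draws.var(axis=0))
```

With a finite number of draws the match is only approximate, but the means and variances of the two sets of proportions should already be close.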


Shrunken centroids in Python

April 5, 2009

I’ve been using the NumPy Python package recently, specifically in implementing the shrunken centroids classifier used in the Predictive Analysis for Microarrays R package. NumPy + Matplotlib is great for doing interactive data analysis. NumPy has fast and intuitive operations on arrays.

I posted the shrunken centroids Python script on GitHub, and here is a link to the Khan data (delete the .doc ending). The shrunken centroids classifier is a method for dealing with large numbers of noisy features. It is similar in some respects to penalized regression, in winnowing down to a subset of useful features.

In the case of gene expression data, the algorithm calculates class centroids, then shrinks each gene of the class centroids towards the overall centroid by a certain threshold. This step helps identify the smallest subset of genes that still gives predictive accuracy (using cross-validation).  The link above has a good description and the original paper. Some graphs of the output using Matplotlib:
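
The shrinkage step can be sketched in a few lines of NumPy. This is a simplified illustration and not the script linked above or the PAM package: it standardizes each gene's centroid offset by a pooled within-class standard deviation plus its median, soft-thresholds by an assumed threshold `delta`, and leaves out the class-size factor and the cross-validation used to choose the threshold. The function names and the simulated data are hypothetical.

```python
import numpy as np

def soft_threshold(d, delta):
    """Move each value toward zero by delta; values within delta of zero become 0."""
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)

def shrunken_centroids(X, y, delta):
    """Simplified shrinkage step. X is samples x genes, y holds class labels."""
    classes = np.unique(y)
    n_samples = X.shape[0]
    overall = X.mean(axis=0)                                   # overall centroid
    centroids = np.vstack([X[y == k].mean(axis=0) for k in classes])
    # Pooled within-class standard deviation for each gene.
    resid_ss = sum(((X[y == k] - X[y == k].mean(axis=0)) ** 2).sum(axis=0)
                   for k in classes)
    s = np.sqrt(resid_ss / (n_samples - len(classes)))
    s0 = np.median(s)                                          # guards against tiny s
    d = (centroids - overall) / (s + s0)                       # standardized offsets
    d_shrunk = soft_threshold(d, delta)
    shrunk = overall + (s + s0) * d_shrunk                     # shrunken class centroids
    active = np.any(d_shrunk != 0.0, axis=0)                   # genes that survive
    return classes, shrunk, active

# Simulated data (made up): 3 classes, 50 genes, only the first 5 differ in class 0.
rng = np.random.default_rng(1)
y = np.repeat(np.arange(3), 20)
X = rng.normal(size=(60, 50))
X[y == 0, :5] += 3.0

classes, shrunk, active = shrunken_centroids(X, y, delta=0.5)
print("genes kept:", np.flatnonzero(active))
```

Genes whose offsets are shrunk to zero in every class have the same centroid for all classes, so they drop out of the classifier. Raising `delta` keeps fewer genes.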

shrunken4

shrunken5

Interestingly, I also tried the R package randomForest to see if it would find a similar subset of genes and it did. When using the default settings and sorting genes by the ‘importance’ ranking, the top 20 were nearly the same set as generated by shrunken centroids.


NYT wage discrepancy graphic

March 19, 2009

interactiveSVG

March 17, 2009

interactiveSVG is an R package I made for adding ECMAScript interactions to SVG plots.  Flash is the main choice for data visualizations on webpages, but I like the idea of open formats and I think it’s great that you can read the source code of an SVG file.

The package uses the RSVGTipsDevice package that I posted here earlier, and does some scraping and rearranging of the SVG file.  ECMAScript onclick events do the work of rearranging SVG elements or updating a text box with a value.  I tried animations like easing at first, but noticed that Firefox, Safari and Opera were all pretty slow at rendering these.
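
For readers who want to see the general idea without the R package, here is a rough Python sketch that writes a small SVG bar chart with an embedded ECMAScript onclick handler. It is not how interactiveSVG works internally: the function name, colors, and numbers are all made up for illustration, and the only point is to show an onclick event updating a text element with a running total.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# ECMAScript embedded in the SVG: clicking a bar toggles its color and
# adds or removes its value from the total shown in the text element.
SCRIPT = """
var total = 0;
function toggle(evt, value) {
  var bar = evt.target;
  var wasSelected = bar.getAttribute("fill") === "orange";
  bar.setAttribute("fill", wasSelected ? "steelblue" : "orange");
  total += wasSelected ? -value : value;
  document.getElementById("total").textContent = "Selected total: " + total;
}
"""

def bar_chart_svg(items, bar_width=60, max_height=200):
    """Return SVG text for a clickable bar chart. items is a list of (label, int value)."""
    top = max(value for _, value in items)
    width = 10 + len(items) * (bar_width + 10)
    parts = [
        '<svg xmlns="http://www.w3.org/2000/svg" width="%d" height="%d">'
        % (width, max_height + 50),
        '<script type="text/ecmascript"><![CDATA[%s]]></script>' % SCRIPT,
    ]
    for i, (label, value) in enumerate(items):
        h = max_height * value / top
        parts.append(
            '<rect x="%d" y="%.1f" width="%d" height="%.1f" fill="steelblue" '
            'onclick="toggle(evt, %d)"><title>%s: %d</title></rect>'
            % (10 + i * (bar_width + 10), 10 + max_height - h, bar_width, h,
               value, escape(label), value)
        )
    parts.append('<text id="total" x="10" y="%d">Selected total: 0</text>'
                 % (max_height + 35))
    parts.append("</svg>")
    return "\n".join(parts)

items = [("A", 30), ("B", 50), ("C", 20)]      # placeholder values, not budget figures
svg_text = bar_chart_svg(items)
root = ET.fromstring(svg_text)                  # raises an error if the SVG is malformed
print(svg_text)
```

Saving the printed text to a .svg file and opening it in a browser should give bars that change color when clicked and update the total. The source stays readable in any text editor, which is the appeal of the format.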

Here’s an example SVG file of the 2010 U.S. Budget.  You can click the bars to get an updated total of the selection.  I got the budget numbers from this Wikipedia article.

interactivesvg

Also, try this pairs plot of the states dataset, with tooltips and zoomable subplots.

Tar file and example SVG files at http://mike-love.net/interactiveSVG/


Speed reading for mobiles

February 12, 2009

This is great.  I am a big fan of automated scrolling, and news aggregation.

Spreed News via infosthetics.


YACGV: Yet Another Colorful Graph Visualization

February 12, 2009

New set of Flare visualizations from Moritz Stefaner, who I posted about a while ago.

FlowingData sez: “this series of four visualizations – radial diagram, stacked, clustering, and network map – explore journal article citations.”

via Ranking and Mapping Scientific Knowledge – eigenfactor | FlowingData.