Multiple testing and false discovery rate

May 21, 2009

Here is a quick explanation of false discovery rates from Brad Efron’s “Large-scale simultaneous hypothesis testing: The choice of a null hypothesis”:

We begin with a simple Bayes model. Suppose that the N z-values fall into two classes, “Uninteresting” or “Interesting”, corresponding to whether or not z_i is generated according to the null hypothesis, with prior probabilities p0 and p1 = 1 − p0 , for the classes; and that z_i has density either f_0(z) or f_1(z) depending on its class,

p0 = Prob{Uninteresting}, f_0(z) density if Uninteresting (Null)

p1 = Prob{Interesting}, f_1(z) density if Interesting (Non-Null) .

The smooth curve in Figure 1 estimates the mixture density f(z),

f(z) = p0 * f_0(z) + p1 * f_1(z) .

According to Bayes theorem the a posteriori probability of being in the Uninteresting class given z is

Prob{Uninteresting|z} = p0 * f_0(z)/f(z) .

Here we define the local false discovery rate to be

fdr(z) ≡ f_0(z)/f(z) ,

ignoring the factor p0, so fdr(z) is an upper bound on Prob{Uninteresting|z}. In fact p0 can be roughly estimated, but we are assuming that p0 is near 1, say p0 ≥ 0.90, so fdr(z) is not a flagrant overestimator.
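
As a rough numerical illustration of the formulas above, here is a minimal NumPy sketch. The N(0,1) null density, the N(3,1) non-null density and p0 = 0.9 are assumptions picked for illustration, not values from Efron's paper, and the function names are hypothetical.

```python
import numpy as np

def normal_pdf(z, mean=0.0, sd=1.0):
    """Density of a normal distribution evaluated at z."""
    z = np.asarray(z, dtype=float)
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def two_group(z, p0=0.9, alt_mean=3.0):
    """Return (fdr, Prob{Uninteresting|z}) under the assumed two-group model."""
    f0 = normal_pdf(z)                  # null density, assumed N(0, 1)
    f1 = normal_pdf(z, mean=alt_mean)   # non-null density, assumed N(alt_mean, 1)
    f = p0 * f0 + (1.0 - p0) * f1       # mixture density f(z)
    fdr = f0 / f                        # local fdr, ignoring the factor p0
    posterior = p0 * f0 / f             # Bayes posterior of the null class
    return fdr, posterior

z = np.array([0.0, 2.0, 3.0, 4.0])
fdr, posterior = two_group(z)
for zi, a, b in zip(z, fdr, posterior):
    print(f"z={zi:.1f}  fdr={a:.3f}  Prob(Uninteresting|z)={b:.3f}")
```

Since f(z) ≥ p0 * f_0(z), fdr(z) can never exceed 1/p0, and it differs from Prob{Uninteresting|z} only by the factor p0, which is why the overestimate is mild when p0 is near 1.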


Dirichlet

May 7, 2009

Wikipedia articles on statistics are great.  I didn’t know the Dirichlet had a “balls in an urn” explanation.  But I’m not too surprised as everything can be explained by balls in urns.

http://en.wikipedia.org/wiki/Dirichlet_distribution

Pólya urn

Consider an urn containing balls of K different colors. Initially, the urn contains α1 balls of color 1, α2 balls of color 2, and so on. Now perform N draws from the urn, where after each draw, the ball is placed back into the urn with another ball of the same color. In the limit as N approaches infinity, the proportions of different colored balls in the urn will be distributed as Dir(α1,...,αK).

Note that each draw from the urn modifies the probability of drawing a ball of any one color from the urn in the future. This modification diminishes with the number of draws, since the relative effect of adding a new ball to the urn diminishes as the urn accumulates increasing numbers of balls. This “diminishing returns” effect can also help explain how large α values yield Dirichlet distributions with most of the probability mass concentrated around a single point on the simplex.
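
The urn scheme is easy to simulate. Below is a minimal NumPy sketch that runs many finite urns and compares the resulting proportions with direct Dirichlet samples. The α values, the number of draws and the number of urns are arbitrary choices for illustration, and `polya_urn` is a hypothetical helper name. A finite number of draws only approximates the limiting distribution.

```python
import numpy as np

def polya_urn(alpha, n_draws, rng):
    """Run one Polya urn and return the final color proportions."""
    counts = np.array(alpha, dtype=float)   # initial balls of each color
    for _ in range(n_draws):
        # Draw a color with probability proportional to its current count,
        # then put the ball back together with one more of the same color.
        color = rng.choice(len(counts), p=counts / counts.sum())
        counts[color] += 1.0
    return counts / counts.sum()

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0])

urn = np.array([polya_urn(alpha, 200, rng) for _ in range(500)])
direct = rng.dirichlet(alpha, size=500)
expected = alpha / alpha.sum()              # mean of Dir(alpha)

print("urn mean:      ", urn.mean(axis=0).round(3))
print("Dirichlet mean:", direct.mean(axis=0).round(3))
print("expected mean: ", expected.round(3))

# Larger alpha values concentrate the mass around a single point.
small_spread = rng.dirichlet(alpha, size=1000).std(axis=0)
large_spread = rng.dirichlet(100 * alpha, size=1000).std(axis=0)
print("spread, alpha:      ", small_spread.round(3))
print("spread, 100 * alpha:", large_spread.round(3))
```

Starting the urn with 100 times as many balls of each color means every new ball changes the proportions very little, which matches the much smaller spread of Dir(100α).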


Centroids

April 5, 2009

I’ve been experimenting with the NumPy Python package recently, which has fast and intuitive operations on arrays. I tried implementing the “nearest shrunken centroid method” of Tibshirani, Hastie, Narasimhan and Chu (2002) from the Predictive Analysis for Microarrays R package. The shrunken centroids classifier is a method for dealing with large numbers of noisy features. It is similar in some respects to penalized regression, in winnowing down to a subset of useful features.

In the case of gene expression data, the algorithm calculates class centroids, then shrinks each gene of the class centroids towards the overall centroid by a certain threshold. Choosing the threshold by cross-validation helps identify the smallest subset of genes that still predicts accurately. The link above has a good description and the original paper. Some graphs of the output using Matplotlib:

[Figures: shrunken4, shrunken5]
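
The shrinkage step described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the posted script: it soft-thresholds standardized centroid differences, and the particular scaling constants (`m`, `s0`) are one common choice that I am assuming here. The function name and the synthetic data are hypothetical, and classification and cross-validation are left out.

```python
import numpy as np

def shrunken_centroids(X, y, delta):
    """Soft-threshold class centroids toward the overall centroid.

    X: (n_samples, n_genes) array, y: class labels, delta: threshold.
    Returns (classes, shrunken centroids of shape (n_classes, n_genes)).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n = X.shape[0]
    overall = X.mean(axis=0)
    centroids = np.array([X[y == k].mean(axis=0) for k in classes])
    n_k = np.array([(y == k).sum() for k in classes])

    # Pooled within-class standard deviation for each gene.
    resid = X - centroids[np.searchsorted(classes, y)]
    s = np.sqrt((resid ** 2).sum(axis=0) / (n - len(classes)))
    s0 = np.median(s)                    # offset guarding against tiny variances
    m = np.sqrt(1.0 / n_k - 1.0 / n)     # per-class scaling (assumed form)

    scale = m[:, None] * (s + s0)[None, :]
    d = (centroids - overall) / scale    # standardized differences
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)  # soft threshold
    return classes, overall + scale * d_shrunk

# Synthetic data: 50 "genes", only the first 5 differ between the classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 50))
y = np.repeat([0, 1], 20)
X[y == 1, :5] += 3.0

classes, shrunk = shrunken_centroids(X, y, delta=2.5)
overall = X.mean(axis=0)
active = np.flatnonzero(np.any(shrunk != overall, axis=0))
print("genes surviving shrinkage:", active)
```

A gene whose standardized difference falls below the threshold in every class is shrunk exactly onto the overall centroid, so it no longer helps separate the classes. With this synthetic data, mostly just the first five genes should survive.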

I posted the shrunken centroids Python script on GitHub, along with some sample data to run it on (the Khan data; delete the .doc ending).


NYT wage discrepancy graphic

March 19, 2009

interactiveSVG

March 17, 2009

interactiveSVG is an R package I made for adding ECMAScript interactions to SVG plots.  Flash is the main choice for data visualizations on webpages, but I like the idea of open formats and I think it’s great that you can read the source code of an SVG file.

The package uses the RSVGTipsDevice package that I posted here earlier, and does some scraping and rearranging of the SVG file.  ECMAScript onclick events do the work of rearranging SVG elements or updating a text box with a value.  I tried animations like easing at first, but noticed that Firefox, Safari and Opera were all pretty slow at rendering these.

Here’s an example SVG file of the 2010 U.S. Budget.  You can click the bars to get an updated total of the selection.  I got the budget numbers from this Wikipedia article.

Also, try this pairs plot of the states dataset, with tooltips and zoomable subplots.

Tar file and example SVG files at http://mike-love.net/interactiveSVG/


Speed reading for mobiles

February 12, 2009

This is great.  I am a big fan of automated scrolling, and news aggregation.

Spreed News via infosthetics.


Colorful Graph Visualization

February 12, 2009

New set of Flare visualizations from Moritz Stefaner, who I posted about a while ago.

FlowingData sez: “this series of four visualizations – radial diagram, stacked, clustering, and network map – explore journal article citations.”

via Ranking and Mapping Scientific Knowledge – eigenfactor | FlowingData.