I’ve been experimenting with the NumPy Python package recently, which has fast and intuitive operations on arrays. I tried implementing the “nearest shrunken centroid method” of Tibshirani, Hastie, Narasimhan and Chu (2002), which is also implemented in the Prediction Analysis for Microarrays (PAM) R package. The shrunken centroids classifier is a method for dealing with large numbers of noisy features. Like penalized regression, it winnows the data down to a subset of useful features.
In the case of gene expression data, the algorithm calculates class centroids, then shrinks each gene's class centroid towards the overall centroid by a threshold amount. A gene whose class centroids all shrink to the overall centroid no longer helps separate the classes and drops out of the classifier. Cross-validation is then used to pick the threshold, which helps identify the smallest subset of genes that still predicts accurately. The link above has a good description and the original paper. Some graphs of the output using Matplotlib:
I posted the shrunken centroids Python script on GitHub, along with some sample data to run it on: (khan data, delete the .doc ending).
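For anyone who wants the gist without reading the whole script, here is a minimal NumPy sketch of the shrinkage and classification steps. It is not the script posted above, just my reading of the method. The function names (`fit_shrunken_centroids`, `predict`) are made up for illustration, and I'm assuming the sqrt(1/n_k - 1/n) form of the class-size factor, which is one of the variants that appears in the literature. It also assumes at least two classes and no constant features.

```python
import numpy as np

def fit_shrunken_centroids(X, y, delta):
    """Sketch of nearest shrunken centroids; X is (samples, genes)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n = X.shape[0]

    overall = X.mean(axis=0)                      # overall centroid
    centroids = np.array([X[y == k].mean(axis=0) for k in classes])
    n_k = np.array([(y == k).sum() for k in classes])

    # Pooled within-class standard deviation for each gene.
    resid = X - centroids[np.searchsorted(classes, y)]
    s = np.sqrt((resid ** 2).sum(axis=0) / (n - len(classes)))
    s0 = np.median(s)                             # guards against tiny s

    # Standardized distance of each class centroid from the overall one.
    m = np.sqrt(1.0 / n_k - 1.0 / n)              # assumed variant of m_k
    scale = m[:, None] * (s + s0)[None, :]
    d = (centroids - overall) / scale

    # Soft thresholding: move d towards zero by delta, stopping at zero.
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)

    shrunken = overall + scale * d_shrunk
    priors = n_k / n
    return classes, shrunken, s + s0, priors

def predict(X, classes, shrunken, gene_scale, priors):
    """Assign each sample to the class with the nearest shrunken centroid."""
    X = np.asarray(X, dtype=float)
    diff = (X[:, None, :] - shrunken[None, :, :]) / gene_scale
    scores = (diff ** 2).sum(axis=2) - 2.0 * np.log(priors)
    return classes[scores.argmin(axis=1)]
```

With `delta=0` the shrunken centroids are just the class means. As `delta` grows, more genes have all of their class centroids pulled onto the overall centroid and stop contributing to the classification. In practice you would pick `delta` by cross-validation over a grid of values.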


April 6, 2009 at 3:43 pm
Taken out of context, this post has a very absurdist title.