Centroids

April 5, 2009

I’ve been experimenting with the NumPy Python package recently, which has fast and intuitive operations on arrays. I tried implementing the “nearest shrunken centroid method” of Tibshirani, Hastie, Narasimhan and Chu (2002) from the Predictive Analysis for Microarrays R package. The shrunken centroids classifier is a method for dealing with large numbers of noisy features. It is similar in some respects to penalized regression, in winnowing down to a subset of useful features.

In the case of gene expression data, the algorithm calculates class centroids, then shrinks each gene of the class centroids towards the overall centroid by a certain threshold. This step helps identify the smallest subset of genes that still gives predictive accuracy (using cross-validation).  The link above has a good description and the original paper. Some graphs of the output using Matplotlib:

shrunken4

shrunken5

I posted the shrunken centroids python script on github and some sample data to run it on: (khan data, delete the .doc ending).

One Response to “Centroids”

  1. dorukakan Says:

    Taken out of context, this post has a very absurdist title.


Leave a Reply