Reproducible
It’s great when you can easily reproduce statistical analysis in a paper because 1) the authors made their data available and their methods clear, and 2) the existence of open source software for bioinformatics like Bioconductor. I wanted to use the data from a July 2010 paper in Genome Research: MicroRNA, mRNA, and protein expression link development and aging in human and macaque brain (Somel, M.). The article mentions the data was submitted to the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo) under series accession no. GSE18069. Going to this page then you can find the mRNA expression at series number GSE17757.
library(GEOquery)
library(Biobase)
e = getGEO("GSE17757")[[1]]
dim(exprs(e))
[1] 13317 51
So here’s the first figure in the paper, showing the first two principal components of mRNA expression separate the development and aging phases of the 23 humans.

The first two principal components of mRNA … expression in human and rhesus macaque brains. The analysis was performed by singular value decomposition, using the ‘‘prcomp’’ function in the R ‘‘stats’’ package, with each gene scaled to unit variance before analysis.
nas = apply(exprs(e[,pData(e)$organism_ch=="Homo sapiens"]),1,function(x) sum(is.na(x)))
e2 = e[nas==0,pData(e)$organism_ch=="Homo sapiens"]
years = as.numeric(sub("age: (\\d+) day[s]*", "\\1", as.character(pData(e2)$characteristics_ch1.1)))/365
batch = pData(e2)$characteristics_ch1.3 == "batch: human batch 2"
exprs(e2)[,batch] = t(scale(t(exprs(e2)[,batch])))
exprs(e2)[,!batch] = t(scale(t(exprs(e2)[,!batch])))
pc = prcomp(t(exprs(e2)))
plot(pc$x[,1],pc$x[,2],pch=16,cex=.5,xlim=1.2*range(pc$x[,1]),ylim=1.2*range(pc$x[,2]),xlab="PC1",ylab="PC2")
text(pc$x[,1],pc$x[,2]+10,round(years))

leave a comment