When looking at sequential data (e.g. time series or genomic data), any inference that compares different sequences needs to take into account local correlations within each sequence. For example, you might want to know how often it is raining in two cities at the same time, and whether this is more often than expected by chance. But it is more likely to rain on a given day if it rained the day before, and this dependence changes the distribution of overlap expected by chance. In the language of stochastic processes, the observations are serially dependent (autocorrelated) rather than independent, even when the process itself is stationary.
One way around the problem of estimating the chance distribution of overlap between two processes is the block bootstrap. Instead of randomly shifting individual features in the sequence (what I call naive permutation), you build new sequences at random from large blocks of the original sequence, which preserves the dependence within each block. The overlap of features expected by chance can then be estimated from many such sequences. Here is a single bootstrap sample (top sequence) constructed in this manner from the data (bottom sequence).
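To make the comparison concrete, here is a minimal sketch in R. It assumes two 0/1 sequences of equal length (rain or no rain per day), a fixed block length, and the moving-block variant of the bootstrap. The simulated data and the names `sim_sticky`, `block_bootstrap` and `overlap` are made up for illustration; they are not from the original analysis.

```r
set.seed(1)

# Simulate a 0/1 sequence with persistence: each day keeps the previous
# day's state with probability `stay` (hypothetical toy model).
sim_sticky <- function(n, stay = 0.8) {
  x <- integer(n)
  x[1] <- rbinom(1, 1, 0.5)
  for (i in 2:n) {
    x[i] <- if (runif(1) < stay) x[i - 1] else 1L - x[i - 1]
  }
  x
}

# Build one bootstrap sequence by concatenating randomly chosen
# contiguous blocks of the original sequence (moving-block bootstrap).
block_bootstrap <- function(x, block_len) {
  n <- length(x)
  n_blocks <- ceiling(n / block_len)
  starts <- sample.int(n - block_len + 1, n_blocks, replace = TRUE)
  idx <- unlist(lapply(starts, function(s) s:(s + block_len - 1)))
  x[idx[seq_len(n)]]  # trim to the original length
}

n <- 1000
city_a <- sim_sticky(n)
city_b <- sim_sticky(n)  # independent of city_a by construction

# Number of days on which it rains in both cities.
overlap <- function(a, b) sum(a == 1 & b == 1)
observed <- overlap(city_a, city_b)

# Null distribution from the block bootstrap: resample one sequence
# in blocks of 50 days and keep the other fixed.
null_block <- replicate(2000, overlap(block_bootstrap(city_a, 50), city_b))

# Naive permutation for comparison: shuffles days independently,
# which destroys the day-to-day dependence.
null_naive <- replicate(2000, overlap(sample(city_a), city_b))

c(observed = observed, sd_block = sd(null_block), sd_naive = sd(null_naive))
```

In this toy setting the block-based null distribution should come out wider than the naive one, which is the point: ignoring the dependence makes chance overlap look less variable than it is. The block length of 50 is arbitrary here; in practice it needs to be long relative to the range of the correlation.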
Here are the frequencies of 2-mers in the human genome (hg19).
(obtained using the count-words program of RSAT)
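One line stands out: cg occurs at a frequency of roughly 1%, far below the 4% to 10% of the other 2-mers. This is due to a historical accumulation of mutations at CG sites, where methylated cytosines tend to mutate to thymine, and is called CG suppression.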
seq identifier observed_freq occ
aa aa 0.0977693510124 279490734
ac ac 0.0503391220503 143903156
ag ag 0.0699208325000 199880889
at at 0.0772705679279 220891389
ca ca 0.0725344058342 207352244
cc cc 0.0520831825569 148888857
cg cg 0.0098517609035 28162976
ct ct 0.0699588753085 199989641
ga ga 0.0593289285247 169602085
gc gc 0.0426523912572 121929296
gg gg 0.0521099551064 148965391
gt gt 0.0504530230976 144228762
ta ta 0.0656671783246 187721077
tc tc 0.0593535812105 169672559
tg tg 0.0726616984034 207716132
tt tt 0.0980451459821 280279142
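One way to quantify how much the cg line stands out is the ratio of observed to expected frequency, where the expected frequency is the product of the frequencies of the two single nucleotides. The sketch below assumes the single-nucleotide frequencies can be approximated by the marginals of the 2-mer table itself; the variable names are mine.

```r
# Dinucleotide frequencies copied from the table above (hg19, RSAT count-words).
freq <- c(
  aa = 0.0977693510124, ac = 0.0503391220503,
  ag = 0.0699208325000, at = 0.0772705679279,
  ca = 0.0725344058342, cc = 0.0520831825569,
  cg = 0.0098517609035, ct = 0.0699588753085,
  ga = 0.0593289285247, gc = 0.0426523912572,
  gg = 0.0521099551064, gt = 0.0504530230976,
  ta = 0.0656671783246, tc = 0.0593535812105,
  tg = 0.0726616984034, tt = 0.0980451459821
)

first  <- substr(names(freq), 1, 1)  # first base of each 2-mer
second <- substr(names(freq), 2, 2)  # second base of each 2-mer

# Marginal frequency of each base in the first and second position.
p_first  <- tapply(freq, first, sum)
p_second <- tapply(freq, second, sum)

# Expected 2-mer frequency if the two positions were independent.
expected <- as.numeric(p_first[first]) * as.numeric(p_second[second])

# Observed / expected ratio, smallest first.
ratio <- freq / expected
round(sort(ratio), 2)
```

By this measure cg is the most depleted 2-mer by a wide margin, at about a quarter of its expected frequency.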
It’s great when you can easily reproduce the statistical analysis in a paper because 1) the authors made their data available and their methods clear, and 2) there is open source software for bioinformatics like Bioconductor. I wanted to use the data from a July 2010 paper in Genome Research: MicroRNA, mRNA, and protein expression link development and aging in human and macaque brain (Somel et al.). The article mentions that the data was submitted to the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo) under series accession no. GSE18069. From that page you can find the mRNA expression data under series number GSE17757.
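library(GEOquery)  # Bioconductor package that provides getGEO()
e = getGEO("GSE17757")[[1]]  # getGEO() returns a list; take the first ExpressionSet
dim(exprs(e))
# [1] 13317 51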
So here’s the first figure in the paper, showing that the first two principal components of mRNA expression separate the development and aging phases of the 23 humans.
The first two principal components of mRNA … expression in human and rhesus macaque brains. The analysis was performed by singular value decomposition, using the “prcomp” function in the R “stats” package, with each gene scaled to unit variance before analysis.
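# count missing values per probe across the human samples
nas = apply(exprs(e[,pData(e)$organism_ch=="Homo sapiens"]),1,function(x) sum(is.na(x)))
# keep the human samples and the probes with no missing values
e2 = e[nas==0,pData(e)$organism_ch=="Homo sapiens"]
# age is recorded in days; convert to years
years = as.numeric(sub("age: (\\d+) day[s]*", "\\1", as.character(pData(e2)$characteristics_ch1.1)))/365
# standardize each probe separately within human batch 2 and within the remaining samples
batch = pData(e2)$characteristics_ch1.3 == "batch: human batch 2"
exprs(e2)[,batch] = t(scale(t(exprs(e2)[,batch])))
exprs(e2)[,!batch] = t(scale(t(exprs(e2)[,!batch])))
# PCA with samples as rows and probes as columns
pc = prcomp(t(exprs(e2)))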
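The caption says the analysis was done by singular value decomposition via prcomp. As a sanity check on what that means, here is a small self-contained sketch on simulated data (a random matrix standing in for the expression values, not the GEO data) showing that the prcomp scores match an SVD of the centered matrix up to sign.

```r
set.seed(1)

# Hypothetical stand-in for the expression matrix:
# 200 genes (rows) by 23 samples (columns).
x <- matrix(rnorm(200 * 23), nrow = 200)

# Scale each gene (row) to mean 0 and unit variance.
x_scaled <- t(scale(t(x)))

# prcomp() expects samples in rows, so transpose.
pc <- prcomp(t(x_scaled))

# The same decomposition by hand: SVD of the centered sample-by-gene matrix.
centered <- scale(t(x_scaled), center = TRUE, scale = FALSE)
s <- svd(centered)
scores_svd <- s$u %*% diag(s$d)

# The scores agree up to the sign of each component.
max_diff <- max(abs(abs(pc$x[, 1:2]) - abs(scores_svd[, 1:2])))
max_diff

# Proportion of variance explained by each component.
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
round(var_explained[1:2], 3)
```

With the real data, the figure would then be a scatter plot of pc$x[,1] against pc$x[,2], with points labelled or coloured by age.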
This is a great Nature Genetics paper from 2008 that my labmate Owen showed me. The punchline is that you have to be careful when interpreting the results from principal component analysis:
Interpreting principal component analyses of spatial population genetic variation
Nature Genetics 40, 646 – 649 (2008)
John Novembre & Matthew Stephens
Nearly 30 years ago, Cavalli-Sforza et al. pioneered the use of principal component analysis (PCA) in population genetics and used PCA to produce maps summarizing human genetic variation across continental regions. They interpreted gradient and wave patterns in these maps as signatures of specific migration events.
Because the basis for these interpretive guidelines is unclear, we performed simulations to investigate whether such specific migration events are necessary to explain the observed patterns. Specifically, we performed PCA on data simulated under equilibrium population genetic models without range expansions, assuming a constant homogeneous short-range migration process across both time and (two-dimensional) space. The results showed highly distinctive structure. For example, the first two PC maps show large-scale orthogonal gradients, and the next two show ‘saddle’ and ‘mound’ patterns.
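A toy version of the effect can be seen without simulating any genetics. The sketch below is not the authors' simulation: it is a one-dimensional analogue that assumes populations sit along a line and that the covariance between two populations decays exponentially with the distance between them (`n` and `rho` are arbitrary choices of mine). The eigenvectors of such a covariance matrix are what PCA returns.

```r
n <- 50      # number of populations along a line (hypothetical)
rho <- 0.9   # correlation between neighbouring populations (hypothetical)

# Covariance that depends only on distance: rho^|i - j|.
pos <- seq_len(n)
covmat <- rho^abs(outer(pos, pos, "-"))

# Principal components are the eigenvectors of the covariance matrix,
# ordered by decreasing eigenvalue.
eig <- eigen(covmat, symmetric = TRUE)

# Count how often each leading eigenvector changes sign along the line.
sign_changes <- function(v) sum(diff(sign(v)) != 0)
changes <- sapply(1:4, function(k) sign_changes(eig$vectors[, k]))
changes
```

The leading eigenvectors look like waves of increasing frequency along the line (plot eig$vectors[, 1:4] against pos to see them), even though nothing like a migration event went into the model. The order and shapes differ from the two-dimensional maps in the paper, so treat this only as an illustration of how distance-dependent covariance alone produces structured components.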
I hadn’t seen these numbers on genetic variation within and between populations before. From a 2004 paper, Genetic variation, classification and ‘race’:
The average proportion of nucleotide differences between a randomly chosen pair of humans (i.e., average nucleotide diversity, or pi) is consistently estimated to lie between 1 in 1,000 and 1 in 1,500 (refs. 9,10). This proportion is low compared with those of many other species, from fruit flies to chimpanzees (refs. 11,12), reflecting the recent origin of our species from a small founding population (ref. 13). The pi-value for Homo sapiens can be put into perspective by considering that humans differ from chimpanzees at only 1 in 100 nucleotides, on average (refs. 14,15). Because there are approximately three billion nucleotide base pairs in the haploid human genome, each pair of humans differs, on average, by two to three million base pairs.
Of the 0.1% of DNA that varies among individuals, what proportion varies among main populations? Consider an apportionment of Old World populations into three continents (Africa, Asia and Europe), a grouping that corresponds to a common view of three of the ‘major races’ (refs. 16,17). Approximately 85–90% of genetic variation is found within these continental groups, and only an additional 10–15% of variation is found between them (refs. 18,19,20) (Table 1). In other words, ~90% of total genetic variation would be found in a collection of individuals from a single continent, and only ~10% more variation would be found if the collection consisted of Europeans, Asians and Africans. The proportion of total genetic variation ascribed to differences between continental populations, called FST, is consistent, regardless of the type of autosomal loci examined (Table 1). FST varies, however, depending on how the human population is divided. If four Old World populations (European, African, East Asian and Indian subcontinent) are examined instead of three, FST (estimated for 100 Alu element insertion polymorphisms) decreases from 14% to 10% (ref. 21). These estimates of FST and pi tell us that humans vary only slightly at the DNA level and that only a small proportion of this variation separates continental populations.
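The arithmetic behind the "two to three million base pairs" figure is easy to check. This uses only the numbers given in the quoted passage; the variable names are mine.

```r
genome_size <- 3e9                 # haploid base pairs, from the quote
pi_range <- c(1 / 1500, 1 / 1000)  # nucleotide diversity range, from the quote

# Expected number of differing base pairs between two humans.
diffs <- genome_size * pi_range
diffs / 1e6                        # in millions of base pairs

# Human-chimpanzee divergence relative to within-human diversity.
chimp_div <- 1 / 100               # from the quote
fold <- chimp_div / pi_range
fold
```

So by these figures, a human and a chimpanzee differ at roughly 10 to 15 times as many sites as two humans do.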
From the NYT:
A federal judge on Monday struck down patents on two genes linked to breast and ovarian cancer. The decision, if upheld, could throw into doubt the patents covering thousands of human genes and reshape the law of intellectual property.
Judge Sweet, however, ruled that the patents were “improperly granted” because they involved a “law of nature.” He said that many critics of gene patents considered the idea that isolating a gene made it patentable “a ‘lawyer’s trick’ that circumvents the prohibition on the direct patenting of the DNA in our bodies but which, in practice, reaches the same result.”
The case could have far-reaching implications. About 20 percent of human genes have been patented, and multibillion-dollar industries have been built atop the intellectual property rights that the patents grant.
The pro-patenting argument:
Edward Reines, a patent lawyer who represents biotechnology firms but was not involved in the case, said loss of patent protection could diminish the incentives for genetic research.
“The genetic tools to solve the major health problems of our time have not been found yet,” said Mr. Reines, who is with the Silicon Valley office of the firm Weil, Gotshal & Manges. “These are the discoveries we want to motivate by providing incentives to all the researchers out there.”
I’m very skeptical that biotech firms need to patent genes to have an incentive for research. Firms and schools should be content to patent novel inventions and processes related to the study of genetics, not correlations between genes and diseases.
Also, a WSJ article.
The judge’s 152-page decision (pdf).
The New York Times has an article, Disease Cause Is Pinpointed With Genome, which is a good overview of the status of whole genome sequencing for disease research.
Besides identifying disease genes, one team, in Seattle, was able to make the first direct estimate of the number of mutations, or changes in DNA, that are passed on from parent to child. They calculate that of the three billion units in the human genome, 60 per generation are changed by random mutation — considerably less than previously thought.
That study is by Roach et al. in Science.
On genome-wide association studies:
And in most diseases the culprit DNA was linked to only a small portion of all the cases of the disease. It seemed that natural selection has weeded out any disease-causing mutation before it becomes common. The finding implies that common diseases, surprisingly, are caused by rare, not common, mutations.
…implying we need to do more fine-grained studies of genomes. On the cost of whole genome sequencing:
The family whose genomes they report in Science were sequenced by a company with a new DNA sequencing method, Complete Genomics of Mountain View, Calif., at a cost of $25,000 each. Clifford Reid, the chief executive, said that the company was scaling up to sequence 500 genomes a month and that for large projects the price per genome would soon drop below $10,000. “We are on our way to the $5,000 genome,” he said.