# Mike Love’s blog

## Block bootstrap

Posted in genetics, statistics by mikelove on July 28, 2012

In looking at sequential data (e.g. time-series or genomic data), any inference comparing different sequences needs to take into account local correlations within a sequence. For example, you might want to know how often it is raining in two cities at the same time, and whether this is more than expected by chance. But it is more likely to rain on a given day if it was raining the day before, and this dependence will change the distribution of overlap expected by chance. In stochastic-process terms, the days are serially dependent (autocorrelated) rather than independent, even when the process itself is ‘stationary’.

One way out of the problem of estimating the chance distribution of overlap between two processes is the block bootstrap. Instead of randomly shifting features in the sequence (what I call naive permutation), you randomly build new sequences from large blocks of the original sequence, which preserves the local correlation within each block. The overlap of features in these resampled sequences then gives a distribution of overlap expected by chance. Here is a single bootstrap sample (top sequence) constructed in this manner from the data (bottom sequence).
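As a rough sketch of the idea, here is one way a block bootstrap of overlap could look in R. Everything in it is an assumption for illustration: the two sequences are simulated 0/1 vectors, the helper names `simulate_sequence` and `block_resample` are made up, the block length of 50 is arbitrary, and resampling one sequence while holding the other fixed is just one of several possible schemes.

```r
set.seed(1)

# Hypothetical data: a 0/1 sequence with local correlation ("rainy days").
simulate_sequence <- function(n, p_stay = 0.8) {
  x <- integer(n)
  x[1] <- rbinom(1, 1, 0.5)
  for (i in 2:n) {
    # stay in the previous state with probability p_stay, otherwise flip
    x[i] <- if (runif(1) < p_stay) x[i - 1] else 1L - x[i - 1]
  }
  x
}

# Build one bootstrap sequence by concatenating randomly chosen blocks of x.
block_resample <- function(x, block_len) {
  n <- length(x)
  n_blocks <- ceiling(n / block_len)
  starts <- sample.int(n - block_len + 1, n_blocks, replace = TRUE)
  idx <- unlist(lapply(starts, function(s) s:(s + block_len - 1)))
  x[idx[seq_len(n)]]  # trim to the original length
}

a <- simulate_sequence(1000)
b <- simulate_sequence(1000)
observed <- sum(a == 1 & b == 1)

# Chance distribution of overlap: resample a in blocks, keep b fixed.
null_overlap <- replicate(500, sum(block_resample(a, block_len = 50) == 1 & b == 1))

# One-sided p-value: how often is the resampled overlap at least as large?
p_value <- mean(null_overlap >= observed)
```

The block length is the main tuning choice: it should be long relative to the range of the local correlation, but short enough that many distinct blocks are available.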

## 2-mers

Posted in genetics by mikelove on December 2, 2011

Here are the frequencies of 2-mers in the human genome (hg19).

(obtained using the count-words program of RSAT)

One line, cg, stands out due to a historic accumulation of certain mutations, called CG suppression.

```
seq  identifier  observed_freq    occ
aa   aa          0.0977693510124  279490734
ac   ac          0.0503391220503  143903156
ag   ag          0.0699208325000  199880889
at   at          0.0772705679279  220891389
ca   ca          0.0725344058342  207352244
cc   cc          0.0520831825569  148888857
cg   cg          0.0098517609035  28162976
ct   ct          0.0699588753085  199989641
ga   ga          0.0593289285247  169602085
gc   gc          0.0426523912572  121929296
gg   gg          0.0521099551064  148965391
gt   gt          0.0504530230976  144228762
ta   ta          0.0656671783246  187721077
tc   tc          0.0593535812105  169672559
tg   tg          0.0726616984034  207716132
tt   tt          0.0980451459821  280279142
```
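To see how far cg falls below what base composition alone would predict, one option is an observed/expected ratio. The sketch below assumes single-base frequencies can be taken as marginals of the 2-mer table, and that "expected" means independent adjacent bases; the variable names are mine, not from RSAT.

```r
# Observed 2-mer frequencies copied from the table above (hg19).
freq <- c(
  aa = 0.0977693510124, ac = 0.0503391220503, ag = 0.0699208325000, at = 0.0772705679279,
  ca = 0.0725344058342, cc = 0.0520831825569, cg = 0.0098517609035, ct = 0.0699588753085,
  ga = 0.0593289285247, gc = 0.0426523912572, gg = 0.0521099551064, gt = 0.0504530230976,
  ta = 0.0656671783246, tc = 0.0593535812105, tg = 0.0726616984034, tt = 0.0980451459821
)

# Single-base frequencies estimated as marginals of the 2-mer table.
first  <- substr(names(freq), 1, 1)
second <- substr(names(freq), 2, 2)
f_first  <- tapply(freq, first, sum)   # frequency of each base in position 1
f_second <- tapply(freq, second, sum)  # frequency of each base in position 2

# Expected frequency if adjacent bases were independent, and obs/exp ratio.
expected <- as.numeric(f_first[first]) * as.numeric(f_second[second])
ratio <- freq / expected

round(sort(ratio), 2)
```

With these numbers cg has by far the smallest ratio, well below 1, while the other 2-mers sit much closer to 1.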

## Reproducible

Posted in genetics, statistics by mikelove on December 18, 2010

It’s great when you can easily reproduce the statistical analysis in a paper because 1) the authors made their data available and their methods clear, and 2) open source software for bioinformatics, like Bioconductor, exists. I wanted to use the data from a July 2010 paper in Genome Research: MicroRNA, mRNA, and protein expression link development and aging in human and macaque brain (Somel et al.). The article mentions that the data were submitted to the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo) under series accession no. GSE18069. From that page you can find the mRNA expression data under series GSE17757.

```
library(GEOquery)
library(Biobase)
e = getGEO("GSE17757")[[1]]
dim(exprs(e))
# [1] 13317 51
```

So here’s the first figure in the paper, showing that the first two principal components of mRNA expression separate the development and aging phases of the 23 humans.

The first two principal components of mRNA … expression in human and rhesus macaque brains. The analysis was performed by singular value decomposition, using the ‘‘prcomp’’ function in the R ‘‘stats’’ package, with each gene scaled to unit variance before analysis.

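```
# count missing values per gene across the human samples
nas = apply(exprs(e[,pData(e)$organism_ch=="Homo sapiens"]),1,function(x) sum(is.na(x)))
# keep human samples and genes with no missing values
e2 = e[nas==0,pData(e)$organism_ch=="Homo sapiens"]
# age in years, parsed from strings like "age: 123 days"
years = as.numeric(sub("age: (\\d+) day[s]*", "\\1", as.character(pData(e2)$characteristics_ch1.1)))/365
# scale each gene to mean 0 and unit variance, separately within each batch
batch = pData(e2)$characteristics_ch1.3 == "batch: human batch 2"
exprs(e2)[,batch] = t(scale(t(exprs(e2)[,batch])))
exprs(e2)[,!batch] = t(scale(t(exprs(e2)[,!batch])))
# PCA with samples as rows, then plot PC1 vs PC2 labeled by age in years
pc = prcomp(t(exprs(e2)))
plot(pc$x[,1],pc$x[,2],pch=16,cex=.5,xlim=1.2*range(pc$x[,1]),ylim=1.2*range(pc$x[,2]),xlab="PC1",ylab="PC2")
text(pc$x[,1],pc$x[,2]+10,round(years))
```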
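The caption's remark that the analysis "was performed by singular value decomposition" can be checked without downloading anything. The sketch below uses a made-up random matrix in place of the expression data (an assumption for illustration only) and compares the `prcomp` scores with those computed directly from `svd`.

```r
set.seed(1)

# Hypothetical expression matrix: 50 genes (rows) by 10 samples (columns).
expr <- matrix(rnorm(50 * 10), nrow = 50, ncol = 10)

# Scale each gene (row) to mean 0 and unit variance, as in the code above.
scaled <- t(scale(t(expr)))

# PCA with samples as rows.
pc <- prcomp(t(scaled))

# The same scores from the singular value decomposition X = U D V'.
s <- svd(t(scaled))
scores_svd <- s$u %*% diag(s$d)

# Scores agree up to the sign of each component.
same_scores <- all.equal(abs(unname(pc$x)), abs(scores_svd))

# Component standard deviations are the singular values over sqrt(n - 1).
same_sdev <- all.equal(pc$sdev, s$d / sqrt(ncol(expr) - 1))
```

Because each gene is already centered by the scaling step, the default centering inside `prcomp` changes nothing here.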

## PCA in population genetics

Posted in genetics, statistics by mikelove on December 15, 2010

This is a great Nature Genetics paper from 2008 that my labmate Owen showed me. The punchline is that you have to be careful when interpreting the results from principal component analysis:

Interpreting principal component analyses of spatial population genetic variation
Nature Genetics 40, 646 – 649 (2008)
John Novembre & Matthew Stephens
http://www.nature.com/ng/journal/v40/n5/full/ng.139.html

Nearly 30 years ago, Cavalli-Sforza et al. pioneered the use of principal component analysis (PCA) in population genetics and used PCA to produce maps summarizing human genetic variation across continental regions. They interpreted gradient and wave patterns in these maps as signatures of specific migration events.

Because the basis for these interpretive guidelines is unclear, we performed simulations to investigate whether such specific migration events are necessary to explain the observed patterns. Specifically, we performed PCA on data simulated under equilibrium population genetic models without range expansions, assuming a constant homogeneous short-range migration process across both time and (two-dimensional) space. The results showed highly distinctive structure. For example, the first two PC maps show large-scale orthogonal gradients, and the next two show ‘saddle’ and ‘mound’ patterns.
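A toy version of this effect can be seen in one dimension. The sketch below is not the authors' simulation: it assumes populations evenly spaced on a line, assumes the covariance between populations decays geometrically with distance (the values of `n` and `rho` are arbitrary), and skips the centering step of a real PCA. It simply looks at the leading eigenvectors of that covariance matrix.

```r
# Hypothetical 1-D habitat: n populations on a line, with covariance
# between populations decaying with distance (an assumption).
n <- 100
rho <- 0.95
Sigma <- rho^abs(outer(1:n, 1:n, "-"))

# Eigenvectors of the covariance matrix play the role of the PC "maps".
ev <- eigen(Sigma, symmetric = TRUE)

# Count sign changes along the line for the first four eigenvectors.
sign_changes <- function(v) sum(diff(sign(v)) != 0)
changes <- apply(ev$vectors[, 1:4], 2, sign_changes)

# Plot the first four eigenvectors against position.
matplot(ev$vectors[, 1:4], type = "l", lty = 1,
        xlab = "position", ylab = "loading")
```

The eigenvectors come out as smooth waves of increasing frequency along the line, even though nothing like a migration event was put into the covariance.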

## Genetic variation

Posted in genetics, statistics by mikelove on September 4, 2010

I hadn’t seen these numbers on genetic variation within and between populations before. From a 2004 paper, Genetic variation, classification and ‘race’:

The average proportion of nucleotide differences between a randomly chosen pair of humans (i.e., average nucleotide diversity, or pi) is consistently estimated to lie between 1 in 1,000 and 1 in 1,500 (refs. 9,10). This proportion is low compared with those of many other species, from fruit flies to chimpanzees (refs. 11,12), reflecting the recent origin of our species from a small founding population (ref. 13). The pi-value for Homo sapiens can be put into perspective by considering that humans differ from chimpanzees at only 1 in 100 nucleotides, on average (refs. 14,15). Because there are approximately three billion nucleotide base pairs in the haploid human genome, each pair of humans differs, on average, by two to three million base pairs.

Of the 0.1% of DNA that varies among individuals, what proportion varies among main populations? Consider an apportionment of Old World populations into three continents (Africa, Asia and Europe), a grouping that corresponds to a common view of three of the ‘major races’ (refs. 16,17). Approximately 85–90% of genetic variation is found within these continental groups, and only an additional 10–15% of variation is found between them (refs. 18–20) (Table 1). In other words, ~90% of total genetic variation would be found in a collection of individuals from a single continent, and only ~10% more variation would be found if the collection consisted of Europeans, Asians and Africans. The proportion of total genetic variation ascribed to differences between continental populations, called FST, is consistent, regardless of the type of autosomal loci examined (Table 1). FST varies, however, depending on how the human population is divided. If four Old World populations (European, African, East Asian and Indian subcontinent) are examined instead of three, FST (estimated for 100 Alu element insertion polymorphisms) decreases from 14% to 10% (ref. 21). These estimates of FST and pi tell us that humans vary only slightly at the DNA level and that only a small proportion of this variation separates continental populations.
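The two quantities in the passage are easy to play with in R. The first part below just multiplies out the quoted figures. The second part is a sketch only: it uses the textbook heterozygosity-based definition FST = (H_T - H_S) / H_T for a single biallelic locus, with made-up allele frequencies in three equally weighted populations, which is not necessarily the estimator used in the paper.

```r
# Pairwise differences: genome size times nucleotide diversity (pi),
# using the figures quoted above.
genome_size <- 3e9
pi_range <- c(1 / 1500, 1 / 1000)
pairwise_diff <- genome_size * pi_range  # roughly two to three million

# F_ST for one biallelic locus, from hypothetical allele frequencies
# in three equally weighted populations.
p <- c(0.2, 0.5, 0.6)

# Mean expected heterozygosity within populations (H_S).
h_within <- mean(2 * p * (1 - p))

# Expected heterozygosity of the pooled population (H_T).
p_bar <- mean(p)
h_total <- 2 * p_bar * (1 - p_bar)

# Proportion of total variation due to differences between populations.
fst <- (h_total - h_within) / h_total
```

With these invented frequencies FST comes out at about 0.12, so roughly 88% of the variation at this toy locus is within populations.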

## Federal judge strikes down gene patents

Posted in genetics by mikelove on March 30, 2010

From the NYT:

A federal judge on Monday struck down patents on two genes linked to breast and ovarian cancer. The decision, if upheld, could throw into doubt the patents covering thousands of human genes and reshape the law of intellectual property.

Judge Sweet, however, ruled that the patents were “improperly granted” because they involved a “law of nature.” He said that many critics of gene patents considered the idea that isolating a gene made it patentable “a ‘lawyer’s trick’ that circumvents the prohibition on the direct patenting of the DNA in our bodies but which, in practice, reaches the same result.”

The case could have far-reaching implications. About 20 percent of human genes have been patented, and multibillion-dollar industries have been built atop the intellectual property rights that the patents grant.

The pro-patenting argument:

Edward Reines, a patent lawyer who represents biotechnology firms but was not involved in the case, said loss of patent protection could diminish the incentives for genetic research.

“The genetic tools to solve the major health problems of our time have not been found yet,” said Mr. Reines, who is with the Silicon Valley office of the firm Weil, Gotshal & Manges. “These are the discoveries we want to motivate by providing incentives to all the researchers out there.”

I’m very skeptical that biotech firms need to patent genes to have an incentive for research. Firms and schools should be content to patent novel inventions and processes related to the study of genetics, not correlations between genes and diseases.

Also, a WSJ article.

The judge’s 152-page decision. (pdf)

## NYT on individual whole genomes

Posted in genetics, statistics by mikelove on March 11, 2010

The New York Times has an article, Disease Cause Is Pinpointed With Genome, which is a good overview of the status of whole genome sequencing for disease research.

Besides identifying disease genes, one team, in Seattle, was able to make the first direct estimate of the number of mutations, or changes in DNA, that are passed on from parent to child. They calculate that of the three billion units in the human genome, 60 per generation are changed by random mutation — considerably less than previously thought.

That study is by Roach et al. in Science magazine.

On genome-wide association studies:

And in most diseases the culprit DNA was linked to only a small portion of all the cases of the disease. It seemed that natural selection has weeded out any disease-causing mutation before it becomes common. The finding implies that common diseases, surprisingly, are caused by rare, not common, mutations.

…implying we need to do more fine-grained studies of genomes. On the cost of whole genome sequencing:

The family whose genomes they report in Science were sequenced by a company with a new DNA sequencing method, Complete Genomics of Mountain View, Calif., at a cost of \$25,000 each. Clifford Reid, the chief executive, said that the company was scaling up to sequence 500 genomes a month and that for large projects the price per genome would soon drop below \$10,000. “We are on our way to the \$5,000 genome,” he said.