Mike Love’s blog

duplicated() in R

Posted in statistics by mikelove on January 20, 2012

I have three vectors, and I want to use duplicated() to find the indices which are duplicates in all three vectors. The help warns that the function is potentially slow for lists.

The first two vectors are integers, less than 1e9 and less than 1e4, and the last one is a factor with three levels.


> system.time({duplicated(cbind(pos,width,strand))})
user system elapsed
37.034 0.000 37.042
> system.time({duplicated(data.frame(pos,width,strand))})
user system elapsed
5.142 0.001 5.145
> system.time({duplicated(paste(pos,width,strand,collapse=":"))})
user system elapsed
3.297 0.000 3.297
> system.time({duplicated(pos + width/1e5 + as.numeric(strand)/10)})
user system elapsed
0.276 0.002 0.278

The Anatomy of Influence

Posted in influence by mikelove on January 13, 2012

From the NYTimes book review of The Anatomy of Influence: Literature as a Way of Life by Harold Bloom:

“The critic’s role in all this was to map the secret genealogy, uncovering the true ancestor of the belated poet, difficult to do because strong poets ingeniously masked or concealed their actual influences.”

see also:

Bloom’s The Anxiety of Influence
Genealogy of Influence
Jonathan Lethem’s Ecstasy of Influence: A Plagiarism

R: table to HTML with dimension names

Posted in statistics by mikelove on December 16, 2011

The contingency tables printed by R nicely include the dimension names above the factor levels in the first row and first column. Here I have to replace spaces with dots to show how it is printed in the console.


shape=factor(c("round","round","round","spiky"))
color=factor(c("blue","red","blue","red"))
table(shape,color)

.......color
shape...blue.red
..round....2...1
..spiky....0...1

But if you try to print this with the xtable library, you lose the dimension names. This was mentioned also in this thread.


library(xtable)
print(xtable(table(shape,color)),"html")





blue red
round 2 1
spiky 0 1

I wrote a hack to force these dimension names into the resulting table. It’s ugly but it appears to work.


crazytable <- function(x,y,prop=0) {
if (prop == 0) {
xt <- as.data.frame(xtable(table(x,y)))
} else {
xt <- as.data.frame(xtable(prop.table(table(x,y),prop)))
}
cn <- colnames(xt)
rn <- rownames(xt)
t.out <- as.matrix(cbind(c("","",rn),rbind(rep("",ncol(xt)),cn,xt)))
colnames(t.out) <- 1:ncol(t.out)
rownames(t.out) <- 1:nrow(t.out)
t.out[2,1] <- deparse(substitute(x))
t.out[1,2] <- deparse(substitute(y))
t.out
}

print(xtable(crazytable(shape,color)),"html",include.r=FALSE,include.c=FALSE)






color
shape blue red
round 2 1
spiky 0 1

Here prop allows for proportion tables along rows (1) or columns (2).

2-mers

Posted in genetics by mikelove on December 2, 2011

Here are the frequencies of 2-mers in the human genome (hg19).

(obtained using the count-words program of RSAT)

One line stands out due to a historic accumulation of certain mutations, called CG suppression.

seq identifier observed_freq occ
aa aa 0.0977693510124 279490734
ac ac 0.0503391220503 143903156
ag ag 0.0699208325000 199880889
at at 0.0772705679279 220891389
ca ca 0.0725344058342 207352244
cc cc 0.0520831825569 148888857
cg cg 0.0098517609035 28162976
ct ct 0.0699588753085 199989641
ga ga 0.0593289285247 169602085
gc gc 0.0426523912572 121929296
gg gg 0.0521099551064 148965391
gt gt 0.0504530230976 144228762
ta ta 0.0656671783246 187721077
tc tc 0.0593535812105 169672559
tg tg 0.0726616984034 207716132
tt tt 0.0980451459821 280279142

Standard deviation

Posted in statistics by mikelove on November 1, 2011

I should really memorize this to baffle someone in the future:

E(s^2) = \sigma^2
E(s) =  \sigma [ 1 - \frac{1}{4n} - \frac{7}{32n^2} - \frac{19}{128n^3} + O(n^{-4}) ]

unbiased estimation of standard deviation

Log probabilities trick

Posted in statistics by mikelove on June 6, 2011

It is often the case in calculating likelihoods, that the probabilities can be too small for computational stability. Then it makes sense to work with the log of probabilities. However here one will immediately encounter another problem: how to get log(p + q) from log(p) and log(q)?

Richard Durbin et al. explain a trick for this in Biological sequence analysis: Probabilistic models of proteins and nucleic acids (1998), section 3.6.

\log(p + q) = \log(p ( 1 + \frac{q}{p}))
= \log(p) + \log(1 + \exp(\log(\frac{q}{p})))
= \log(p) + \log(1 + \exp(\log(q) - \log(p)))

Then if p is chosen as the larger of p and q, Durbin et al. argue that using a table of interpolations for calculating log(1 + exp(x)) gives very close estimates for log(p + q).

R one liner: Correlation matrices

Posted in statistics, visualization by mikelove on March 31, 2011

I have seen many plots of correlation matricies using rainbow or heat colors and therefore not indicating the zero crossing. E.g.

Instead I would like to see this:

library(fields)
cormat = cor(X)
image.plot(cormat,zlim=c(-1,1),col=colorRampPalette(c("red","white","green"))(49))

Follow

Get every new post delivered to your Inbox.