# Mike Love’s blog

## Sparse matrix operations in python and R

Posted in math, statistics by mikelove on April 16, 2012

I’m comparing two libraries for working with sparse matrices: scipy.sparse for python and Matrix for R.

I create a matrix of 10 million rows x 100 columns and fill it at 100 million random positions with standard normals (positions can overlap).

Python wins at matrix creation, taking row means, slicing out 25 columns, and binding the columns back together; the two are about equal for multiplication (XtX).

> python sparse.py -m 1e7 -n 1e2 -l 1e8
create: 19.51
multiply: 36.08
row means: 2.27
column slices: 4.88
hstack: 1.56

> R --no-save --slave '--args m=1e7 n=1e2 l=1e8' < sparse.R
create: 40.548
multiply: 40.034
row means: 9.327
column slices: 6.755
cbind: 20.167
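The Python side might look roughly like this. This is a minimal sketch at toy sizes, not the original sparse.py; the function and variable names are mine, and I assume the script builds a COO matrix from random triplets and converts it for the later operations:

```python
import numpy as np
from scipy import sparse

def make_sparse(m, n, l, rng):
    # drop l values at random (row, col) positions; duplicate
    # positions are summed when the COO matrix is converted to CSC
    rows = rng.integers(0, m, size=l)
    cols = rng.integers(0, n, size=l)
    vals = rng.standard_normal(l)
    return sparse.coo_matrix((vals, (rows, cols)), shape=(m, n)).tocsc()

rng = np.random.default_rng(0)
X = make_sparse(1000, 10, 5000, rng)  # small stand-in for m=1e7, n=1e2, l=1e8

XtX = X.T @ X                                    # multiply
row_means = np.asarray(X.mean(axis=1)).ravel()   # row means
Y = X[:, :5]                                     # column slice
Z = sparse.hstack([Y, Y])                        # bind columns back together
```

Timing each of these lines at the full sizes reproduces the comparison above.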

(more…)

## For-loops and vectorized operations in Julia and R

Posted in statistics by mikelove on April 9, 2012

After seeing a lot of posts about it recently, I downloaded and tried out Julia, described by its creators as a “high-level, high-performance dynamic programming language for technical computing”. I am also interested in comparing it to the Rcpp/inline packages, which I have had good experience with, and which offer all the comforts of the R environment (reading, writing, plotting).
Here are my examples for a purposely slow for-loop and a vectorized form in R:

########################################
picalc <- function(n) {
  z <- numeric(n)
  for (i in 1:n) {
    x <- runif(2)
    if (sum(x^2) < 1) z[i] <- 4
  }
  mean(z)
}

picalcvec <- function(n) {
  x <- runif(n)
  y <- runif(n)
  4 * sum(x^2 + y^2 < 1) / n
}

print("for loop")
picalc(1e6)
system.time({picalc(1e6)})
print("vectorized")
picalcvec(1e7)
system.time({picalcvec(1e7)})
########################################
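For another point of comparison, here is my own sketch of the same two estimators in Python, which is not from the original post; numpy plays the role of R's vectorized operations:

```python
import random
import numpy as np

def picalc(n):
    # purposely slow for-loop: count draws inside the quarter circle
    hits = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y < 1:
            hits += 1
    return 4 * hits / n

def picalc_vec(n):
    # vectorized form: one boolean mask over n draws
    x = np.random.rand(n)
    y = np.random.rand(n)
    return 4 * np.count_nonzero(x * x + y * y < 1) / n
```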

and in Julia:
(more…)

## Combining p-values: Fisher’s method, sum of p-values, binomial

Posted in statistics by mikelove on March 12, 2012

I’ve been thinking about methods for combining p-values from independent tests. Fisher’s method is the classic one, but in general it helps to think about the problem in geometric terms. A set of n p-values is a point in an n-dimensional cube. Under the null hypothesis, these points are distributed uniformly in the cube. We want a function F that maps points in the cube to [0,1], with the requirement that the region satisfying $F(x) < z$ has volume z, so that the combined value is itself a valid p-value. We are mostly interested in the points near the origin, i.e. sets of small p-values. The question is how to specify which points count as close to the origin.

[updated: a paper on this topic: Art Owen, “Karl Pearson’s Meta-Analysis Revisited”, The Annals of Statistics (2009) (arxiv.org, stanford.edu)]

Fisher’s method calculates the product of p-values, then takes the logarithm and multiplies by -2. Under the null hypothesis, we have:

$-2 \sum_{i=1}^{k} \log(p_i) \sim \chi_{2k}^{2}$

The Wikipedia page explains that the negative logarithm of a uniform random variable follows an exponential distribution; scaled by 2, this is a chi-squared distribution with 2 degrees of freedom, and a sum of independent chi-squared random variables is again chi-squared, with the degrees of freedom summed. If the p-values are lower than expected under the null hypothesis, the combined statistic from Fisher’s method will land in the upper tail. We can write this method in R:

fishersMethod <- function(x) pchisq(-2 * sum(log(x)), df = 2 * length(x), lower.tail = FALSE)
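As a quick cross-check outside R, the same calculation in Python; scipy also ships this test as scipy.stats.combine_pvalues, so the hand-rolled version (my own function name) can be compared against it:

```python
import numpy as np
from scipy import stats

def fishers_method(p):
    # -2 * sum(log p_i) ~ chi-squared with 2k df under the null;
    # the combined p-value is the upper-tail probability
    p = np.asarray(p, dtype=float)
    statistic = -2 * np.log(p).sum()
    return stats.chi2.sf(statistic, df=2 * len(p))

# scipy's built-in version of the same test
stat, combined = stats.combine_pvalues([0.01, 0.2, 0.3], method="fisher")
```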

I was also thinking of summing the p-values, which behaves differently from the log of the product. As described by John Cook, the sum of n uniform random variables follows the Irwin-Hall distribution, whose law can be expressed fairly simply. The cumulative distribution function is as follows (see John Cook’s blog for the density and a reference):

$P(X \le x) = \frac{1}{n!} \sum_{k=0}^{n} (-1)^k \binom{n}{k} (x-k)_{+}^{n}$

And by the Central Limit Theorem, as the number of uniforms increases, the sum converges in distribution to a normal with mean n/2 and variance n/12. So we can also use this law to assign a combined p-value to a set of p-values. If the p-values are lower than expected under the null, the sum will show up in the lower tail of the distribution. This CDF can be written as a short function as well.
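Here is a sketch of that CDF and the resulting combining rule in Python (the function names are my own):

```python
import math

def irwin_hall_cdf(x, n):
    # P(X <= x) for the sum of n iid Uniform(0,1) variables;
    # the (x-k)_+ positive part means the sum stops at floor(x)
    total = 0.0
    for k in range(0, math.floor(x) + 1):
        total += (-1) ** k * math.comb(n, k) * (x - k) ** n
    return total / math.factorial(n)

def sum_method(pvalues):
    # combined p-value: lower tail of the Irwin-Hall law at sum(p)
    return irwin_hall_cdf(sum(pvalues), len(pvalues))
```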

## duplicated() in R

Posted in statistics by mikelove on January 20, 2012

I have three vectors, and I want to use duplicated() to find the indices that are duplicated across all three vectors at once. The help page warns that the function is potentially slow for lists.

The first two vectors are integers, less than 1e9 and less than 1e4, and the last one is a factor with three levels.

> system.time({duplicated(cbind(pos,width,strand))})
user system elapsed
37.034 0.000 37.042
> system.time({duplicated(data.frame(pos,width,strand))})
user system elapsed
5.142 0.001 5.145
> system.time({duplicated(paste(pos,width,strand,sep=":"))})
user system elapsed
3.297 0.000 3.297
> system.time({duplicated(pos + width/1e5 + as.numeric(strand)/10)})
user system elapsed
0.276 0.002 0.278
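The underlying idea of the faster variants is to collapse the three vectors into a single key per index. A sketch of the same technique in Python, using tuples as the combined keys (my own function, mirroring R's duplicated() semantics):

```python
def duplicated(*vectors):
    # True for entries whose combined key has appeared at an
    # earlier index, like R's duplicated() on parallel vectors
    seen = set()
    flags = []
    for key in zip(*vectors):
        flags.append(key in seen)
        seen.add(key)
    return flags

pos = [1, 2, 1, 3]
width = [10, 20, 10, 30]
strand = ["+", "-", "+", "+"]
flags = duplicated(pos, width, strand)  # third entry repeats the first
```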

## The Anatomy of Influence

Posted in influence by mikelove on January 13, 2012

From the NYTimes book review of The Anatomy of Influence: Literature as a Way of Life by Harold Bloom:

“The critic’s role in all this was to map the secret genealogy, uncovering the true ancestor of the belated poet, difficult to do because strong poets ingeniously masked or concealed their actual influences.”

- Bloom’s The Anxiety of Influence
- Genealogy of Influence
- Jonathan Lethem’s Ecstasy of Influence: A Plagiarism

## R: table to HTML with dimension names

Posted in statistics by mikelove on December 16, 2011

The contingency tables printed by R nicely include the dimension names above the factor levels, in the first row and first column. Here I have replaced spaces with dots to show how the table prints in the console.

shape <- factor(c("round", "round", "round", "spiky"))
color <- factor(c("blue", "red", "blue", "red"))
table(shape, color)

.......color
shape...blue.red
..round....2...1
..spiky....0...1
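For getting such a table into HTML with the dimension names intact, one route outside R is pandas, where crosstab's rownames/colnames play the role of R's dimnames and carry through to_html(). A sketch, not the method from the post:

```python
import pandas as pd

shape = pd.Series(["round", "round", "round", "spiky"])
color = pd.Series(["blue", "red", "blue", "red"])

# dimension names sit above the row and column levels, as in R
tab = pd.crosstab(shape, color, rownames=["shape"], colnames=["color"])
html = tab.to_html()  # the names survive into the HTML header cells
```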

## 2-mers

Posted in genetics by mikelove on December 2, 2011

Here are the frequencies of 2-mers in the human genome (hg19).

(obtained using the count-words program of RSAT)

One line stands out, due to a historical accumulation of certain mutations known as CG (CpG) suppression.

seq identifier observed_freq occ
aa aa 0.0977693510124 279490734
ac ac 0.0503391220503 143903156
ag ag 0.0699208325000 199880889
at at 0.0772705679279 220891389
ca ca 0.0725344058342 207352244
cc cc 0.0520831825569 148888857
cg cg 0.0098517609035 28162976
ct ct 0.0699588753085 199989641
ga ga 0.0593289285247 169602085
gc gc 0.0426523912572 121929296
gg gg 0.0521099551064 148965391
gt gt 0.0504530230976 144228762
ta ta 0.0656671783246 187721077
tc tc 0.0593535812105 169672559
tg tg 0.0726616984034 207716132
tt tt 0.0980451459821 280279142
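The counts above come from RSAT's count-words program; as a toy illustration of what it computes, overlapping 2-mer frequencies can be tallied in a few lines of Python (my own sketch, fine for a single sequence but not tuned for a whole genome):

```python
from collections import Counter

def kmer_freqs(seq, k=2):
    # sliding-window counts of overlapping k-mers, normalized to frequencies
    seq = seq.lower()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}
```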