Mike Love’s blog

Sparse matrix operations in python and R

Posted in math, statistics by mikelove on April 16, 2012

I’m comparing two libraries for working with sparse matrices: scipy.sparse for python and Matrix for R.

I create a matrix of 10 million rows x 100 columns, filled randomly with 100 million standard normals (which overlap).

Python wins matrix creation, they are about equal for multiplication (XtX), python wins also taking row means, slicing 25 columns and then binding the columns together.
.
.

> python sparse.py -m 1e7 -n 1e2 -l 1e8
create: 19.51
multiply: 36.08
row means: 2.27
column slices: 4.88
hstack: 1.56

> R --no-save --slave '--args m=1e7 n=1e2 l=1e8' < sparse.R
create: 40.548
multipy: 40.034
row means: 9.327
column slices: 6.755
cbind: 20.167

.
.
(more…)

Beamer: serif font for math only

Posted in math by mikelove on March 16, 2011

Just what I’ve been looking for…

\documentclass[onlymath]{beamer}
\usefonttheme{serif}

Independence testing

Posted in math, statistics by mikelove on December 10, 2010

Measuring and testing dependence by correlation of distances

Distance correlation is a new measure of dependence between random vectors. Distance covariance and distance correlation are analogous to product-moment covariance and correlation, but unlike the classical definition of correlation, distance correlation is zero only if the random vectors are independent. The empirical distance dependence measures are based on certain Euclidean distances between sample elements rather than sample moments, yet have a compact representation analogous to the classical covariance and correlation. Asymptotic properties and applications in testing independence are discussed. Implementation of the test and Monte Carlo results are also presented.

R/MATLAB package: energy

http://arxiv.org/abs/1010.0297

A smoothed bootstrap test for independence based on mutual information

A test for independence of multivariate time series based on the mutual information measure is proposed. First of all, a test for independence between two variables based on i.i.d. (time-independent) data is constructed and is then extended to incorporate higher dimensions and strictly stationary time series data. The smoothed bootstrap method is used to estimate the null distribution of mutual information. The experimental results reveal that the proposed smoothed bootstrap test performs better than the existing tests and can achieve high powers even for moderate dependence structures. Finally, the proposed test is applied to assess the actual independence of components obtained from independent component analysis (ICA).

Entropy

Posted in math by mikelove on October 18, 2010

Reading about mutual information between amino acids in protein-protein interaction pairs, and got distracted:

Although not particularly obvious from this equation, H(X) has a very concrete interpretation: Suppose x is chosen randomly from the distribution P_X(x), and someone who knows the distribution P_X(x) is asked to guess which x was chosen by asking only yes/no questions. If the guesser uses the optimal question-asking strategy, which is to divide the probability in half on each guess by asking questions like “is x greater than x_0?”, then the average number of yes/no questions it takes to guess x lies between H(X) and H(X)+1 (Cover and Thomas, 1991). This gives quantitative meaning to “uncertainty”: it is the number of yes/no questions it takes to guess a random variables, given knowledge of the underlying distribution and taking the optimal question-asking strategy.

Scholarpedia: Mutual information

Measure of structure in residual matrix

Posted in math, statistics, visualization by mikelove on September 14, 2009

David Skillicorn’s book Understanding Complex Datasets presents a method for picking k: the number of meaningful components from a principal components analysis / SVD.  I tried implementing the algorithm for a simulated dataset.  I generate n observations from k components of differing size in p dimensions, then add noise.

The scree plot is a plot of the variance explained by each next principal component / reduced rank approximation of the original data.  This is by definition a decreasing plot, and it is suggested to look for a flattening of the curve to indicate the point at which the remaining components are noise.

The method mentioned in Skillicorn is to look at the residual matrix after subtracting away a reduced rank approximation of the data.  This matrix either still has some structure from remaining components or is just noise.  If it is just noise, then swapping the signs of the entries should not change the 2-norm (the largest singular value) very much (I swapped the signs 10 times and took the average of all the 2-norms).  The proposed measure of residual structure is the 2-norm of the residual matrix minus the 2-norm of the altered matrix over the Frobenius norm of the residual matrix.  I then take the log of this, so the minimum is easier to see in a plot.

Here is the R code.

Here’s an easy one:

svd.structure2

Here’s a plot of one instance, where the true number of components is 15, but with lots of noise added which overpowers the smaller components.  One nice property is that it is fairly convex, because the Frobenius norm of the residual matrix in the denominator goes to zero.

svd.structure

N-ball

Posted in math, statistics by mikelove on July 9, 2008

This is a bizarre feature of geometery that I forgot about: the volume of a the unit n-dimensional ball tends to zero as n increases past 5.  A unit n-ball (or hypersphere) is the set of points that can be reached by going 1 unit in any direction.  I think this is really counter-intuitive.

Image from Wolfram on ball.

Wikipedia on the curse of dimensionality says:

Thus, in some sense, nearly all of the high-dimensional space is “far away” from the centre, or, to put it another way, the high-dimensional unit space can be said to consist almost entirely of the “corners” of the hypercube, with almost no “middle”. (This is an important intuition for understanding the chi-squared distribution.)

Prime spirals, Hahn and Sacks

Posted in math, visualization by mikelove on June 28, 2008

This arxiv PDF has to win the insane diagram award for 2008.  Robert Sacks of the Sacks prime spiral and Harry Hahn point out some curiosities in the distribution of primes.  Lots of triple question marks:

Follow

Get every new post delivered to your Inbox.