I create a matrix of 10 million rows x 100 columns, filled by scattering 100 million standard normals at random positions (some of which overlap).
Python wins at matrix creation; the two are roughly equal for the multiplication (XtX); Python also wins at taking row means, slicing 25 columns, and binding the columns back together.
> python sparse.py -m 1e7 -n 1e2 -l 1e8
create: 19.51
multiply: 36.08
row means: 2.27
column slices: 4.88
hstack: 1.56

> R --no-save --slave '--args m=1e7 n=1e2 l=1e8' < sparse.R
create: 40.548
multiply: 40.034
row means: 9.327
column slices: 6.755
cbind: 20.167
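A rough sketch of what the Python side of such a benchmark might look like, assuming scipy.sparse was used (the original sparse.py is not shown, so the variable names and the scaled-down sizes here are mine):

```python
import numpy as np
from scipy import sparse

# Scaled-down sizes so the sketch runs quickly; the post used m=1e7, n=1e2, l=1e8.
m, n, l = 10_000, 100, 100_000

# create: scatter l standard normals at random (possibly overlapping) positions;
# converting COO -> CSR sums any duplicates, which is how overlaps combine.
rows = np.random.randint(0, m, size=l)
cols = np.random.randint(0, n, size=l)
vals = np.random.standard_normal(l)
X = sparse.coo_matrix((vals, (rows, cols)), shape=(m, n)).tocsr()

XtX = X.T @ X                  # multiply: the n x n cross-product
row_means = X.mean(axis=1)     # row means
S = X[:, :25]                  # slice 25 columns
H = sparse.hstack([X, X])      # bind the columns together
```
Wrapping each step in a timer would reproduce the create / multiply / row means / column slices / hstack lines above.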
Just what I’ve been looking for…
Measuring and testing dependence by correlation of distances
Distance correlation is a new measure of dependence between random vectors. Distance covariance and distance correlation are analogous to product-moment covariance and correlation, but unlike the classical definition of correlation, distance correlation is zero only if the random vectors are independent. The empirical distance dependence measures are based on certain Euclidean distances between sample elements rather than sample moments, yet have a compact representation analogous to the classical covariance and correlation. Asymptotic properties and applications in testing independence are discussed. Implementation of the test and Monte Carlo results are also presented.
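The empirical measure the abstract describes can be written compactly with double-centered distance matrices. A minimal NumPy sketch for 1-D samples (the `energy` package is the reference implementation; this illustration and its function name are mine):

```python
import numpy as np

def dcor(x, y):
    """Empirical distance correlation of two 1-D samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])   # pairwise Euclidean distances
    b = np.abs(y[:, None] - y[None, :])
    # double-center each distance matrix
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                # squared distance covariance
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

rng = np.random.default_rng(0)
x, y = rng.standard_normal(200), rng.standard_normal(200)
print(dcor(x, x))  # exactly 1 for fully dependent samples
print(dcor(x, y))  # small for independent samples
```
Unlike Pearson correlation, this stays away from zero for nonlinear dependence, e.g. `dcor(x, x**2)` is clearly positive even though the linear correlation is near zero.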
R/MATLAB package: energy
A smoothed bootstrap test for independence based on mutual information
A test for independence of multivariate time series based on the mutual information measure is proposed. First of all, a test for independence between two variables based on i.i.d. (time-independent) data is constructed and is then extended to incorporate higher dimensions and strictly stationary time series data. The smoothed bootstrap method is used to estimate the null distribution of mutual information. The experimental results reveal that the proposed smoothed bootstrap test performs better than the existing tests and can achieve high powers even for moderate dependence structures. Finally, the proposed test is applied to assess the actual independence of components obtained from independent component analysis (ICA).
Reading about mutual information between amino acids in protein-protein interaction pairs, and got distracted:
Although not particularly obvious from this equation, H(X) has a very concrete interpretation: suppose x is chosen randomly from the distribution P(X), and someone who knows the distribution P(X) is asked to guess which x was chosen by asking only yes/no questions. If the guesser uses the optimal question-asking strategy, which is to divide the probability in half on each guess by asking questions like "is x greater than a?", then the average number of yes/no questions it takes to guess x lies between H(X) and H(X) + 1 (Cover and Thomas, 1991). This gives quantitative meaning to "uncertainty": it is the number of yes/no questions it takes to guess a random variable, given knowledge of the underlying distribution and taking the optimal question-asking strategy.
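The correspondence is easy to check on a small dyadic distribution, where the halving strategy works out exactly (the distribution below is an illustrative example, not from the paper):

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]
H = entropy(p)
print(H)  # 1.75 bits

# Halving the probability mass on each question resolves the four outcomes
# in 1, 2, 3, and 3 questions respectively, so the expected count is:
avg_questions = 0.5 * 1 + 0.25 * 2 + 0.125 * 3 + 0.125 * 3
print(avg_questions)  # 1.75, exactly H(X) for this dyadic case
```
For non-dyadic distributions the questions can't split the mass exactly in half, which is where the H(X) to H(X) + 1 slack comes from.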
David Skillicorn’s book Understanding Complex Datasets presents a method for picking k: the number of meaningful components from a principal components analysis / SVD. I tried implementing the algorithm for a simulated dataset. I generate n observations from k components of differing size in p dimensions, then add noise.
The scree plot shows the variance explained by each successive principal component / reduced-rank approximation of the original data. This is by definition a decreasing plot, and the usual suggestion is to look for a flattening of the curve to indicate the point at which the remaining components are noise.
The method mentioned in Skillicorn is to look at the residual matrix after subtracting away a reduced-rank approximation of the data. This matrix either still has some structure from remaining components or is just noise. If it is just noise, then randomly swapping the signs of the entries should not change the 2-norm (the largest singular value) very much (I swapped the signs 10 times and took the average of all the 2-norms). The proposed measure of residual structure is the 2-norm of the residual matrix minus the 2-norm of the sign-swapped matrix, divided by the Frobenius norm of the residual matrix. I then take the log of this, so the minimum is easier to see in a plot.
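My attempt at the procedure can be sketched in a few lines of NumPy. This is my reading of the book's method, not its code; the function and variable names are mine, and the simulated data is just for illustration:

```python
import numpy as np

def structure_scores(X, k_max, n_swaps=10, seed=0):
    """Log sign-swap structure measure of the rank-k residual, for k = 1..k_max."""
    rng = np.random.default_rng(seed)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = []
    for k in range(1, k_max + 1):
        R = X - (U[:, :k] * s[:k]) @ Vt[:k]      # residual after rank-k approximation
        # average 2-norm over random sign flips of the residual's entries
        flipped = np.mean([
            np.linalg.norm(R * rng.choice([-1.0, 1.0], size=R.shape), 2)
            for _ in range(n_swaps)
        ])
        # log of (||R||_2 - avg sign-flipped 2-norm) / ||R||_F
        scores.append(np.log(abs(np.linalg.norm(R, 2) - flipped)
                             / np.linalg.norm(R, "fro")))
    return scores

# simulated data: 5 true components plus noise; the minimum should sit near k = 5
rng = np.random.default_rng(1)
X = 3 * rng.standard_normal((200, 5)) @ rng.standard_normal((5, 50))
X += rng.standard_normal((200, 50))
print(structure_scores(X, 10))
```
While signal components remain in the residual, sign-flipping destroys their alignment and the difference in 2-norms is large; once the residual is pure noise the flip barely matters and the score drops.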
Here’s an easy one:
Here’s a plot of one instance, where the true number of components is 15, but with lots of noise added which overpowers the smaller components. One nice property is that the curve is fairly convex: past the true k the Frobenius norm of the residual matrix in the denominator goes to zero, which pushes the measure back up.
This is a bizarre feature of geometry that I forgot about: the volume of the unit n-dimensional ball peaks at n = 5 and then tends to zero as n increases. A unit n-ball (or hypersphere) is the set of points within 1 unit of the origin in any direction. I think this is really counter-intuitive.
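The closed form makes this easy to check: the unit n-ball has volume V(n) = pi^(n/2) / Gamma(n/2 + 1), a standard formula. A quick sketch:

```python
import math

def ball_volume(n):
    """Volume of the unit n-ball: pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

for n in range(1, 11):
    print(n, round(ball_volume(n), 4))
# V(1) = 2, V(2) = pi ~ 3.1416, V(3) ~ 4.1888,
# V(5) ~ 5.2638 (the maximum), then it shrinks: V(10) ~ 2.5502
```
Intuitively, the Gamma function in the denominator grows faster than pi^(n/2) in the numerator, so the volume is eventually crushed toward zero.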
[Image: volume of the unit n-ball against n, from Wolfram MathWorld's article on the ball.]
Wikipedia on the curse of dimensionality says:
Thus, in some sense, nearly all of the high-dimensional space is “far away” from the centre, or, to put it another way, the high-dimensional unit space can be said to consist almost entirely of the “corners” of the hypercube, with almost no “middle”. (This is an important intuition for understanding the chi-squared distribution.)