Working on our PH525x online course material, Rafa and I wanted to base all lecture material in Rmd files, as these are easy for students to load into RStudio to walk through the code. Additionally, the rendered markdown files can be displayed nicely on GitHub Pages (which uses Jekyll to turn markdown in HTML). All we needed to do is copy the markdown files into the /pages directory and they already look pretty great when viewed on github.com (also we borrowed the theme from Karl Broman’s simple site template). By adding MathJax to the header of the GitHub Page template, we also render latex math formula from the original Rmd files.
Below I go over the issues we encountered with including math, but you can jump to the bottom if you just want our solution.
Continue reading “How to use latex math in Rmd to display properly on GitHub Pages”
I’m comparing two libraries for working with sparse matrices: scipy.sparse for python and Matrix for R.
I create a matrix of 10 million rows x 100 columns, filled randomly with 100 million standard normals (which overlap).
Python wins matrix creation, they are about equal for multiplication (XtX), python wins also taking row means, slicing 25 columns and then binding the columns together.
> python sparse.py -m 1e7 -n 1e2 -l 1e8
row means: 2.27
column slices: 4.88
> R --no-save --slave '--args m=1e7 n=1e2 l=1e8' < sparse.R
row means: 9.327
column slices: 6.755
Continue reading “Sparse matrix operations in python and R”
Just what I’ve been looking for…
Measuring and testing dependence by correlation of distances
Distance correlation is a new measure of dependence between random vectors. Distance covariance and distance correlation are analogous to product-moment covariance and correlation, but unlike the classical definition of correlation, distance correlation is zero only if the random vectors are independent. The empirical distance dependence measures are based on certain Euclidean distances between sample elements rather than sample moments, yet have a compact representation analogous to the classical covariance and correlation. Asymptotic properties and applications in testing independence are discussed. Implementation of the test and Monte Carlo results are also presented.
R/MATLAB package: energy
A smoothed bootstrap test for independence based on mutual information
A test for independence of multivariate time series based on the mutual information measure is proposed. First of all, a test for independence between two variables based on i.i.d. (time-independent) data is constructed and is then extended to incorporate higher dimensions and strictly stationary time series data. The smoothed bootstrap method is used to estimate the null distribution of mutual information. The experimental results reveal that the proposed smoothed bootstrap test performs better than the existing tests and can achieve high powers even for moderate dependence structures. Finally, the proposed test is applied to assess the actual independence of components obtained from independent component analysis (ICA).
David Skillicorn’s book Understanding Complex Datasets presents a method for picking k: the number of meaningful components from a principal components analysis / SVD. I tried implementing the algorithm for a simulated dataset. I generate n observations from k components of differing size in p dimensions, then add noise.
The scree plot is a plot of the variance explained by each next principal component / reduced rank approximation of the original data. This is by definition a decreasing plot, and it is suggested to look for a flattening of the curve to indicate the point at which the remaining components are noise.
The method mentioned in Skillicorn is to look at the residual matrix after subtracting away a reduced rank approximation of the data. This matrix either still has some structure from remaining components or is just noise. If it is just noise, then swapping the signs of the entries should not change the 2-norm (the largest singular value) very much (I swapped the signs 10 times and took the average of all the 2-norms). The proposed measure of residual structure is the 2-norm of the residual matrix minus the 2-norm of the altered matrix over the Frobenius norm of the residual matrix. I then take the log of this, so the minimum is easier to see in a plot.
Here is the R code.
Here’s an easy one:
Here’s a plot of one instance, where the true number of components is 15, but with lots of noise added which overpowers the smaller components. One nice property is that it is fairly convex, because the Frobenius norm of the residual matrix in the denominator goes to zero.
This is a bizarre feature of geometery that I forgot about: the volume of a the unit n-dimensional ball tends to zero as n increases past 5. A unit n-ball (or hypersphere) is the set of points that can be reached by going 1 unit in any direction. I think this is really counter-intuitive.
Image from Wolfram on ball.
Wikipedia on the curse of dimensionality says:
Thus, in some sense, nearly all of the high-dimensional space is “far away” from the centre, or, to put it another way, the high-dimensional unit space can be said to consist almost entirely of the “corners” of the hypercube, with almost no “middle”. (This is an important intuition for understanding the chi-squared distribution.)