# Mike Love’s blog

## Splitting data

Posted in statistics by mikelove on April 27, 2013

The caret package has a nice function for splitting data into balanced subsets. Though I don’t see why I don’t get 3 rows out of 10 in this example; the p argument is documented as “the percentage of data that goes to training”.

```
> d <- data.frame(x=rnorm(10), group=c(1,1,1,2,2,2,3,3,3,3))
> d
            x group
1   1.0089900     1
2   0.4854706     1
3   1.7083259     1
4  -1.3362274     2
5   1.4905259     2
6   1.6451234     2
7   1.0361174     3
8   0.2369341     3
9  -2.0043264     3
10  1.4361718     3
> library(caret)
> d[createDataPartition(d$group, p=3/10)$Resample1,]
            x group
3   1.7083259     1
4  -1.3362274     2
8   0.2369341     3
10  1.4361718     3
```
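One possible explanation (an assumption on my part, not something from the caret docs quoted above): createDataPartition samples within each group, and if it takes ceiling(p * n) rows from each group of size n, the group sizes 3, 3, 4 would give 1 + 1 + 2 = 4 rows rather than 3. A quick arithmetic check of that hypothetical rounding rule, in Python:

```python
import math

# Hypothetical rule: take ceiling(p * n) rows from each group of size n.
# With the example's group sizes this yields 4 rows, matching the output.
p = 3 / 10
group_sizes = [3, 3, 4]  # sizes of groups 1, 2, 3 in the example
rows = [math.ceil(p * n) for n in group_sizes]
print(rows, sum(rows))  # [1, 1, 2] 4
```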

## Pipe to Rscript

Posted in statistics by mikelove on March 20, 2013

With this, I can switch from doing simple statistics on the command line with awk to using R, which is more familiar to me:

`blah blah blah | Rscript -e 'summary(scan(file("stdin")))'`

## Points and line ranges

Posted in statistics, visualization by mikelove on January 13, 2013

Two ways of plotting a grid of points and line ranges. I’m coming around to ggplot2. I recommend skimming the first few chapters of the book to understand what is going on; it only takes about 30 minutes to learn enough to make basic plots.

```
m <- 10
k <- 3
d <- data.frame(x=factor(rep(1:k,m)), y=rnorm(m*k), z=rep(1:m,each=k))
d$ymax <- d$y + 1
d$ymin <- d$y - 1

# pretty simple
library(ggplot2)
p <- ggplot(d, aes(x=x, y=y, ymin=ymin, ymax=ymax))
p + geom_pointrange() + theme_bw() + facet_wrap(~ z)

# messy
par(mfrow=c(3,4), mar=c(3,3,2,1))
for (i in 1:m) {
  with(d[d$z == i,], {
    plot(as.numeric(x), y, main=i, xlim=c(0,k+1), ylim=c(-3,3), pch=20, xaxt="n")
    axis(1, at=1:k, labels=1:k)
    segments(as.numeric(x), ymin, as.numeric(x), ymax)
  })
}
```

## Plotting hclust

Posted in statistics, visualization by mikelove on August 8, 2012

After many years I’ve finally worked out the x and y coordinates of the points in plot.hclust.

```
hang <- 0.07
hc <- hclust(d)  # d should be a "dist" object, e.g. d <- dist(x)
plot(hc)

# heights at which each singleton is merged: singletons appear as
# negative entries in hc$merge, so -1 * those entries are leaf indices,
# and the order() call rearranges the heights so that pt.heights[i]
# is the merge height of leaf i
pt.heights <- c(hc$height[hc$merge[,1] < 0],
                hc$height[hc$merge[,2] < 0])[order(-1 * c(hc$merge[,1][hc$merge[,1] < 0],
                                                          hc$merge[,2][hc$merge[,2] < 0]))]

# x and y coordinates of the leaf points in plot.hclust
points(1:length(hc$order), pt.heights[hc$order] - hang)
```

## Block bootstrap

Posted in genetics, statistics by mikelove on July 28, 2012

In looking at sequential data (e.g. time-series or genomic data), any inference comparing different sequences needs to take into account local correlations within a sequence. For example, you might want to know how often it is raining in two cities at the same time, and whether this is more than expected by chance. But it is more likely to rain on a given day if it was raining the day before, and this dependence will change the distribution of overlap expected by chance. In stochastics, this is a question of whether the process is ‘stationary’.

One way around the problem of estimating the distribution of overlap of two processes by chance is the block bootstrap. Instead of randomly shifting features in the sequence (what I call naive permutation), you randomly build new sequences from large blocks of the original sequence. Then a distribution can be formed of the overlap of features by chance. Here is a single bootstrap sample (top sequence) constructed in this manner from the data (bottom sequence).
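The construction above can be sketched as follows (a minimal sketch in Python with numpy; the block size is a tuning choice of mine, not something specified in the post):

```python
import numpy as np

rng = np.random.default_rng(0)

def block_bootstrap(x, block_size):
    """One bootstrap sample of a sequence, assembled from randomly
    chosen contiguous blocks of the original, preserving local
    correlation within each block."""
    n = len(x)
    n_blocks = int(np.ceil(n / block_size))
    # random starting positions for each block
    starts = rng.integers(0, n - block_size + 1, size=n_blocks)
    blocks = [x[s:s + block_size] for s in starts]
    # concatenate and trim to the original length
    return np.concatenate(blocks)[:n]

x = rng.normal(size=100)
xb = block_bootstrap(x, block_size=10)
print(len(xb))  # 100
```

Repeating this many times, and recomputing the overlap statistic on each bootstrap sample, gives the null distribution.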

## PCA on training and test data

Posted in statistics by mikelove on May 29, 2012

In the past months, I have heard some talks where dimension reduction (e.g. taking the top k principal components) was applied to the full data set before splitting the data into training and test sets. My first intuition was that this kind of “peeking” at the test set would inflate the accuracy on the test set. On the other hand, one could argue that as long as the dimension reduction is unsupervised (not aware of class labels), it should make no difference. After simulating some examples, I can’t find a situation where the accuracy on a test set used in dimension reduction is inflated relative to a “doubly” held-out test set.
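A minimal sketch of the kind of simulation described, in Python with numpy (the data-generating model, sizes, and the nearest-centroid classifier are my own illustrative choices, not from the post): fit PCA either on all rows before the split (“peeking”) or on the training rows only, then compare test accuracy.

```python
import numpy as np

rng = np.random.default_rng(1)

def pca_fit(X, k):
    # principal axes via SVD of the centered matrix
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k].T

def pca_transform(X, mu, W):
    return (X - mu) @ W

def centroid_acc(Z_tr, y_tr, Z_te, y_te):
    # nearest-centroid classifier, just to score the reduced data
    c0 = Z_tr[y_tr == 0].mean(axis=0)
    c1 = Z_tr[y_tr == 1].mean(axis=0)
    pred = (np.linalg.norm(Z_te - c1, axis=1) <
            np.linalg.norm(Z_te - c0, axis=1)).astype(int)
    return (pred == y_te).mean()

n, p, k = 400, 50, 5
X = rng.normal(size=(n, p))
y = (X[:, 0] + rng.normal(scale=2.0, size=n) > 0).astype(int)
tr = np.arange(n) < 300  # simple train/test split

# "peeking": PCA fit on all rows before the split
mu, W = pca_fit(X, k)
acc_peek = centroid_acc(pca_transform(X[tr], mu, W), y[tr],
                        pca_transform(X[~tr], mu, W), y[~tr])

# clean: PCA fit on the training rows only
mu, W = pca_fit(X[tr], k)
acc_clean = centroid_acc(pca_transform(X[tr], mu, W), y[tr],
                         pca_transform(X[~tr], mu, W), y[~tr])
print(acc_peek, acc_clean)
```

Since the reduction is unsupervised, the two accuracies come out similar in simulations like this one.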

## Sparse matrix operations in python and R

Posted in math, statistics by mikelove on April 16, 2012

I’m comparing two libraries for working with sparse matrices: scipy.sparse for Python and Matrix for R.

I create a matrix of 10 million rows by 100 columns, filled with 100 million standard normals at randomly chosen positions (so some positions are drawn more than once and overlap).

Python wins at matrix creation; the two are about equal for multiplication (XtX); and Python also wins at taking row means, slicing 25 columns, and binding the columns back together.

```> python sparse.py -m 1e7 -n 1e2 -l 1e8
create: 19.51
multiply: 36.08
row means: 2.27
column slices: 4.88
hstack: 1.56

> R --no-save --slave '--args m=1e7 n=1e2 l=1e8' < sparse.R
create: 40.548
multiply: 40.034
row means: 9.327
column slices: 6.755
cbind: 20.167
```
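The sparse.py and sparse.R scripts themselves aren’t shown above; here is a minimal Python sketch of the operations being timed, using scipy.sparse at a scaled-down size (the sizes and variable names are illustrative, not the benchmark’s):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# m x n sparse matrix built from l random entries; entries landing on
# the same position are summed when the COO matrix is converted
m, n, l = 10_000, 100, 100_000
rows = rng.integers(0, m, size=l)
cols = rng.integers(0, n, size=l)
vals = rng.normal(size=l)
X = sparse.coo_matrix((vals, (rows, cols)), shape=(m, n)).tocsc()  # create

XtX = X.T @ X                        # multiply (XtX)
rmeans = np.asarray(X.mean(axis=1))  # row means
S = X[:, :25]                        # column slice
H = sparse.hstack([S, S])            # bind columns (hstack)

print(XtX.shape, rmeans.shape, S.shape, H.shape)
```

The R side would do the analogous steps with Matrix::sparseMatrix, %*%, rowMeans, column indexing, and cbind.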
