## Be precise

I’ve seen a lot of brash negativity lately on Twitter. Here are three reasons why you shouldn’t say “x sucks” or “y FAIL” on Twitter:

1. You are being sarcastic. Sarcasm doesn’t work on Twitter, and some people will think you are a jerk (so go ahead at your own risk).

2. x may be worse in a limited context, but a blanket statement comes across as if you are thinking narrowly and not like a statistician. Estimators have many different properties, e.g. the median is less efficient than the mean for normally distributed data but is more robust to outliers.

3. You could have made a mistake. Humility goes a long way for your long-term reputation as a scientist.
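Point 2 can be made concrete with a quick simulation. This is only a sketch: the sample size, the number of replicates, and the contamination scheme (one value replaced by 50) are arbitrary choices for illustration.

```r
# Compare the mean and the median as estimators of the center of a distribution.
set.seed(1)
n <- 25        # sample size (arbitrary)
nrep <- 5000   # number of simulated data sets (arbitrary)

# Clean data: N(0, 1), so both estimators target 0.
clean <- replicate(nrep, {
  x <- rnorm(n)
  c(mean = mean(x), median = median(x))
})

# Contaminated data: the same, but one observation is replaced by a gross outlier.
contaminated <- replicate(nrep, {
  x <- rnorm(n)
  x[1] <- 50
  c(mean = mean(x), median = median(x))
})

# Mean squared error around the true center, 0.
mse.clean <- rowMeans(clean^2)
mse.contaminated <- rowMeans(contaminated^2)
print(rbind(clean = mse.clean, contaminated = mse.contaminated))
```

On clean normal data the mean has the smaller error; with a single wild value the ranking flips. Neither estimator “sucks”, they just have different properties.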

Instead, state things *plainly and precisely*. I’m not saying don’t criticize. It is possible to form precise critical statements in fewer than 140 characters, or in a series of tweets.

Then there is the case where x really is not such a good method, and you think no one should use it. You should be prepared to back up your statements with data and analysis. You can use Twitter to link out to long-form pieces where you provide evidence and code.

## How to use LaTeX math in Rmd to display properly on GitHub Pages

While working on our PH525x online course material, Rafa and I wanted to base all lecture material on Rmd files, as these are easy for students to load into RStudio to walk through the code. Additionally, the rendered markdown files can be displayed nicely on GitHub Pages (which uses Jekyll to turn markdown into HTML). All we needed to do was copy the markdown files into the /pages directory, and they already looked pretty great when viewed on github.com (we also borrowed the theme from Karl Broman’s simple site template). By adding MathJax to the header of the GitHub Pages template, we can also render LaTeX math formulas from the original Rmd files.

Below I go over the issues we encountered with including math, but you can jump to the bottom if you just want our solution.

## How to check your simple definition of p-value

I just read Andrew Gelman’s post about an article with his name on it starting with an inaccurate definition of p-value. I sympathize with all parties. Journalists and editors are just trying to reduce technical terms by presenting layperson definitions. Earlier this year I caught a similar inaccurate definition on a site defining statistical terms for journalists. So admittedly, this wrong definition must be incredibly attractive to our minds for some reason.

A simple rule for testing if your definition is wrong:

*If knowing the truth could make the p-value (as you have defined it) go to zero, then it is wrongly defined.*

Let’s test:

**wrong definition**: p-value is the probability that a result is due to random chance

Suppose we know the truth, that the result is not due to random chance. Then the p-value as defined here should be zero. So this definition is wrong. The Wikipedia definition is too technical. I prefer the one from Gelman’s post:

**right definition**: “the probability of seeing something at least as extreme as the data, if the model […] were true”

where “model” refers to the boring model (e.g. the coin is balanced, the card guesser is not clairvoyant, the medicine is a sugar pill, etc.). This definition does not fail my test. We can calculate a probability under a model even when we know that model is not true.
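As a small worked example of the right definition, here is a sketch for the balanced-coin case. The data (60 heads in 100 flips) are made up, and the one-sided “at least as extreme” choice is an assumption for illustration.

```r
# Boring model: the coin is balanced, so heads ~ Binomial(n, 0.5).
n <- 100
observed.heads <- 60   # made-up data

# One-sided p-value: probability of a result at least as extreme as the data,
# computed under the boring model.
p.value <- pbinom(observed.heads - 1, size = n, prob = 0.5, lower.tail = FALSE)

# The same number, summing the binomial probabilities directly.
p.direct <- sum(dbinom(observed.heads:n, size = n, prob = 0.5))

print(c(p.value = p.value, p.direct = p.direct))
```

Nothing in this calculation depends on whether the coin really is balanced. Even if we knew the coin were biased, the number would be the same, which is why this definition passes the test.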

## R gotchas

I put together a short list of potential R gotchas: unexpected results which might trip up new R users.

For example, if we have a matrix m,

`m[1:2,]`

returns a matrix, while

`m[1,]`

returns a vector, unless one writes

`m[1,,drop=FALSE]`
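A quick way to see this for yourself, using a small made-up matrix:

```r
m <- matrix(1:6, nrow = 2)   # a small 2 x 3 example matrix

two.rows <- m[1:2, ]
one.row <- m[1, ]
one.row.kept <- m[1, , drop = FALSE]

print(dim(two.rows))      # dimensions are kept
print(dim(one.row))       # NULL: the dimensions were dropped
print(dim(one.row.kept))  # still a matrix, with one row
```

The dropped dimensions matter as soon as later code calls something like `nrow()` or `apply()` on the result.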

Setting aside what the default behavior of functions *should* be, my point is simply that these are important to be aware of. The list is here: https://github.com/mikelove/r-gotchas/blob/master/README.md

## Empirical Bayes and the James-Stein rule

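Suppose we observe 300 individual estimates $y_i$ which are distributed $y_i \sim N(\theta_i, \sigma_y^2)$, with $\sigma_y^2$ known.

Now if we assume $\theta_i \sim N(0, \sigma_\theta^2)$, the James-Stein rule gives an estimator for $\theta_i$ which dominates $y_i$.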

Below is code for a little RStudio script to see how changing the balance between the variance of the data y and the variance of the means changes the optimal amount of shrinkage. For more info, read the paper referenced in the code comments! The script uses RStudio’s manipulate library.

```r
# Stein's estimation rule and its competitors - an empirical Bayes approach
# B Efron, C Morris, Journal of the American Statistical Association, 1973
n <- 300
sigma.means <- 5
means <- rnorm(n, 0, sigma.means)
# sigma.y <- 5
library(manipulate)
manipulate({
  y <- rnorm(n,means,sigma.y)
  A <- sum((y/sigma.y)^2)/(n - 2) - 1
  B <- 1/(1 + A)
  eb <- (1 - B) * y
  par(mfrow=c(2,1),mar=c(5,5,3,1))
  plot(means, y, main="y ~ N(mean, sigma.y)\nmean ~ N(0, sigma.mean=5)")
  points(means, eb, col="red")
  legend("topleft","James-Stein estimators",col="red",pch=1)
  s <- seq(from=0,to=1,length=100)
  par(mar=c(5,5,1,1))
  plot(s, sapply(s, function(b) sum((means - (1 - b)*y)^2)),
       type="l", xlab="possible values for B", ylab="sum squared error")
  points(B, sum((means - eb)^2),col="red",pch=16)
  legend("top","James-Stein B",col="red",pch=16)
}, sigma.y = slider(1,10))
```
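For readers without RStudio, the same calculation can be run once without the slider. This is a sketch with `sigma.y` fixed at 5 and an arbitrary seed; it reuses the formulas from the script and simply prints the total squared error with and without shrinkage.

```r
set.seed(1)
n <- 300
sigma.means <- 5
sigma.y <- 5      # fixed here; the manipulate slider varies this value
means <- rnorm(n, 0, sigma.means)
y <- rnorm(n, means, sigma.y)

# Estimated shrinkage factor B, computed as in the script
A <- sum((y / sigma.y)^2) / (n - 2) - 1
B <- 1 / (1 + A)
eb <- (1 - B) * y

# Total squared error of the raw estimates and of the shrunken estimates
sse.raw <- sum((means - y)^2)
sse.eb <- sum((means - eb)^2)
print(c(B = B, sse.raw = sse.raw, sse.eb = sse.eb))
```

With equal variances for the data and the means, the estimated B should land near 0.5, and the shrunken estimates should have noticeably smaller total squared error than the raw ones.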

## More hclust madness

Here is a bit of code for making a heatmap which orders the rows of a matrix such that the first column (as ordered in the dendrogram) has all 0s then all 1s, then the 2nd column is similarly ordered in two groups conditioning on the 1st column, and so on. Hard to explain but easy to see in the picture below. I came up with weighting the columns by powers of 2 quickly, but then it took me a few minutes to realize I have to multiply the columns by the order of the order.

```r
x <- replicate(10,sample(0:1,100,TRUE))
library(gplots)
hc <- hclust(dist(t(x)))
y <- sweep(x,2,2^(ncol(x)-order(hc$order)),"*")
z <- x[order(rowSums(y)),]
heatmap.2(z, trace="none", key=FALSE,
          Rowv=FALSE,labRow=FALSE,
          Colv=as.dendrogram(hc),
          dendrogram="column",
          scale="none",col=c("grey","blue"),
          lwid=c(2,10))
```
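The “order of the order” step is easier to see on a tiny example. The leaf ordering below is made up (it stands in for `hc$order`), and the variable names are only for illustration.

```r
# A made-up leaf ordering: position 1 of the dendrogram holds column 3,
# position 2 holds column 1, position 3 holds column 2.
leaf.order <- c(3, 1, 2)

# order() of a permutation gives its inverse: the position of each column.
position <- order(leaf.order)
print(position)

# Weights as in the heatmap code: the leftmost column gets the largest power of 2.
weights <- 2^(length(leaf.order) - position)
print(weights)
```

Since each row of 0s and 1s times these weights sums to a distinct binary number, sorting by the row sums sorts first on the leftmost dendrogram column, then on the next, and so on.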

## Plot hclust with colored labels

Again I find myself trying to plot a cluster dendrogram with colored labels. With some insight from this post, I came up with the following function:

```r
library(RColorBrewer)

# matrix contains genomics-style data where columns are samples
# (if otherwise remove the transposition below)
# labels is a factor variable going along the columns of matrix
plotHclustColors <- function(matrix,labels,...) {
  colnames(matrix) <- labels
  d <- dist(t(matrix))
  hc <- hclust(d)
  # one color per level of labels
  labelColors <- brewer.pal(nlevels(labels),"Set1")
  # color each leaf label according to the level of labels it matches
  colLab <- function(n) {
    if (is.leaf(n)) {
      a <- attributes(n)
      labCol <- labelColors[which(levels(labels) == a$label)]
      attr(n, "nodePar") <- c(a$nodePar, lab.col=labCol)
    }
    n
  }
  clusDendro <- dendrapply(as.dendrogram(hc), colLab)
  plot(clusDendro,...)
}
```

In action:
