Mike Love’s blog

How to check your simple definition of p-value

Posted in statistics by mikelove on October 16, 2014

I just read Andrew Gelman’s post about an article with his name on it that opens with an inaccurate definition of p-value. I sympathize with all parties. Journalists and editors are just trying to cut down on technical terms by offering layperson definitions. Earlier this year I caught a similar inaccurate definition on a site defining statistical terms for journalists. Evidently this wrong definition is incredibly attractive to our minds for some reason.

A simple rule for testing if your definition is wrong:

  • If knowing the truth could make the p-value (as you have defined it) go to zero, then it is wrongly defined.

Let’s test:

  • wrong definition: p-value is the probability that a result is due to random chance

Suppose we know the truth: the result is not due to random chance. Then the p-value as defined above should be zero, yet an actual p-value is computed from the data alone and doesn’t budge when we learn the truth. So this definition is wrong. The Wikipedia definition is too technical. I prefer the one from Gelman’s post:

  • right definition: “the probability of seeing something at least as extreme as the data, if the model […] were true”

where “model” refers to the boring model (e.g. the coin is balanced, the card guesser is not clairvoyant, the medicine is a sugar pill, etc.). This definition does not fail my test. We can calculate a probability under a model even when we know that model is not true.
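To make this concrete, here is a minimal R sketch (my own, not from Gelman’s post). We simulate flips of a coin that we know is biased, then compute the p-value under the boring model of a balanced coin; knowing the truth doesn’t send it to zero.

set.seed(1)
# we know the truth: the coin is biased, 60% heads
heads <- rbinom(1, size=100, prob=0.6)
# the p-value is computed under the boring model (a balanced coin):
# the probability of seeing something at least as extreme as the data
binom.test(heads, n=100, p=0.5)$p.value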

R gotchas

Posted in uncategorized by mikelove on June 2, 2014

I put together a short list of potential R gotchas: unexpected results which might trip up new R users.

For example, if we have a matrix m,

m[1:2,]

returns a matrix, while

m[1,]

returns a vector, unless one writes

m[1,,drop=FALSE]

Setting aside what the default behavior of functions should be, my point is simply that these behaviors are important to be aware of. The full list is here: https://github.com/mikelove/r-gotchas/blob/master/README.md
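A minimal, self-contained demonstration of the drop gotcha:

m <- matrix(1:6, nrow=3, ncol=2)
class(m[1:2,])           # "matrix" (and "array" in R >= 4.0)
class(m[1,])             # "integer": the dimensions were dropped
class(m[1,,drop=FALSE])  # still a matrix, now 1 x 2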

Empirical Bayes and the James-Stein rule

Posted in statistics by mikelove on November 7, 2013

Suppose we observe 300 individual estimates y_i which are distributed \mathcal{N}(m_i, \sigma_y^2), with \sigma_y known.

Now if we assume m_i \sim \mathcal{N}(0, \sigma_m^2), the James-Stein rule gives an estimator for m_i which dominates y_i: shrink each observation toward zero, \hat{m}_i = (1 - B) y_i, where B estimates the shrinkage factor \sigma_y^2 / (\sigma_y^2 + \sigma_m^2).

Below is code for a little RStudio script to see how changing the balance between the variance of the data y and the variance of the means changes the optimal amount of shrinkage. For more info, read the paper referenced below! It uses RStudio’s manipulate library.

[screenshot: the two-panel plot produced by the manipulate code below]

# Stein's estimation rule and its competitors - an empirical Bayes approach
# B Efron, C Morris, Journal of the American Statistical Association, 1973
n <- 300
sigma.means <- 5
means <- rnorm(n, 0, sigma.means)
# sigma.y <- 5  (sigma.y is instead set by the slider below)
library(manipulate)
manipulate({
  y <- rnorm(n,means,sigma.y)
  # A estimates sigma.means^2/sigma.y^2, so B = 1/(1 + A) estimates
  # sigma.y^2/(sigma.y^2 + sigma.means^2), the optimal shrinkage factor
  A <- sum((y/sigma.y)^2)/(n - 2) - 1
  B <- 1/(1 + A)
  eb <- (1 - B) * y
  # top panel: raw and shrunken estimates against the true means
  par(mfrow=c(2,1),mar=c(5,5,3,1))
  plot(means, y, main="y ~ N(mean, sigma.y)\nmean ~ N(0, sigma.mean=5)")
  points(means, eb, col="red")
  legend("topleft","James-Stein estimators",col="red",pch=1)
  # bottom panel: sum of squared errors over a grid of shrinkage factors
  s <- seq(from=0,to=1,length=100)
  par(mar=c(5,5,1,1))
  plot(s, sapply(s, function(b) sum((means - (1 - b)*y)^2)),
       type="l", xlab="possible values for B", ylab="sum squared error")
  points(B, sum((means - eb)^2),col="red",pch=16)
  legend("top","James-Stein B",col="red",pch=16)
}, sigma.y = slider(1,10))
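If you don’t have RStudio handy, here is a quick non-interactive check (my addition, same simulation) that the James-Stein estimates beat the raw estimates in total squared error:

set.seed(1)
n <- 300
sigma.means <- 5
sigma.y <- 5
means <- rnorm(n, 0, sigma.means)
y <- rnorm(n, means, sigma.y)
A <- sum((y/sigma.y)^2)/(n - 2) - 1
B <- 1/(1 + A)
eb <- (1 - B) * y
sum((means - y)^2)   # total squared error of the raw estimates
sum((means - eb)^2)  # total squared error of the shrunken estimates (smaller)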

More hclust madness

Posted in statistics, visualization by mikelove on October 24, 2013

Here is a bit of code for making a heatmap which orders the rows of a matrix such that the first column (as ordered by the dendrogram) has all 0s then all 1s; the second column is then similarly ordered into two groups, conditioning on the first column, and so on. It’s hard to explain but easy to see in the picture below. I came up with raising to the power of 2 quickly, but it took me a few more minutes to realize I had to multiply the columns by the order of the order.

x <- replicate(10,sample(0:1,100,TRUE))  # 100 x 10 binary matrix
library(gplots)
# cluster the columns
hc <- hclust(dist(t(x)))
# weight column j by 2^(ncol(x) - position of j in the dendrogram order),
# so each row reads as a binary number whose most significant bit is the
# leftmost column of the dendrogram
y <- sweep(x,2,2^(ncol(x)-order(hc$order)),"*")
# sort the rows by that binary number
z <- x[order(rowSums(y)),]
heatmap.2(z, trace="none", key=FALSE,
          Rowv=FALSE,labRow=FALSE,
          Colv=as.dendrogram(hc),
          dendrogram="column",
          scale="none",col=c("grey","blue"),
          lwid=c(2,10))

[figure: hclustOrder, the heatmap produced by the code above]
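As a sanity check of the ordering (my addition, not in the original post): reading each row of z in dendrogram column order as a binary number, the rows should come out in non-decreasing order.

# interpret each row (in dendrogram column order) as a binary number
codes <- z[, hc$order] %*% 2^((ncol(z)-1):0)
all(diff(as.vector(codes)) >= 0)  # TRUE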

Plot hclust with colored labels

Posted in statistics, visualization by mikelove on October 16, 2013

Again I find myself trying to plot a cluster dendrogram with colored labels. With some insight from this post, I came up with the following function:

library(RColorBrewer)
# matrix contains genomics-style data where columns are samples
#   (if otherwise remove the transposition below)
# labels is a factor variable going along the columns of matrix
plotHclustColors <- function(matrix,labels,...) {
  colnames(matrix) <- labels
  d <- dist(t(matrix))
  hc <- hclust(d)
  labelColors <- brewer.pal(nlevels(labels),"Set1")
  # color each leaf's label according to its factor level
  colLab <- function(n) {
    if (is.leaf(n)) {
      a <- attributes(n)
      labCol <- labelColors[which(levels(labels) == a$label)]
      attr(n, "nodePar") <- c(a$nodePar, lab.col=labCol)
    }
    n
  }
  clusDendro <- dendrapply(as.dendrogram(hc), colLab)
  plot(clusDendro,...)
}

In action:

[figure: hclustColor, a dendrogram with labels colored by group]
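For example, calling the function on simulated data (the data and group names here are made up for illustration):

# 50 features by 20 samples, in 3 groups
mat <- matrix(rnorm(50*20), nrow=50, ncol=20)
group <- factor(rep(c("A","B","C"), length.out=20))
plotHclustColors(mat, group)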

Jacob and Monod

Posted in genetics by mikelove on September 8, 2013

Binomial GLM for ratios of read counts

Posted in statistics by mikelove on September 2, 2013

For certain sequencing experiments (e.g. methylation data), one might end up with a ratio of read counts at a certain location satisfying a given property (e.g. ‘is methylated’) and want to test if this ratio is significantly associated with a given variable, x.

One way to proceed would be a linear regression of ratio ~ x. However, the case when 100 reads cover a nucleotide is not statistically equivalent to the case when 2 reads cover it: the binomial distribution becomes increasingly spiked at p*n as the number of reads n increases. So 100 reads covering a nucleotide give us more information than 2 reads.
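To put numbers on that, the standard deviation of the observed ratio shrinks with coverage; a quick sketch of my own, at a true probability of 0.5:

pr <- 0.5
# sd of the ratio y/n at coverage n = 2, 10, 100 reads
sapply(c(2, 10, 100), function(n) sqrt(pr*(1-pr)/n))
# roughly 0.35, 0.16, 0.05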

Here is a bit of code for using the glm() function in R with the binomial distribution with weights representing the covering reads.

[figure: binomialGLM, the two-panel plot produced by the code below]

n <- 100
# random poisson number of observations (reads)
reads <- rpois(n,lambda=5)
reads
# make a N(0,2) predictor variable x
x <- rnorm(n,0,2)
# x will be negatively correlated with the target variable y
beta <- -1
# through a sigmoid curve mapping x*beta to probabilities in [0,1]
p <- exp(x*beta)/(1 + exp(x*beta))
# binomial distribution from the number of observations (reads)
y <- rbinom(n,prob=p,size=reads)

# plot the successes (y) over the total number of trials (reads)
# and order the x-axis by the predictor variable x
par(mfrow=c(2,1),mar=c(2,4.5,1,1))
o <- order(x)
plot(reads[o],type="h",lwd=2,ylab="reads",xlab="",xaxt="n")
points(y[o],col="red",type="h",lwd=2)
points(reads[o],type="p",pch=20)
points(y[o],col="red",type="p",pch=20)
par(mar=c(4.5,4.5,0,1))
# more clear to see the relationship
# plot just the ratio
plot((y/reads)[o],type="h",col="red",ylab="ratio",xlab="rank of x")
points((y/reads)[o],type="p",col="red",pch=20)

# from help for glm():
# "For a binomial GLM prior weights are used to give 
# the number of trials when the response is the 
# proportion of successes"
fit <- glm(y/reads ~ x, weights=reads, family=binomial)
summary(fit)
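An equivalent specification from the same help page, which avoids the weights argument, is the two-column response form; the coefficients should match the fit above:

# response as cbind(successes, failures)
fit2 <- glm(cbind(y, reads - y) ~ x, family=binomial)
coef(fit)
coef(fit2)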