Plot hclust with colored labels

Again I find myself trying to plot a cluster dendrogram with colored labels. With some insight from this post, I came up with the following function:

library(RColorBrewer)  # provides brewer.pal()

# matrix contains genomics-style data where columns are samples 
#   (if otherwise remove the transposition below)
# labels is a factor variable going along the columns of matrix
plotHclustColors <- function(matrix,labels,...) {
  colnames(matrix) <- labels
  d <- dist(t(matrix))
  hc <- hclust(d)
  # one color per factor level
  labelColors <- brewer.pal(nlevels(labels),"Set1")
  # color each leaf label according to its factor level
  colLab <- function(n) {
    if (is.leaf(n)) {
      a <- attributes(n)
      labCol <- labelColors[which(levels(labels) == a$label)]
      attr(n, "nodePar") <- c(a$nodePar, lab.col=labCol)
    }
    n
  }
  clusDendro <- dendrapply(as.dendrogram(hc), colLab)
  plot(clusDendro,...)
}

In action:


Binomial GLM for ratios of read counts

For certain sequencing experiments (e.g. methylation data), one might end up with a ratio of read counts at a certain location satisfying a given property (e.g. ‘is methylated’) and want to test if this ratio is significantly associated with a given variable, x.

One way to proceed would be a linear regression of ratio ~ x. However, a ratio observed from 100 reads covering a nucleotide is not statistically equivalent to the same ratio observed from 2 reads. The binomial probabilities become increasingly concentrated around p*n (relative to n) as the number of reads n increases, so a position covered by 100 reads gives us more information than a position covered by 2 reads.
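To make the difference in information concrete, here is a small sketch that computes the standard deviation of the observed ratio directly from the binomial probabilities. The value p = 0.3 and the helper name ratioSD are chosen purely for illustration.

```r
# sd of the observed ratio k/n when k ~ Binomial(n, p)
# (ratioSD is a hypothetical helper name; p = 0.3 is an arbitrary choice)
ratioSD <- function(n, p) {
  k <- 0:n
  probs <- dbinom(k, size = n, prob = p)
  sqrt(sum((k / n - p)^2 * probs))
}

p <- 0.3
sd2 <- ratioSD(2, p)      # ratio from 2 covering reads
sd100 <- ratioSD(100, p)  # ratio from 100 covering reads

# both agree with the closed form sqrt(p * (1 - p) / n)
c(sd2, sd100)
c(sqrt(p * (1 - p) / 2), sqrt(p * (1 - p) / 100))
```

The ratio from 2 reads is far noisier than the ratio from 100 reads, which is what the weights in the GLM account for.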

Here is a bit of code for using the glm() function in R with the binomial distribution with weights representing the covering reads.


n <- 100
# random poisson number of observations (reads)
reads <- rpois(n,lambda=5)
# make a normal predictor variable x with mean 0 and sd 2
x <- rnorm(n,0,2)
# x will be negatively correlated with the target variable y
beta <- -1
# through a sigmoid curve mapping x*beta to probabilities in [0,1]
p <- exp(x*beta)/(1 + exp(x*beta))
# binomial distribution from the number of observations (reads)
y <- rbinom(n,prob=p,size=reads)

# order the observations by the predictor variable x
o <- order(x)
# plot just the ratio of successes (y) to trials (reads),
# which makes the relationship with x easier to see
plot((y/reads)[o],type="h",col="red",ylab="ratio",xlab="rank of x")

# from help for glm():
# "For a binomial GLM prior weights are used to give 
# the number of trials when the response is the 
# proportion of successes"
fit <- glm(y/reads ~ x, weights=reads, family=binomial)
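
As a sanity check, the weighted-ratio form should give the same coefficients as the two-column (successes, failures) form of the response. The sketch below re-simulates the data so that it runs on its own; the seed, the names dat, fitRatio and fitCounts, and the step that drops positions with zero reads (where y/reads is 0/0) are additions for illustration.

```r
set.seed(1)
n <- 100
reads <- rpois(n, lambda = 5)
x <- rnorm(n, 0, 2)
p <- plogis(-1 * x)  # same sigmoid as above with beta = -1
y <- rbinom(n, prob = p, size = reads)

# drop positions with no covering reads, where y/reads is 0/0
keep <- reads > 0
dat <- data.frame(x = x[keep], y = y[keep], reads = reads[keep])

# proportion of successes, weighted by the number of trials
fitRatio <- glm(y / reads ~ x, weights = reads, family = binomial, data = dat)
# equivalent specification: successes and failures as two columns
fitCounts <- glm(cbind(y, reads - y) ~ x, family = binomial, data = dat)

coef(fitRatio)
coef(fitCounts)
```

Both fits should recover a negative coefficient for x, in line with beta = -1 in the simulation.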

How wrong is hypergeometric test with one random margin?

In biostats and bioinformatics, the hypergeometric distribution is often used to assign a probability of surprise to the amount of overlap between results and annotation, e.g., the levels of 100 genes are changed by a drug treatment and 50 of those genes are annotated as relating to the immune system. The probability of surprise of such an overlap depends on the total number of genes examined in the analysis and the number of genes annotated as relating to the immune system.

However a hypergeometric test is not perfect for this application, as it assumes the margins are fixed (“margins” meaning the sums along the side of the contingency table, i.e. the number of changed genes and the number of immune system genes). While the annotation side might be considered fixed, the number of genes which are observed as changed is better considered a random variable, as it depends on the dataset.

What happens if one of the margins is a random variable? Here is a simple example showing how the null distribution of the number of genes in the intersection changes when one of the margins is allowed to vary by different amounts.

  • consider 100 genes, 20 annotated for a given category
  • black/blue density is randomly taking 20 genes as changed
  • red density is flipping a 0.2 coin 100 times and taking this many genes as changed
  • green densities are variations on a censored negative binomial, which has more variance than the binomial in red
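
The setup in the list above can be sketched as a simulation. This is an illustration under stated assumptions, not the code behind the figure: the seed, the number of replicates, and the negative binomial parameters (mu = 20, size = 5, capped at 100 genes) are made up here.

```r
set.seed(1)
nsim <- 1e5
nGenes <- 100      # genes examined
nAnnotated <- 20   # genes annotated for the category

# fixed margin: exactly 20 genes are taken as changed in every replicate
overlapFixed <- rhyper(nsim, m = nAnnotated, n = nGenes - nAnnotated, k = 20)

# random margin: flip a 0.2 coin 100 times to get the number of changed genes
kBinom <- rbinom(nsim, size = nGenes, prob = 0.2)
overlapBinom <- rhyper(nsim, m = nAnnotated, n = nGenes - nAnnotated, k = kBinom)

# more variable margin: negative binomial capped at the number of genes
# (mu = 20 and size = 5 are assumed values, not taken from the post)
kNB <- pmin(rnbinom(nsim, mu = 20, size = 5), nGenes)
overlapNB <- rhyper(nsim, m = nAnnotated, n = nGenes - nAnnotated, k = kNB)

# the mean overlap stays near 4, while the variance grows with the margin's variance
sapply(list(fixed = overlapFixed, binomial = overlapBinom, negbin = overlapNB), mean)
sapply(list(fixed = overlapFixed, binomial = overlapBinom, negbin = overlapNB), var)
```

With a binomial margin the overlap is itself Binomial(20, 0.2), with variance 3.2, compared with about 2.59 for the hypergeometric with both margins fixed.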



Poisson regression

In trying to explain generalized linear models, I often say something like: GLMs are very similar to linear models but with different domains for the target y, e.g. positive numbers, outcomes in {0,1}, non-negative integers, etc. This explanation bypasses the more interesting point though, that the optimization problem for fitting the coefficients is totally different, after applying the link function.


This can be seen by comparing the coefficients from a linear regression of log counts to those from a Poisson regression. For some cases, the fitted lines are quite similar, however they diverge if you introduce outliers. A casual explanation here would be that the Poisson likelihood is thrown off more by high counts than by low counts; the high count pulls up the expected value for x=2 in the second plot, but the low count does not substantially pull down the expected value for x=3 in the third plot.
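
A minimal sketch of this comparison on simulated data follows. The design (x from 1 to 4, five counts per value, one high count of 200 added at x = 2) and the seed are made up for illustration and are not the data behind the plots.

```r
set.seed(1)
x <- rep(1:4, each = 5)
y <- rpois(length(x), lambda = exp(2 + 0.5 * x))

# add one unusually high count at x = 2
xOut <- c(x, 2)
yOut <- c(y, 200)

# linear regression of log counts vs. Poisson regression with log link
fitLM <- lm(log(yOut) ~ xOut)
fitPois <- glm(yOut ~ xOut, family = poisson)

rbind(logLinear = coef(fitLM), poisson = coef(fitPois))

# the Poisson fit balances residuals on the count scale:
# fitted counts sum to the observed counts
c(sum(fitted(fitPois)), sum(yOut))
# the log-linear fit balances residuals on the log scale instead
sum(residuals(fitLM))
```

The two fits satisfy different balance conditions, one on the count scale and one on the log scale, which is one way to see why a single high count moves the Poisson fit more than the log-linear fit.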


Splitting data

The caret package has a nice function for splitting data into balanced subsets. However, I don’t see why I get 4 rows rather than 3 out of 10 in this example, given that the p argument is defined as “the percentage of data that goes to training”.

library(caret)
d <- data.frame(x=rnorm(10), group=factor(c(1,1,1,2,2,2,3,3,3,3)))
d
            x group
1   1.0089900     1
2   0.4854706     1
3   1.7083259     1
4  -1.3362274     2
5   1.4905259     2
6   1.6451234     2
7   1.0361174     3
8   0.2369341     3
9  -2.0043264     3
10  1.4361718     3
d[createDataPartition(d$group, p=3/10)$Resample1,]
            x group
3   1.7083259     1
4  -1.3362274     2
8   0.2369341     3
10  1.4361718     3
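
One possible explanation, offered as an assumption about how createDataPartition works internally rather than something confirmed here: if the function samples within each level of the factor and rounds each level's count up, then the per-group counts add up to 4 rather than 3. The sketch below uses only base R to show the arithmetic; the names groupSizes and perGroup are made up for illustration.

```r
# group sizes from the example above: 3, 3 and 4 rows
groupSizes <- table(factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)))
p <- 3 / 10

# assumed rule: each level contributes ceiling(size * p) rows
perGroup <- ceiling(groupSizes * p)
perGroup       # 1, 1 and 2 rows for groups 1, 2 and 3
sum(perGroup)  # 4 rows in total
```

This matches the output above, which has one row from group 1, one from group 2 and two from group 3.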