More hclust madness

Here is a bit of code for making a heatmap, which orders the rows of a matrix such that the first column (as ordered by in the dendrogram) has all 0s then all 1s, then the 2nd column is similarly ordered in two groups conditioning on the 1st column and so on. Hard to explain but easy to see in the picture below. I came up with raising to the power of 2 quickly, but then it took me a few minutes to realize I have to multiply the columns by the order of the order.

x <- replicate(10,sample(0:1,100,TRUE))
library(gplots)
hc <- hclust(dist(t(x)))
y <- sweep(x,2,2^(ncol(x)-order(hc$order)),"*")
z <- x[order(rowSums(y)),]
heatmap.2(z, trace="none", key=FALSE,
          Rowv=FALSE,labRow=FALSE,
          Colv=as.dendrogram(hc),
          dendrogram="column",
          scale="none",col=c("grey","blue"),
          lwid=c(2,10))

hclustOrder

Advertisements

Plot hclust with colored labels

Again I find myself trying to plot a cluster dendrogram with colored labels. With some insight from this post, I came up with the following function:

library(RColorBrewer)
# matrix contains genomics-style data where columns are samples 
#   (if otherwise remove the transposition below)
# labels is a factor variable going along the columns of matrix
plotHclustColors <- function(matrix,labels,...) {
  colnames(matrix) <- labels
  d <- dist(t(matrix))
  hc <- hclust(d)
  labelColors <- brewer.pal(nlevels(labels),"Set1")
  colLab <- function(n) {
    if (is.leaf(n)) {
      a <- attributes(n)
      labCol <- labelColors[which(levels(lab) == a$label)]
      attr(n, "nodePar") <- c(a$nodePar, lab.col=labCol)
    }
    n
  }
  clusDendro <- dendrapply(as.dendrogram(hc), colLab)
  plot(clusDendro,...)
}

In action:

hclustColor

Points and line ranges

Two ways of plotting a grid of points and line ranges. I’m coming around to ggplot2. I recommend skimming the first few chapters of the book to understand what is going on – but it only takes about 30 min or so to understand enough to make basic plots.

m <- 10
k <- 3
d <- data.frame(x=factor(rep(1:k,m)), y=rnorm(m*k), z=rep(1:m,each=k))
d$ymax <- d$y + 1
d$ymin <- d$y - 1

# pretty simple
library(ggplot2)
p <- ggplot(d, aes(x=x, y=y, ymin=ymin, ymax=ymax))
p + geom_pointrange() + theme_bw() + facet_wrap(~ z)

# messy
par(mfrow=c(3,4), mar=c(3,3,2,1))
for (i in 1:m) {
  with(d[d$z == i,], {
    plot(as.numeric(x), y, main=i, xlim=c(0,k+1), ylim=c(-3,3), pch=20, xaxt="n")
    axis(1,at=1:k,1:k)
    segments(as.numeric(x),ymin,as.numeric(x),ymax)
  })  
}

Plotting hclust

After many years I’ve finally worked out the x and y coordinates of the points in plot.hclust.

hang <- 0.07
hc <- hclust(dist)
plot(hc)
pt.heights <- c(hc$height[hc$merge[,1] < 0],hc$height[hc$merge[,2] < 0])[order(-1 * c(hc$merge[,1][hc$merge[,1] < 0],hc$merge[,2][hc$merge[,2] < 0]))]
points(1:length(hc$order), pt.heights[hc$order] - hang)

R one liner: histogram for integers

Here is a function for making better histograms for integers.

For example, you have

x = c(1,1,1,1,1,2,2,2,2,6,6,6,6,7,7,7,7,7,8,8)

If you call hist(x) you get the 1s and 2s in the same bin.

int.hist = function(x,ylab="Frequency",...) {
barplot(table(factor(x,levels=min(x):max(x))),space=0,xaxt="n",ylab=ylab,...);axis(1)
}

Then trying int.hist(x):