For certain sequencing experiments (e.g. methylation data), one might end up with a ratio of read counts at a certain location satisfying a given property (e.g. ‘is methylated’) and want to test if this ratio is significantly associated with a given variable, x.

One way to proceed would be a linear regression of ratio ~ x. However the case when 100 reads cover a nucleotide is not statistically equivalent to the case when 2 reads cover a nucleotide. The binomial probabilities will become increasingly spiked at p*n as the number of reads n increases. So the case with 100 reads covering gives us more information than the case with 2 reads covering.

Here is a bit of code for using the glm() function in R with the binomial distribution with weights representing the covering reads.

n <- 100
# random poisson number of observations (reads)
reads <- rpois(n,lambda=5)
reads
# make a N(0,2) predictor variable x
x <- rnorm(n,0,2)
# x will be negatively correlated with the target variable y
beta <- -1
# through a sigmoid curve mapping x*beta to probabilities in [0,1]
p <- exp(x*beta)/(1 + exp(x*beta))
# binomial distribution from the number of observations (reads)
y <- rbinom(n,prob=p,size=reads)
# plot the successes (y) over the total number of trials (reads)
# and order the x-axis by the predictor variable x
par(mfrow=c(2,1),mar=c(2,4.5,1,1))
o <- order(x)
plot(reads[o],type="h",lwd=2,ylab="reads",xlab="",xaxt="n")
points(y[o],col="red",type="h",lwd=2)
points(reads[o],type="p",pch=20)
points(y[o],col="red",type="p",pch=20)
par(mar=c(4.5,4.5,0,1))
# more clear to see the relationship
# plot just the ratio
plot((y/reads)[o],type="h",col="red",ylab="ratio",xlab="rank of x")
points((y/reads)[o],type="p",col="red",pch=20)
# from help for glm():
# "For a binomial GLM prior weights are used to give
# the number of trials when the response is the
# proportion of successes"
fit <- glm(y/reads ~ x, weights=reads, family=binomial)
summary(fit)