For certain sequencing experiments (e.g. methylation data), one might end up with a ratio of read counts at a certain location satisfying a given property (e.g. ‘is methylated’) and want to test if this ratio is significantly associated with a given variable, x.
One way to proceed would be a linear regression of ratio ~ x. However the case when 100 reads cover a nucleotide is not statistically equivalent to the case when 2 reads cover a nucleotide. The binomial probabilities will become increasingly spiked at p*n as the number of reads n increases. So the case with 100 reads covering gives us more information than the case with 2 reads covering.
Here is a bit of code for using the glm() function in R with the binomial distribution with weights representing the covering reads.
n <- 100
# random poisson number of observations (reads)
reads <- rpois(n,lambda=5)
# make a N(0,2) predictor variable x
x <- rnorm(n,0,2)
# x will be negatively correlated with the target variable y
beta <- -1
# through a sigmoid curve mapping x*beta to probabilities in [0,1]
p <- exp(x*beta)/(1 + exp(x*beta))
# binomial distribution from the number of observations (reads)
y <- rbinom(n,prob=p,size=reads)
# plot the successes (y) over the total number of trials (reads)
# and order the x-axis by the predictor variable x
o <- order(x)
# more clear to see the relationship
# plot just the ratio
plot((y/reads)[o],type="h",col="red",ylab="ratio",xlab="rank of x")
# from help for glm():
# "For a binomial GLM prior weights are used to give
# the number of trials when the response is the
# proportion of successes"
fit <- glm(y/reads ~ x, weights=reads, family=binomial)