I <3 Beta

In my work, I take two common kinds of measurements of log data in Wikipedia: I add things up, or I measure the proportion of something that falls into an interesting category. Whenever I add something up, I get to base a whole bunch of my modeling around normal distributions[1]. For those of you who don't write love letters to your statistical methods, a normal distribution is the bell "curve" that the term "curved grading" refers to[2]. I use it to understand how much error there is in my measurements.

Figure 1. A normal distribution. Commonly referred to as the "bell curve", this distribution models a common pattern of measurement error.
R
library(ggplot2)
g=ggplot(
	data.frame(
		x=seq(-5,5,0.1),
		y=dnorm(seq(-5,5,0.1),0,1)
	),
	aes(x=x,y=y)
) + 
geom_area(
	fill="#eeeeee"
) +
geom_line() +
scale_y_continuous("", breaks=c()) + 
theme_bw()
png("images/standard_normal.600.png", width=600, height=300, res=100)
print(g)
dev.off()
png("images/standard_normal.png", width=1800, height=900, res=300)
print(g)
dev.off()

When taking measurements, it's always good to assume that you'll make mistakes. There's a relevant saying in wood/metal shop work: "Measure twice, cut once." The idea is that you should assume that you'll make mistakes and behave so that those mistakes won't waste your time and expensive materials. When measuring data, we take this idea one step further. I don't want to just take a good enough measurement. I need to know how good my measurement is so that I can convince myself and other people of how good it is. By figuring out the size and frequency of the mistakes that are made (error), I can show how confident you (the reader I'm trying to convince) should be that the average of my measurements is close enough to the real value I'm trying to measure. The curve of the normal distribution helps me do that because it represents a deeply consistent property of random measurement error when that error is not constrained in any way.
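As a minimal sketch of that idea in R (the measurement values below are made up for illustration), I can take a batch of noisy measurements, compute their mean, and use the standard error to say how close that mean probably is to the true value:

```R
# Ten noisy measurements of the same quantity (made-up numbers).
measurements = c(10.2, 9.8, 10.1, 9.9, 10.3, 10.0, 9.7, 10.2, 9.9, 10.1)

m = mean(measurements)                              # best guess at the true value
se = sd(measurements) / sqrt(length(measurements))  # standard error of the mean

# Under the normal model of measurement error, roughly 95% of the time
# the true value falls within about 2 standard errors of the mean.
c(lower=m - 2*se, upper=m + 2*se)
```

The interval gets narrower as I take more measurements, which is exactly the "how confident should you be" story the normal curve lets me tell.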

Proportions work a little bit differently. When I say proportion, I mean the same thing as a ratio, a percentage or a fraction. With a proportion, I can examine the relationship of one thing to another: for example, the proportion of people who are unemployed in the US, or the proportion of a Wikipedia editor's revisions that were reverted. The former is a bar-bet-settling statistic and the latter is one of my quick and dirty measures of the quality[3] of an editor's work in Wikipedia.

Measurement error around a proportion is weird though. It turns out that the scale of my measurement errors is constrained. The largest value you can have when measuring a proportion is 1, or 100%. The smallest is 0, or 0%. Since you can't have values outside of 0 or 1, the error in my measurements behaves strangely when it gets close to these limits.

When I took an intro to statistics course in my second year of grad school, we were instructed to use the "normal approximation" when talking about proportions. This approach always bothered me. I'll show you why.

Figure 2. Example normal approximations. Normal approximations are plotted for two proportions: 5/10 = 0.50 and 9/10 = 0.90. Note how the 9/10 distribution gets cut off at p = 1.0.
R
library(ggplot2)
norm_approx = function(success, failure){
	n = failure+success
	p = success/n
	se = sqrt(p*(1-p)/n)
	function(x){
		dnorm(x, p, se)
	}
}
g=ggplot(
	rbind(
		data.frame(
			proportion="5/10 = 0.5",
			x=seq(0,1,0.01),
			y=norm_approx(5, 5)(seq(0,1,0.01))
		),
		data.frame(
			proportion="9/10 = 0.90",
			x=seq(0,1,0.01),
			y=norm_approx(9, 1)(seq(0,1,0.01))
		)
	),
	aes(x=x,y=y, group=proportion)
) + 
geom_area(
	aes(fill=proportion),
	alpha=0.5,
	position="identity"
) +
geom_line(
	aes(color=proportion)
) + 
geom_vline(
	xintercept=c(0,1),
	linetype=1
) + 
scale_x_continuous("p", breaks=c(0,0.5,1)) + 
scale_y_continuous("", breaks=c()) +
theme_bw()
png("images/normal_approximation.600.png", width=600, height=300, res=100)
print(g)
dev.off()
png("images/normal_approximation.png", width=1800, height=900, res=300)
print(g)
dev.off()

The normal approximation allows me to think about the error in proportions by jamming my measurement into the normal distribution discussed above. This approach works most of the time, but if you apply it improperly, you'll get into trouble. In figure 2, the 5/10 = 0.50 proportion looks fine, but the 9/10 proportion is jammed up against the right wall at p = 1.0 and cut off. As I said before, it's impossible to have a proportion value that is > 1, so I have to cut off the normal distribution's curve when it crosses that value. As you can imagine, this approximation is quite wrong®, but an interesting question is: how wrong?
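A quick back-of-the-envelope calculation shows the problem for the 9/10 proportion (a sketch using the usual p ± 2·SE normal interval):

```R
# Normal-approximation interval for a proportion of 9 successes in 10 trials.
p = 9/10
n = 10
se = sqrt(p * (1 - p) / n)  # standard error of a proportion

# The upper bound lands above 1.0 -- an impossible proportion.
c(lower=p - 2*se, upper=p + 2*se)
```

The upper end of that interval is about 1.09, a value no proportion can ever take.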

Now, my statistics professor didn't just leave me with a broken way to think about measurement error. I was given a rule of thumb for when this approximation is and is not reasonable: the proportion being discussed must represent a ratio of two numbers that are both at least 10. In other words, 10:10 = 10/20 is the smallest number of observations I can have when applying this rule. Under this rule, 9:11 = 9/20 would not be valid even though there are still 20 measurements, because I've gotten too close to 1 or 0. This constraint enforces two things: (1) we have a minimum number of measurements, so our confidence about them is high, and (2) we can't measure proportions that are really close to 0 or 1 unless we take a lot of measurements.
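That rule of thumb is easy to encode (a sketch; `normal_ok` is my own name for it, not standard terminology):

```R
# Rule of thumb: the normal approximation is reasonable only when both
# the success count and the failure count are at least 10.
normal_ok = function(success, failure){
	success >= 10 && failure >= 10
}

normal_ok(10, 10)  # TRUE: 10/20 is the smallest valid case
normal_ok(9, 11)   # FALSE: still 20 observations, but too close to the edge
```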

Since I tend to work with large amounts of data, this constraint is usually not a problem. However, sometimes I want to reason about my measurement error when I can only take a few measurements. For example, sometimes I want to measure the quality of a newcomer's work in Wikipedia. Since they are a newcomer, I haven't seen much of their activity yet. This was a serious problem for me. That is, until I discovered the meaning of the beta distribution. It turns out that the beta distribution is the thing we are trying to approximate with the normal curve.

Figure 3. Example beta distributions. The beta distributions corresponding to figure 2 are plotted for comparison. Note how the 9/10 distribution falls to zero as it reaches p = 1.0.
R
library(ggplot2)
beta = function(success, failure){
	function(x){
		dbeta(x, success+1, failure+1)
	}
}
g=ggplot(
	rbind(
		data.frame(
			proportion="5/10 = 0.5",
			x=seq(0,1,0.01),
			y=beta(5, 5)(seq(0,1,0.01))
		),
		data.frame(
			proportion="9/10 = 0.90",
			x=seq(0,1,0.01),
			y=beta(9, 1)(seq(0,1,0.01))
		)
	),
	aes(x=x,y=y, group=proportion)
) + 
geom_area(
	aes(fill=proportion),
	alpha=0.5,
	position="identity"
) +
geom_line(
	aes(color=proportion)
) + 
geom_vline(
	xintercept=c(0,1),
	linetype=1
) + 
scale_x_continuous("p", breaks=c(0,0.5,1)) + 
scale_y_continuous("", breaks=c()) +
theme_bw()
png("images/beta.600.png", width=600, height=300, res=100)
print(g)
dev.off()
png("images/beta.png", width=1800, height=900, res=300)
print(g)
dev.off()

Figure 3, the beta distributions representing my measurement error, looks similar to figure 2, the normal approximations. But on close inspection, there's an obvious difference for the 9/10 proportion. Rather than slamming into the right wall at 1.0, the bell curve is "skewed" and drops to zero just before reaching 1.0.

This is really cool because it tells me the right™ thing about my measurement error. For the 9/10 proportion, the normal approximation says that it is possible that I'm looking at a 100% proportion, because the value at 1.0 is not zero[4]. However, we can defeat such nonsense without math. If my proportion represents 9/10, that means one of the values I observed is not part of the numerator. No matter how many more observations I take, it is impossible that the true proportion is p = 1.0 or 100%. The beta, on the other hand, tells me that the probability that p ≥ 1.0 is zero-zip-nada.
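That claim is easy to check numerically (a sketch; the beta parameters follow the same success+1, failure+1 convention used in the plotting code above):

```R
# Probability each model assigns to impossible values (p > 1)
# for the 9/10 proportion.
p = 9/10
se = sqrt(p * (1 - p) / 10)  # standard error under the normal approximation

1 - pnorm(1, mean=p, sd=se)  # normal approximation: a noticeable chunk of mass
1 - pbeta(1, 9 + 1, 1 + 1)   # beta: exactly zero
```

The normal approximation puts a double-digit percentage of its probability on proportions that cannot exist; the beta puts none there.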

By using the beta distribution to explain my measurement error, I can accurately compare measurements that have fewer observations. This opens a whole world of potential research questions to me that I had to ignore when using the normal approximation. So, I <3 the beta distribution. Sorry Jenny.

Footnotes

  1. The gamma distribution is actually more appropriate than the normal for a lot of measures that involve summing things up, but the normal distribution is really close for almost everything and easier to do math with. The advantages of the gamma over the normal are similar to those of the beta, so I may write about that in the future.
  2. When a course instructor uses a bell curve to "normalize" their students' scores, they are essentially saying, "Any mistakes I've made in teaching or assessing your success in my class can be attributed to random noise, and if the data doesn't look like that, I'll just force it to. No big deal. Enjoy your C regardless of how hard you worked, 68.27% of the class."
  3. I have better ways to measure the quality of an editor's work, but they are computationally difficult or require human labor.
  4. As it turns out, the density at any single point in a probability density function (the plots we are looking at) is exactly zero no matter the height of the curve. This is just a confusing technicality that you should ignore, but if I didn't put this here, some nerd might have blown a gasket at me.