Thursday, July 2, 2009
The Need for Speed
On the drive from San Juan to Arecibo this morning, I got to wondering where my average driving speed falls in the distribution of drivers here in Puerto Rico. In the States, I felt like a pretty average driver, but here en la isla, the distribution of driving speeds is different. There are a lot of fast drivers, to be sure, but there is also a subpopulation of drivers whose speed is well (~10 mph) below the speed limit. This may be because PR is economically depressed relative to the US, so more old cars are on the road, or it may be a reaction to the more erratic driving habits there seem to be here. Either way, I definitely pass more people than pass me now.
So in an effort to discover where my driving speed fell relative to others (and in an effort to alleviate the boredom of driving 1.5 hrs alone), I started counting how many cars I passed and how many passed me as I was going 65 mph (the speed limit). Out of 55 pass events, only 9 involved me getting passed. To make this a tractable problem in my head, I decided to assume that driving speeds were normally (Gaussian) distributed about a mean--even though this contradicts my anecdotal evidence above. Using this approximation, my first instinct was to say that 1/6 of the cars on the road were faster than me (9/55 is about 1/6), and since ~2/3 of samples are within +/- 1 sigma of the mean in a Gaussian distribution, 1/6 of the samples would be above +1 sigma. So I approximated that I was a 1 sigma driver.
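As a quick check on that first-instinct estimate, here is a minimal Python sketch that turns the raw pass counts into a sigma offset. It assumes a normal distribution, as above, and deliberately ignores any sampling bias. The variable names are only for illustration.

```python
from statistics import NormalDist

passed_me = 9        # pass events where another car passed me
total_events = 55    # all pass events counted on the drive

# Naive reading: the fraction of pass events in which I was passed
# equals the fraction of drivers who are faster than me.
frac_faster = passed_me / total_events

# Offset from the mean, in units of sigma, that leaves frac_faster
# of a standard normal distribution above it.
naive_z = NormalDist().inv_cdf(1 - frac_faster)

print(f"fraction faster: {frac_faster:.3f}")
print(f"naive position: {naive_z:+.2f} sigma")
```

Using the exact 9/55 rather than the rounded 1/6 moves the answer only slightly, so the mental-math shortcut holds up for this step.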
But then it occurred to me that I needed to control for a significant sample bias. The test I was doing wasn't randomly selecting cars and comparing my speed to theirs. Cars were far more likely to get selected if the difference between their speed and my speed was large. A car going the same speed as me would never pass me, and I would never pass it. But I would assuredly pass almost every car on the road that was going 10 mph as I went 65. In a given stretch of time, the "road distance" over which I sample cars of velocity v is proportional to abs(v-v0), where v0 is my velocity. The effect this had on my samples was to underweight speeds close to my own and overweight the wings of the Gaussian distribution I had assumed as my model. If I drove exactly the mean velocity, this effect would not be terribly important--if the model were correct, it would still be the case that as many cars passed me as I passed. But as my velocity moves away from the mean velocity, the "normal drivers" who are only going a little faster than me get undersampled, so I only see the drivers who are tearing around like a bat out of hell. At the lower end, I still see the real slow-pokes on the road, but I also start seeing people who are going a bit faster than that, of which there are a lot more. The effect of this sample bias, it seems, would be to make it seem that I'm a farther outlier in my driving speed than I actually am.
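To see the size of this bias, here is a small Monte Carlo sketch in Python. The mean (55 mph) and sigma (15 mph) are made-up illustrative values, not measurements, and the simulation simply weights each simulated car by abs(v-v0) instead of modeling actual traffic.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

MEAN, SIGMA = 55.0, 15.0   # assumed, illustrative values (mph)
MY_SPEED = 65.0            # v0, my own speed (mph)

# Draw speeds from the assumed normal model. A Gaussian can produce a few
# unphysical negative speeds; they are rare enough not to matter here.
speeds = [random.gauss(MEAN, SIGMA) for _ in range(200_000)]

# True fraction of drivers who are faster than me.
true_frac = sum(v > MY_SPEED for v in speeds) / len(speeds)

# Chance of a car turning up in a pass event is proportional to |v - v0|.
weights = [abs(v - MY_SPEED) for v in speeds]
observed_frac = (
    sum(w for v, w in zip(speeds, weights) if v > MY_SPEED) / sum(weights)
)

print(f"true fraction faster than me:  {true_frac:.3f}")
print(f"fraction of pass events where I get passed: {observed_frac:.3f}")
```

With these assumed numbers, the fraction of pass events in which I get passed comes out noticeably smaller than the true fraction of faster drivers, which is the direction of bias described above.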
So now, to figure out where I fall in the (normal) distribution of driving speeds, I need to know what the mean driving speed and sigma are, so that I can compensate for the abs(v-v0) sampling factor. That means I need to figure out 2 numbers, but unfortunately, I only measured 1 number (that 1/6 of the pass events were me being passed), so I won't be able to properly constrain this problem. However, I should be able to figure out the mean on my drive home by finding the speed at which as many people pass me as I pass. For now, let's say this is 55 mph. Then all I need to do is find the sigma for which a Gaussian distribution around 55 mph, weighted by abs(v-65 mph), has 1/6 of the area lying above 65 mph. I just solved that numerically on my computer, and it's saying that the best-fit sigma is ~22 mph. So that puts me at about +1/2 sigma. That seems reasonable.
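For anyone who wants to try this kind of solve themselves, here is a minimal Python sketch of one way to set it up. It is not necessarily the calculation used above: the midpoint-rule integration, the bisection search, and the function names are all choices made for illustration. Depending on exactly how the weighting and integration are set up, the sigma it returns may differ from the ~22 mph quoted above.

```python
from statistics import NormalDist

def passed_fraction(sigma, mean=55.0, my_speed=65.0, steps=4000):
    """Fraction of pass events in which I am the one being passed,
    for speeds ~ Normal(mean, sigma) weighted by |v - my_speed|."""
    dist = NormalDist(mean, sigma)
    lo, hi = mean - 8 * sigma, mean + 8 * sigma
    dv = (hi - lo) / steps
    faster = total = 0.0
    for i in range(steps):
        v = lo + (i + 0.5) * dv                  # midpoint of this bin
        w = abs(v - my_speed) * dist.pdf(v) * dv
        total += w
        if v > my_speed:
            faster += w
    return faster / total

def solve_sigma(target=1 / 6, lo=1.0, hi=200.0, tol=1e-6):
    """Bisection on sigma; passed_fraction rises with sigma on this range."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if passed_fraction(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

sigma = solve_sigma()
z = (65.0 - 55.0) / sigma
print(f"sigma = {sigma:.1f} mph, my speed is at {z:+.2f} sigma")
```

Changing the `mean` default to whatever the drive home suggests, or `target` to the exact 9/55, is a one-line edit.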
An interesting next step (which I'm not going to do right now since I need to get to work) would be to translate the sample error in my pass measurements into an error in the determination of sigma, and then the error in my driving speed percentile.
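The first piece of that next step can be sketched in a few lines of Python, under the assumption (not checked here) that the pass events are independent, so the count of times I was passed is binomial. Propagating the resulting range through to sigma and to a percentile is left undone.

```python
import math

passed_me, total_events = 9, 55

# Measured fraction of pass events in which I was passed.
p = passed_me / total_events

# Binomial standard error on that fraction (assumes independent events).
std_err = math.sqrt(p * (1 - p) / total_events)

# Rough one-standard-error range for the measured fraction.
low, high = p - std_err, p + std_err

print(f"passed fraction = {p:.3f} +/- {std_err:.3f}")
print(f"one-sigma range: {low:.3f} to {high:.3f}")
```

Re-solving for sigma at the low and high ends of this range would give a rough error bar on sigma, and from there on the driving speed percentile.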