Monday, November 9, 2009

The Need for SPEAD

I've been absent for a good while now as a result of participating in a (successful) deployment of our PAPER experiment in South Africa. The Karoo desert in SA, where we were stationed, was very reminiscent of Rangely, CO where I grew up, except for the occasional baboon or kudu in the road. Though it came at a price of a lot of work piled up for me when I got back, and an awfully long time away from J, the isolation from all but our experiment helped ferment some ideas I'd been having about migrating the AIPY toolkit I've been developing to use a streaming data format that would avoid unnecessary disk accesses, would allow AIPY to be integrated directly with the correlators developed by our CASPER project, and would help our experiment develop a real-time analysis pipeline for compensating for ionospheric distortion in our data.

After chatting with a lot of guys working on the Karoo Array Telescope in Cape Town, we came up with a concrete protocol built on something already being used for CASPER correlator output. I just got done writing my first grant proposal to the NSF, funding a graduate student to work on this protocol--the Streaming Protocol for Exchanging Astronomical Data (SPEAD, pronounced "speed"). Writing a grant myself was a learning process, and helped me understand where a lot of the questions I got asked by my previous advisors were coming from.

A lesson I got to take away from SA was this: the reason we were in SA (as opposed to Australia) for PAPER was because we had been working with the KAT team, sharing correlator development. The reason we were working with the KAT team was because CASPER and KAT started up a collaboration a few years before. And that collaboration was started up because Dan Werthimer went down to visit SA some years ago to help advise them in a review of the design of their telescope electronics. Dan was invited there because he struck up a fast friendship with Alan Langman (the KAT director) at an earlier conference. The moral of this chain of causes and effects being that sometimes large projects go in new directions because of personal friendships, and sometimes those friendships end up making the difference in the success of a project.

Sunday, September 6, 2009

"Guess What I Was Thinking" Logic Puzzles (A Rant)

Probably more than most, I like logic puzzles. They're fun. But there's a variety of logic puzzles, especially prevalent in IQ/Mensa tests, that I really dislike. They are the "what's the next symbol in the pattern"-type of puzzles, and I HATE them. They are written like there is one (and only one) answer, and you must be dense not to see it. But can't I put anything in that blank space and call it a "pattern"? And what if the pattern that I see when looking at the provided sequence isn't the one you were thinking of? Is anyone with me in recognizing that these aren't logic puzzles at all? They're "guess what I was thinking of" puzzles! They aren't adequately constrained. They don't specify the parameter space from which the sequence is drawn. Anything can be a pattern! Anything!

What they actually want you to do is find the most likely symbol given several measurements and a set of priors about the likelihood of the author picking a particular sequence, but they neglect to provide you with any information about those priors. Maybe they assume that you can guess the priors based on estimates of your own sequence-picking priors, but that only works if your brain works the same way as the authors'. And quite frankly, if the authors can't appreciate that answers to these puzzles they are writing are indeterminate, I'm pretty sure their brain isn't working the same way as mine. Quit calling these logic puzzles! Put them on a Berkeley Psychic Institute entrance exam, not a college entrance exam.

Ok. Done ranting.

Wednesday, August 26, 2009

Graph-SLAM

After a trying, but ultimately successful month spent extracting the family from Puerto Rico and re-embedding us in Berkeley, I'm just starting to get back on top of things enough to think about posting...

I had lunch yesterday with a good friend of mine, Pierre, who co-founded a company that specializes in sensory and mapping systems such as those that are used to create Google's "Street View". I was impressed to learn about their system for combining data from GPS, LIDAR, car odometers, and IMUs to create a consistent picture of how a vehicle is located and oriented in space as a function of time. They've spent a lot of time calibrating their systems, and use some sophisticated MCMC post-processing methods for deriving the actual trajectory of a vehicle.

Although the antennas in the PAPER array (that's the low-frequency interferometer I'm working on) are much less mobile than a car, there was considerable overlap between the problem Pierre has been working to solve and the calibration problem I am facing, which requires positioning antennas and celestial sources as a function of time in the face of ionospheric distortion, variable gains, etc. Pierre pointed me to Graph-SLAM as a formal description of the problem that we are trying to solve, and suggested that Kalman Filtering with RTS Smoothing was a powerful technique for converging to the optimal solution (with covariance information) in linear time.

Tuesday, July 14, 2009

The Voynich Manuscript: Bootstrapping Language

The internet is a powerful and dangerous thing. It all started when I read the latest xkcd, which warned me that visiting cracked.com was dangerous. Then I read about the "6 phenomena that science can't explain" (which was a very dramatic title for some underwhelming mysteries), and next thing I knew, I was reading about the Voynich Manuscript. I learned about cryptography, glossolalia, the Manchu language, among other things. Then I took a look at the manuscript and before I knew it, I had a transcribed version of the manuscript in electronic form using the European Voynich Alphabet. And it just went downhill from there.

To summarize, the Voynich Manuscript (hereafter VMS) is a handwritten text with some illustrations some 500 years old. It uses glyphs no one knows how to read, it is not clear if it corresponds to a known language, it may or may not be encrypted, and little progress has been made in deciphering any of it, despite the fact that some bright people have tried. So I decided to have a crack at it.

The reason I got interested is because of the similarities to SETI. Arecibo, back in 1974, transmitted a message off into space that had been designed to be decrypted. We might receive a message like that some day. Or we might intercept something much like the VMS--a bunch of data in a language that we have no prior knowledge of--and we may find ourselves trying to figure out how to bootstrap a language. That is to say, to learn the grammar and semantics of a language from a static example, without outside help.

Is this possible? For grammar, I'm pretty sure of it. I can imagine an algorithm (maybe Maximum Entropy Modeling and Bayesian learning applied to grouping and parsing) that uses correlations in the appearances of language elements (starting with letters and building up) and correlations in the behaviors of these elements relative to one another to build a model for parsing a language. For the VMS, I used something similar to this (not the MEM and Bayesian part) to show that spaces, line breaks, and paragraph breaks show similar grouping correlations relative to other VMS letters, and so can probably be considered one grammatical element of whitespace. That's a pretty simple thing to deduce, but it was actually something I was worried about in getting started with the VMS.
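The grouping idea can be sketched in a few lines. This is not the analysis I ran on the VMS, just a minimal illustration under assumptions: each symbol gets a "context profile" counting which symbols appear immediately before and after it, and two symbols are compared by the cosine similarity of those profiles. The helper names (`context_profiles`, `cosine`) and the English toy text are made up for the example.

```python
from collections import Counter
from math import sqrt

def context_profiles(symbols):
    """Map each symbol to a Counter of its immediate (side, neighbor) contexts."""
    profiles = {}
    for i, s in enumerate(symbols):
        prof = profiles.setdefault(s, Counter())
        if i > 0:
            prof[("prev", symbols[i - 1])] += 1
        if i + 1 < len(symbols):
            prof[("next", symbols[i + 1])] += 1
    return profiles

def cosine(a, b):
    """Cosine similarity between two context profiles (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy stand-in for a transcription: spaces separate words, newlines separate lines.
text = "\n".join([
    "the cat sat on the mat",
    "the dog ate the hat",
    "the rat hid in the vat",
])
profiles = context_profiles(list(text))

space_vs_newline = cosine(profiles[" "], profiles["\n"])
space_vs_letter = cosine(profiles[" "], profiles["a"])
print(f"space vs newline: {space_vs_newline:.2f}")
print(f"space vs 'a':     {space_vs_letter:.2f}")
```

On this toy text the space and the newline come out more alike than the space and the letter "a". A real transcription would need far more data, and probably longer contexts, before a comparison like this meant anything.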

Semantics is another issue. Once upon a time, I would have had an optimistic answer to bootstrapping semantics from text that was not written for that purpose. However, after watching my child mysteriously acquire language, illustrating how hard-wired the human brain is for learning language from another human, and how much it relies on shared experience and feedback, I'm less sure.

I would be interested to know if there is a field of mathematics that studies semantics and the properties that a self-contained system needs to have to be able to generate semantic relationships. The Arecibo message relied on a shared physical environment to try to bootstrap semantics. I wonder if it would be enough to describe the rules of the grammar of a language in the language itself. That way, once the reader had deduced the relationships between elements, there would be shared knowledge of that subject that might enable the reader to correlate the structure of the descriptions with the grammatical structure and thereby establish the first semantic relationships.

Anyway, after preliminary analysis of the VMS, I'm pretty sure that it's not random gibberish (there are correlations between elements on levels ranging from letters to words), and if it's encrypted, it's a weak form of encryption that preserves these correlations. My pet theory, extended from the glossolalia idea, is that this is actually plaintext in a natural language with an invented set of symbols, but that the natural language might be the accidental or intentional creation of a savant or scholar.

Thursday, July 9, 2009

Countability and Strong Positive Anymore

Seven or eight years ago, I discovered that I have a linguistic condition called "Positive Anymore." I was in college, chatting with my roommates, and said something like "Anymore, I just lift weights in the Leverett gym." One of my roommates just couldn't take it anymore. "I've heard you say 'anymore' like that for years now. What the hell does it mean?" A quick poll of those present revealed that I was the only one for whom that construction made grammatical sense. My aunt (a linguistics professor) gave me the prognosis: I had Positive Anymore.

I used to think that PA was a linguistic shortcoming of mine, but anymore I'm convinced it's more like a superpower. Whereas most English speakers can only use the word in negative constructions like "I don't drive anymore" I have the uncanny ability to use it positively as per the first sentence in this paragraph. Moreover, I don't just have PA, I have strong PA, which means I can, at will, detach "anymore" and put it anywhere in a sentence. "Anymore, I just take the bus." Astounded yet?
If you're still having trouble parsing that, replace "anymore" with "nowadays"--to me, they mean about the same thing.

I receive no end of flak from friends, relatives, and spouses about my PA, although it's not that uncommon a condition. In fact, I have caught several of my relatives (mostly on my dad's side) using PA even after making fun of me for my PA. They have it and don't even know it.

Anyway, S and I were trying to figure out if my use of PA was inconsistent with my use of "any", which I use according to the standard rules. But it turns out the standard rules are weirder than you might think. For example, "I don't want any spam" is a grammatical negative construction; "I want any spam" is ungrammatical and positive. "Do you want any spam?" is grammatical and seems positive, but it turns out that there is an implied negative in English owing to the uncertainty inherent in questions and subjunctives. Hmph. And then what about "I like any spam I can get"? That seems positive again, but the clause "I can get" is required to make it grammatical. And then the plot thickens. "I feed spam to any dog" is grammatical and did not require a clause to modify "any dog."

Our theory was that using "any" positively without a clause requires the noun modified by "any" to come in quantized units--to be countable. Dogs are countable; spam is a continuum, much like water or space-time. The best example we could think of that illustrated this was "fish". Fish can be a countable noun (number of live fishes) or a continuum noun (amount of dead fish to eat). If I say "I'll take any fish," the ambiguity in the countability of fish is broken--it's clear I'm talking about a live fish (or, perhaps, a type of fish, which is also countable). But if I say "I'll take any fish that you give me," the ambiguity is preserved.

The implication was that for my use of PA to be consistent with the standard use of "any" (which it may not need to be, since "anymore" is one word, not "any more"), I must be thinking of the time interval referred to by PA (i.e. now and continuing indefinitely into the future) as something countable rather than continuous. Maybe. I don't know. Anymore, I'm just really confused.

Thursday, July 2, 2009

The Need for Speed


On the drive from San Juan to Arecibo this morning, I got to wondering about where my average driving speed fell in the distribution of drivers here in Puerto Rico. In the states, I felt like I was a pretty average driver, but here en la isla, the distribution of driving speeds is different. There are a lot of fast drivers, to be sure, but there is also a subpopulation of drivers whose speed is significantly (~10 mph) below the speed limit. This may be because relative to the US, PR is economically depressed and so more old cars are on the road, or as a reaction to the more erratic driving habits there seem to be here, but anyway, I definitely pass more people than pass me now.

So in an effort to discover where my driving speed fell relative to others (and in an effort to alleviate the boredom of driving 1.5 hrs alone), I started counting how many cars I passed and how many passed me as I was going 65 mph (the speed limit). Out of 55 pass events, only 9 involved me getting passed. To make this a tractable problem in my head, I decided to assume that driving speeds were normally (gaussian) distributed about a mean--even though this contradicts my anecdotal evidence above. Using this approximation, my first instinct was to say that 1/6 of the cars on the road were faster than me, and since ~2/3 of samples are within +/- 1 sigma of the mean in a gaussian distribution, 1/6 of the samples would be above +1 sigma. So I approximated that I was a 1 sigma driver.
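The in-my-head estimate can be checked with Python's standard library. This is a minimal sketch that takes the counts above at face value and ignores, for now, any bias in how the cars were sampled; the variable names are only for illustration.

```python
from statistics import NormalDist

passed_me = 9      # pass events in which I was the one being passed
pass_events = 55   # all pass events counted on the drive

frac_faster = passed_me / pass_events

# z such that a fraction `frac_faster` of a standard normal lies above it
z = NormalDist().inv_cdf(1.0 - frac_faster)
print(f"fraction faster than me: {frac_faster:.3f}")
print(f"naive position: {z:+.2f} sigma")
```

The result lands within a few hundredths of +1 sigma, so the 2/3 rule of thumb was a fine shortcut for this step.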

But then it occurred to me that I needed to control for a significant sample bias. This is because the test I was doing wasn't randomly selecting cars and comparing my speed to them. Cars were far more likely to get selected if the difference between their speed and my speed was large. A car going the same speed as me would never pass me, and I would never pass it. But I would assuredly pass almost every car on the road that was going 10 mph as I went 65. The "road distance" that I sampled for different velocities is proportional to abs(v-v0), where v0 is my velocity. The effect this had on my samples was to underweight speeds close to my own and overweight the wings of the gaussian distribution I had assumed as my model. If I drove exactly the mean velocity, this effect would not be terribly important--if the model were correct, it would still be the case that as many cars passed me as I passed. But as my velocity moves away from the mean velocity, the "normal drivers" who are only going a little faster than me get undersampled, so I only see the drivers who are tearing around like a bat out of hell. At the lower end, I still see the real slow-pokes on the road, but I start seeing people who are going a bit faster than that, of which there are a lot more. The effect of this sample bias, it seems, would be to make it seem that I'm a farther outlier in my driving speed than I actually am.
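A quick simulation makes the bias concrete. This is only a sketch under assumptions: cars are scattered uniformly along a long road, every car holds a constant speed drawn from a gaussian, and the mean and sigma used here (55 and 15 mph) are made-up illustrative values, not measurements.

```python
import random

rng = random.Random(0)

MEAN, SIGMA = 55.0, 15.0  # assumed "true" speed distribution in mph (hypothetical)
V0 = 65.0                 # my speed in mph
HOURS = 1.5               # duration of the drive
HALF_ROAD = 200.0         # miles of road simulated on each side of my start
N_CARS = 200_000

faster_cars = 0  # cars faster than me, whether or not we ever meet
passed_me = 0
i_passed = 0

for _ in range(N_CARS):
    v = rng.gauss(MEAN, SIGMA)  # rare negative speeds are not filtered out
    x_start = rng.uniform(-HALF_ROAD, HALF_ROAD)  # position relative to me at t=0
    x_end = x_start + (v - V0) * HOURS            # position relative to me at the end
    if v > V0:
        faster_cars += 1
    if x_start > 0 and x_end < 0:
        i_passed += 1     # started ahead of me, ended behind me
    elif x_start < 0 and x_end > 0:
        passed_me += 1    # started behind me, ended ahead of me

true_fraction = faster_cars / N_CARS
observed_fraction = passed_me / (passed_me + i_passed)
print(f"cars faster than me on the road:   {true_fraction:.3f}")
print(f"pass events in which I was passed: {observed_fraction:.3f}")
```

With these made-up parameters, the share of pass events in which I get passed comes out well below the share of cars that are actually faster than me, which is the direction of the bias described above.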

So now, to figure out where I fall in the (normal) distribution of driving speeds, I need to know exactly what the mean driving speed and sigma are, so that I can compensate for the abs(v-v0) sampling factor. That means I need to figure out 2 numbers, but unfortunately, I only measured 1 number (that 1/6 of the passes while driving were me being passed) so I won't be able to properly constrain this problem. However, I should be able to figure out the mean on my drive home by finding the speed at which as many people pass me as I pass. For now, let's say this is 55 mph. Then all I need to do is find the sigma for which a gaussian distribution around 55 mph weighted by abs(v-65 mph) has 1/6 of the area lying above 65 mph. I just solved that numerically on my computer, and it's saying that the best-fit sigma is ~22 mph. So that puts me at about +1/2 sigma. That seems reasonable.
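For anyone who wants to play with the fit, here is a minimal sketch of one way to set it up. It is not the script I ran. It assumes the most literal reading of the model: a gaussian over all speeds (no cutoff at zero) multiplied by abs(v-v0), for which the two weighted areas have closed forms, plus a plain bisection search. The function names are hypothetical. The sigma it returns depends on those choices, so it need not match the figure quoted above.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def passed_fraction(sigma, mean=55.0, v0=65.0):
    """Fraction of the abs(v - v0)-weighted gaussian that lies above v0."""
    d = (v0 - mean) / sigma
    pdf, cdf = STD_NORMAL.pdf(d), STD_NORMAL.cdf(d)
    faster = pdf - d * (1.0 - cdf)  # integral of (v - v0) * N(v) over v > v0, divided by sigma
    slower = pdf + d * cdf          # integral of (v0 - v) * N(v) over v < v0, divided by sigma
    return faster / (faster + slower)

def solve_sigma(target, lo=1.0, hi=200.0, tol=1e-6):
    """Bisect for the sigma whose passed_fraction equals target.

    Relies on passed_fraction increasing with sigma when v0 is above the mean.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if passed_fraction(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

sigma_fit = solve_sigma(1.0 / 6.0)
print(f"sigma under this reading: {sigma_fit:.1f} mph")
print(f"my position: {(65.0 - 55.0) / sigma_fit:+.2f} sigma")
```

Changing the assumed mean, truncating the gaussian at zero, or weighting by something other than abs(v-v0) will all move the answer, which is a reminder of how weakly one measured number constrains this problem.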

An interesting next step (which I'm not going to do right now since I need to get to work) would be to translate the sample error in my pass measurements into an error in the determination of sigma, and then the error in my driving speed percentile.

Tuesday, June 2, 2009

What are the chances?

I cringe every time that I hear this phrase. I heard it most recently when my friend had her car stolen from San Juan. It was recovered in a semi-drivable state in Bayamon. She invested a couple thousand dollars to get rid of the "semi", only to have the car re-stolen a couple of months later. This time, when the car was recovered in Dorado, there was no "semi" to be had, so she's currently trying to sell it for pieces. While the police were fairly understanding (though less than helpful) the first time her car got stolen, the second time was occasion for all sorts of raised eyebrows and skepticism. And in exasperation my friend uttered the phrase in question.

"What are the chances" is a Pandora's box of bad statistics. Statistics is about hedging your bets given incomplete information, but this phrase is always uttered after the fact, when we have (relatively) complete information: it happened. So unless you plan on repeating the experiment, the chances are one. It happened.

As an example, let's take the famous Goat/Car Puzzle. There are 3 doors; one has a car behind it and the other two have goats. After you pick a door, the game-show host opens one of the other doors and reveals a goat. You are then offered the option of switching your choice to the other door. If you play this game repeatedly, you'll win more often if you switch your choice. But the instance just played out, the car was behind one of the doors, and if that was the door you picked, your chances of getting the car were 1. If you didn't, your chances were 0.
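The "play this game repeatedly" claim is easy to check by simulation. This is a minimal sketch that assumes the car is placed uniformly at random, the first pick is uniformly random, and the host always opens a goat door other than the one picked; the `play` function is just an illustrative helper.

```python
import random

def play(switch, rng):
    """Play one round; return True if the final choice wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(1)
n = 100_000
stay_rate = sum(play(False, rng) for _ in range(n)) / n
switch_rate = sum(play(True, rng) for _ in range(n)) / n
print(f"win rate when staying:   {stay_rate:.3f}")
print(f"win rate when switching: {switch_rate:.3f}")
```

Over many rounds staying wins about a third of the time and switching about two thirds. Those are statements about the ensemble of games, not about the one round that already happened.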

You might object: "What were the chances beforehand, when I didn't know where the goat and car were?" But to do that, you need to make some assumptions. You need to assume that at each playing of the game, the cars and goats are randomly assigned and/or you randomly pick doors. Otherwise, it might be the case that the car is always behind door #1, and you always pick door #2. Your chances of success wouldn't be so good in this case. You might have decent prior knowledge of how cars, goats, and doors are picked in this example, but for everyday occurrences, we usually have much more limited prior knowledge. Are cars randomly stolen, or are certain brands targeted? Are certain areas targeted? People often assume that these processes are random, but they rarely are. With limited priors on these events, the question "what are the chances" can't be answered with any certainty and any answers given should be taken with a great big shaker of salt.

Furthermore, people have selective attention. We ignore whole heaps of ordinary outcomes and only pay attention to ones that strike us as interesting. As a friend of mine once said: "Low probability stuff happens pretty regularly because stuff is happening all the time." Even if the processes involved are random, unlikely outcomes are to be expected if the processes are repeated often enough. People tend to ignore the ordinary outcomes, exclaim at the extraordinary ones, and then assume that something deeper is afoot. In my friend's case, the police started wondering if she was being personally targeted or if she was really bad at locking her car. But even if we assume a random model of car thefts, some number of unlikely outcomes doesn't automatically imply that our random model is wrong.
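To put a number on "repeated often enough", here is a minimal sketch assuming independent trials that each have the same small probability. The one-in-a-million figure and the trial counts are made up for illustration.

```python
def prob_at_least_once(p, n):
    """Probability that an event of per-trial probability p occurs at least once in n independent trials."""
    return 1.0 - (1.0 - p) ** n

p = 1e-6  # hypothetical one-in-a-million event
for n in (1_000, 1_000_000, 10_000_000):
    print(f"{n:>10,} trials: {prob_at_least_once(p, n):.4f}")
```

Give a one-in-a-million event a million independent chances and it is more likely than not to show up at least once.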

Finally, we also need to keep in mind that in complex systems like real life, there may be a huge number of possible outcomes. But something has to happen. When you roll a die, each number only has a 1/6 chance of coming up. Would you roll a die once and then exclaim: "Wow, it came up six! What are the chances?" In real life, there might be millions of outcomes, each with one-in-a-million chance of coming true, but the fact that one of them happens shouldn't be surprising.

Fighting against all of the pitfalls inherent in asking "What are the chances?", I've developed a reflexive response: "What are the chances?" One.