Tuesday, July 14, 2009

The Voynich Manuscript: Bootstrapping Language

The internet is a powerful and dangerous thing. It all started when I read the latest xkcd, which warned me that visiting cracked.com was dangerous. Then I read about the "6 phenomena that science can't explain" (a very dramatic title for some underwhelming mysteries), and next thing I knew, I was reading about the Voynich Manuscript. I learned about cryptography, glossolalia, and the Manchu language, among other things. Then I took a look at the manuscript itself, and before I knew it, I had a transcribed version in electronic form using the European Voynich Alphabet. And it just went downhill from there.

To summarize, the Voynich Manuscript (hereafter VMS) is a handwritten, illustrated text roughly 500 years old. It is written in glyphs no one knows how to read; it is not clear whether it corresponds to a known language; it may or may not be encrypted; and little progress has been made in deciphering any of it, despite attempts by some bright people. So I decided to have a crack at it.

I got interested because of the similarities to SETI. Back in 1974, Arecibo transmitted a message off into space that had been designed to be decoded. We might receive a message like that some day. Or we might intercept something much like the VMS--a bunch of data in a language that we have no prior knowledge of--and we may find ourselves trying to figure out how to bootstrap a language. That is to say, to learn the grammar and semantics of a language from a static sample, without outside help.

Is this possible? For grammar, I'm pretty sure it is. I can imagine an algorithm (maybe Maximum Entropy Modeling and Bayesian learning applied to grouping and parsing) that uses correlations in the appearances of language elements (starting with letters and building up), and in the behaviors of these elements relative to one another, to build a model for parsing a language. For the VMS, I used something similar to this (without the MEM and Bayesian parts) to show that spaces, line breaks, and paragraph breaks have similar grouping correlations relative to other VMS letters, and so can probably be treated as a single grammatical element of whitespace. That's a pretty simple deduction, but it was actually something I was worried about in getting started with the VMS.
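
To make the whitespace-grouping idea concrete, here's a toy sketch (not the code I actually ran; the sample string is a made-up stand-in for real EVA transcription): it compares the distributions of glyphs immediately preceding spaces and line breaks, on the idea that two separators acting as one grammatical element should occur in similar contexts.

```python
import math
from collections import Counter

def context_profile(text, symbol):
    """Normalized distribution of the characters immediately preceding symbol."""
    counts = Counter(text[i - 1] for i, ch in enumerate(text)
                     if ch == symbol and i > 0)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def cosine_similarity(p, q):
    """Cosine similarity between two sparse distributions."""
    dot = sum(p[k] * q.get(k, 0.0) for k in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

# Toy stand-in for an EVA-transcribed page (invented word sequence).
sample = "daiin chol daiin\nshol chor daiin\n\nqokeedy daiin chol\n"

similarity = cosine_similarity(context_profile(sample, " "),
                               context_profile(sample, "\n"))
print(similarity)  # high similarity suggests space and newline group alike
```

With real transcription you'd want profiles over following characters too, and a null comparison against ordinary glyphs, but the basic correlation test looks like this.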

Semantics is another issue. Once upon a time, I would have had an optimistic answer about bootstrapping semantics from text that was not written for that purpose. However, after watching my child mysteriously acquire language, a process that illustrates how hard-wired the human brain is for learning language from another human, and how much that learning relies on shared experience and feedback, I'm less sure.

I would be interested to know if there is a field of mathematics that studies semantics and the properties that a self-contained system needs in order to generate semantic relationships. The Arecibo message relied on a shared physical environment to try to bootstrap semantics. I wonder if it would be enough to describe the rules of the grammar of a language in the language itself. That way, once the reader had deduced the relationships between elements, reader and author would share knowledge of that subject, which might enable the reader to correlate the structure of the descriptions with the grammatical structure and thereby establish the first semantic relationships.

Anyway, after preliminary analysis of the VMS, I'm pretty sure that it's not random gibberish (there are correlations between elements on levels ranging from letters to words), and if it's encrypted, it's a weak form of encryption that preserves these correlations. My pet theory, extended from the glossolalia idea, is that this is actually plaintext in a natural language with an invented set of symbols, but that the natural language might be the accidental or intentional creation of a savant or scholar.

Thursday, July 9, 2009

Countability and Strong Positive Anymore

Seven or eight years ago, I discovered that I have a linguistic condition called "Positive Anymore." I was in college, chatting with my roommates, and said something like "Anymore, I just lift weights in the Leverett gym." One of my roommates just couldn't take it anymore. "I've heard you say 'anymore' like that for years now. What the hell does it mean?" A quick poll of those present revealed that I was the only one for whom that construction made grammatical sense. My aunt (a linguistics professor) gave me the diagnosis: I had Positive Anymore.

I used to think that PA was a linguistic shortcoming of mine, but anymore I'm convinced it's more like a superpower. Whereas most English speakers can only use the word in negative constructions like "I don't drive anymore," I have the uncanny ability to use it positively, as in the first sentence of this paragraph. Moreover, I don't just have PA, I have strong PA, which means I can, at will, detach "anymore" and put it anywhere in a sentence. "Anymore, I just take the bus." Astounded yet?
If you're still having trouble parsing that, replace "anymore" with "nowadays"--to me, they mean about the same thing.

I receive no end of flak from friends, relatives, and spouses about my PA, although it's not that uncommon a condition. In fact, I have caught several of my relatives (mostly on my dad's side) using PA even after making fun of me for my PA. They have it and don't even know it.

Anyway, S and I were trying to figure out if my use of PA was inconsistent with my use of "any", which I use according to the standard rules. But it turns out the standard rules are weirder than you might think. For example, "I don't want any spam" is a grammatical negative construction; "I want any spam" is ungrammatical and positive. "Do you want any spam?" is grammatical and seems positive, but it turns out that there is an implied negative in English owing to the uncertainty inherent in questions and subjunctives. Hmph. And then what about "I like any spam I can get"? That seems positive again, but the clause "I can get" is required to make it grammatical. And then the plot thickens: "I feed spam to any dog" is grammatical and does not require a clause to modify "any dog."

Our theory was that using "any" positively without a clause requires the noun modified by "any" to come in quantized units--to be countable. Dogs are countable; spam is a continuum, much like water or space-time. The best example we could think of to illustrate this was "fish". Fish can be a countable noun (a number of live fishes) or a continuum noun (an amount of dead fish to eat). If I say "I'll take any fish," the ambiguity in the countability of fish is broken--it's clear I'm talking about a live fish (or, perhaps, a type of fish, which is also countable). But if I say "I'll take any fish that you give me," the ambiguity is preserved.

The implication was that for my use of PA to be consistent with the standard use of "any" (which it may not need to be, since "anymore" is one word, not "any more"), I must be thinking of the time interval referred to by PA (i.e. now and continuing indefinitely into the future) as something countable rather than continuous. Maybe. I don't know. Anymore, I'm just really confused.

Thursday, July 2, 2009

The Need for Speed

On the drive from San Juan to Arecibo this morning, I got to wondering about where my average driving speed fell in the distribution of drivers here in Puerto Rico. In the states, I felt like I was a pretty average driver, but here en la isla, the distribution of driving speeds is different. There are a lot of fast drivers, to be sure, but there is also a subpopulation of drivers whose speed is significantly (~10 mph) below the speed limit. This may be because, relative to the US, PR is economically depressed and so more old cars are on the road, or because drivers are reacting to the more erratic driving habits here; either way, I definitely pass more people than pass me now.

So in an effort to discover where my driving speed fell relative to others (and in an effort to alleviate the boredom of driving 1.5 hrs alone), I started counting how many cars I passed and how many passed me as I was going 65 mph (the speed limit). Out of 55 pass events, only 9 involved me getting passed. To make this a tractable problem in my head, I decided to assume that driving speeds were normally (gaussian) distributed about a mean--even though this contradicts my anecdotal evidence above. Using this approximation, my first instinct was to say that ~1/6 of the cars on the road were faster than me, and since ~2/3 of samples fall within +/- 1 sigma of the mean in a gaussian distribution, 1/6 of the samples lie above +1 sigma. So I approximated that I was a 1 sigma driver.
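
That back-of-the-envelope reading checks out in a couple of lines (stdlib only): the fraction of a normal distribution lying above the mean plus one sigma is 1 - Phi(1), which is close to the observed 9/55.

```python
import math

def normal_tail(z):
    """Fraction of a standard normal distribution lying above z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

frac_above_one_sigma = normal_tail(1.0)
observed = 9 / 55
print(frac_above_one_sigma, observed)  # ~0.159 vs ~0.164
```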

But then it occurred to me that I needed to control for a significant sample bias. The test I was doing wasn't randomly selecting cars and comparing my speed to theirs: cars were far more likely to be selected if the difference between their speed and mine was large. A car going the same speed as me would never pass me, and I would never pass it, but I would assuredly pass almost every car on the road going 10 mph as I went 65. The "road distance" that I sampled for a given velocity is proportional to abs(v-v0), where v0 is my velocity. The effect on my samples was to underweight speeds close to my own and overweight the wings of the gaussian distribution I had assumed as my model. If I drove exactly the mean velocity, this effect would not matter much--if the model were correct, it would still be the case that as many cars passed me as I passed. But as my velocity moves away from the mean, the "normal drivers" who are only going a little faster than me get undersampled, so I mostly see the drivers who are tearing around like a bat out of hell. At the lower end, I still see the real slow-pokes on the road, but I also start seeing people who are going a bit faster than that, of which there are a lot more. The effect of this sample bias, then, is to make me look like a farther outlier in driving speed than I actually am.
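
A quick Monte Carlo sketch makes the bias concrete (the mean and sigma here are assumed for illustration, not measured): draw speeds from a gaussian, then accept each car into the "pass event" sample with probability proportional to abs(v - v0). The biased sample contains a smaller fraction of cars faster than v0 than the raw population does, which is exactly why the naive 1-sigma estimate overstates my outlier-ness.

```python
import random

# Assumed (not measured) parameters for illustration.
MEAN, SIGMA, V0 = 55.0, 10.0, 65.0
random.seed(1)

population = [random.gauss(MEAN, SIGMA) for _ in range(200_000)]
max_gap = max(abs(v - V0) for v in population)

# Accept a car with probability proportional to |v - V0|,
# mimicking the road-distance sampling bias.
sampled = [v for v in population if random.random() < abs(v - V0) / max_gap]

frac_pop = sum(v > V0 for v in population) / len(population)
frac_sampled = sum(v > V0 for v in sampled) / len(sampled)
print(frac_pop, frac_sampled)  # the biased sample sees relatively fewer fast cars
```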

So now, to figure out where I fall in the (normal) distribution of driving speeds, I need to know the mean driving speed and sigma, so that I can compensate for the abs(v-v0) sampling factor. That means I need to figure out 2 numbers, but unfortunately, I only measured 1 number (that 1/6 of the passes were me being passed), so I won't be able to fully constrain this problem. However, I should be able to figure out the mean on my drive home by finding the speed at which as many people pass me as I pass. For now, let's say this is 55 mph. Then all I need to do is find the sigma for which a gaussian distribution around 55 mph, weighted by the abs(v-65 mph) sampling factor, has 1/6 of its area lying above 65 mph. I just solved that numerically on my computer, and it's saying that the best-fit sigma is ~22 mph. So that puts me at about +1/2 sigma. That seems reasonable.
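
Here's a minimal stdlib-only sketch of that kind of numerical solve, assuming the bias enters as a multiplicative abs(v - 65) weight on the gaussian. The best-fit sigma depends on exactly how the weighting is applied, so treat the printed value as illustrative rather than definitive.

```python
import math

MEAN, V0, TARGET = 55.0, 65.0, 1.0 / 6.0  # mph, mph, observed passed fraction

def weighted_tail_fraction(sigma, n=20_000):
    """Fraction of the |v - V0|-weighted gaussian probability lying above V0."""
    lo, hi = MEAN - 12.0 * sigma, MEAN + 12.0 * sigma
    dv = (hi - lo) / n
    above = total = 0.0
    for i in range(n):
        v = lo + (i + 0.5) * dv  # midpoint Riemann sum
        w = math.exp(-0.5 * ((v - MEAN) / sigma) ** 2) * abs(v - V0)
        total += w
        if v > V0:
            above += w
    return above / total

# The weighted tail fraction grows with sigma, so bisection works.
lo_s, hi_s = 1.0, 100.0
for _ in range(60):
    mid = 0.5 * (lo_s + hi_s)
    if weighted_tail_fraction(mid) < TARGET:
        lo_s = mid
    else:
        hi_s = mid
sigma_fit = 0.5 * (lo_s + hi_s)
print(sigma_fit)
```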

An interesting next step (which I'm not going to do right now since I need to get to work) would be to translate the sample error in my pass measurements into an error in the determination of sigma, and then into an error in my driving-speed percentile.