The internet is a powerful and dangerous thing. It all started when I read the latest xkcd, which warned me that visiting cracked.com was dangerous. Then I read about the "6 phenomena that science can't explain" (which was a very dramatic title for some underwhelming mysteries), and next thing I knew, I was reading about the Voynich Manuscript. I learned about cryptography, glossolalia, the Manchu language, among other things. Then I took a look at the manuscript and before I knew it, I had a transcribed version of the manuscript in electronic form using the European Voynich Alphabet. And it just went downhill from there.
To summarize, the Voynich Manuscript (hereafter VMS) is a handwritten, illustrated text roughly 500 years old. It is written in glyphs no one knows how to read; it is not clear whether it corresponds to a known language; it may or may not be encrypted; and little progress has been made in deciphering any of it, despite attempts by some very bright people. So I decided to have a crack at it.
The reason I got interested is the similarity to SETI. Back in 1974, Arecibo transmitted a message into space that had been designed to be decoded. We might receive a message like that some day. Or we might intercept something much like the VMS--a mass of data in a language we have no prior knowledge of--and find ourselves trying to figure out how to bootstrap a language: that is, to learn the grammar and semantics of a language from a static sample, without outside help.
Is this possible? For grammar, I'm pretty sure of it. I can imagine an algorithm (maybe Maximum Entropy Modeling and Bayesian learning applied to grouping and parsing) that uses correlations in the appearances of language elements (starting with letters and building up) and correlations in the behaviors of these elements relative to one another to build a model for parsing a language. For the VMS, I used something similar to this (not the MEM and Bayesian part) to show that spaces, line breaks, and paragraph breaks show similar grouping correlations relative to other VMS letters, and so can probably be considered one grammatical element of whitespace. That's a pretty simple thing to deduce, but it was actually something I was worried about in getting started with the VMS.
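To illustrate the whitespace argument, here is a toy sketch (not my actual analysis -- the mini-corpus and the helper names are invented for the example): build a context profile for each symbol from the counts of its left and right neighbors, then compare profiles with cosine similarity. Symbols that delimit the same kinds of units should end up with similar profiles.

```python
from collections import Counter, defaultdict
import math

def context_profiles(text):
    """For each symbol, count its labeled left/right neighbors."""
    profiles = defaultdict(Counter)
    for i, ch in enumerate(text):
        if i > 0:
            profiles[ch][('L', text[i - 1])] += 1
        if i < len(text) - 1:
            profiles[ch][('R', text[i + 1])] += 1
    return profiles

def cosine(p, q):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(p[k] * q[k] for k in set(p) & set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Toy EVA-flavored corpus: '_' is a space, '\n' a line break.
# If both delimit the same word boundaries, their context profiles
# should look alike, and unlike any letter's profile.
text = "daiin_chol_daiin\nchol_shol\ndaiin_chol"
prof = context_profiles(text)
print(cosine(prof['_'], prof['\n']))  # high: both behave as whitespace
print(cosine(prof['_'], prof['o']))   # low: a letter behaves differently
```

On real data you would of course need much more text and a principled clustering step, but the same signal -- space, line break, and paragraph break sharing neighbor statistics -- is what lets you collapse them into a single whitespace element.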
Semantics is another issue. Once upon a time, I would have given an optimistic answer about bootstrapping semantics from text that was not written for that purpose. However, after watching my child mysteriously acquire language, which showed me how hard-wired the human brain is for learning language from another human, and how much that process relies on shared experience and feedback, I'm less sure.
I would be interested to know if there is a field of mathematics that studies semantics and the properties a self-contained system needs in order to generate semantic relationships. The Arecibo message relied on a shared physical environment to try to bootstrap semantics. I wonder if it would be enough to describe the rules of the grammar of a language in the language itself. That way, once the reader had deduced the relationships between elements, reader and writer would share knowledge of that subject, which might let the reader correlate the structure of the descriptions with the grammatical structure and thereby establish the first semantic relationships.
Anyway, after preliminary analysis of the VMS, I'm pretty sure that it's not random gibberish (there are correlations between elements at levels ranging from letters to words), and if it's encrypted, it's a weak form of encryption that preserves these correlations. My pet theory, extended from the glossolalia idea, is that this is actually plaintext in a natural language written with an invented set of symbols, but that the natural language might be the accidental or intentional creation of a savant or scholar.
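One cheap way to check the "not random gibberish" claim is to compare the text's bigram conditional entropy against a shuffled copy of itself: shuffling destroys sequential correlations, so text with real structure should score measurably lower. A minimal sketch (using English stand-in text rather than the actual EVA transcription, which this example does not include):

```python
import math
import random
from collections import Counter

def conditional_entropy(text):
    """Estimate H(next char | current char) in bits from bigram counts."""
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    n = sum(pairs.values())
    h = 0.0
    for (a, b), count in pairs.items():
        p_pair = count / n          # P(a, b)
        p_cond = count / firsts[a]  # P(b | a)
        h -= p_pair * math.log2(p_cond)
    return h

# Stand-in corpus; for the VMS you would feed in the EVA transcription.
text = "the quick brown fox jumps over the lazy dog " * 20
random.seed(0)
shuffled = ''.join(random.sample(text, len(text)))

# Structured text is far more predictable than its shuffled counterpart.
print(conditional_entropy(text))
print(conditional_entropy(shuffled))
```

A large gap between the two numbers is evidence of sequential structure; a cipher that merely substitutes symbols one-for-one would leave that gap intact, which is consistent with the "weak encryption that preserves correlations" observation.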