Sunday, November 05, 2006

The Language Thing

There was a bit of a translation controversy in the news. Outsourcing the translation of Iraqi documents to amateurs on the Internet led to the publication of nuclear weapon blueprints, which was kind of predictable. After all, Saddam Hussein's nuclear program prior to 1991 is the reason Iraq was under sanctions. Open source intelligence has its uses, but you'd have to be pretty stupid to publish classified information that you haven't translated. Publishing untranslated documents was the whole point of the site, but that ground has already been covered.

I've been playing around with some more interesting language resources recently. The British National Corpus is a free corpus that can be analyzed for linguistic information. A free corpus is a rarity; usually they cost a few thousand dollars or you have to create your own. It's a pretty fun site in a language nerd kind of way. The Linguistic Data Consortium has some very interesting articles and a decent collection of corpora. New articles are free, and the membership is reasonable if you have an organization paying for it. I've been playing around with both sites a bit: checking word frequencies and parts of speech (POS), reading articles, and trying anything else I can think up. Even if you don't have any experience with linguistics or natural language processing, you should get some amusement out of the two sites. At the very least, you don't have to worry about revealing classified information.