Wednesday, October 24, 2007

Google: Statistical machine translation

ArsTechnica has a mini review of Google's translation service. Google has switched away from the rule-based machine translation it used to rely on, and which most machine translation services still use.

I remember studying linguistic rules and statistical machine translation methods in college. As the article suggests, neither one is great, but they can work well enough for someone to feel their way to the actual translation. The linguistic-rules approach parses texts into an intermediary state using formal grammar rules of the source language. The intermediary text is then transformed into the target language using formal grammar rules of the target language.

The main problem with the linguistic-rules approach is that it amounts to taking a sentence, marking it up into a parse tree, rearranging it into the parse tree of the target language, and then changing it word for word. Another major problem is that the grammars do not do well with slang, since there may not be a direct translation. A further problem is one of syntax: there may be structures missing from the source that are needed in the target. For example, to properly translate "I went to the store" from English to Russian one needs to know whether I traveled on foot or in a vehicle, since that changes the verb.
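The parse-rearrange-substitute pipeline can be sketched as a toy example. Everything here is invented for illustration: a trivial "parser," a tiny dictionary, and a single reordering rule for a subject-object-verb target language.

```python
# Toy rule-based transfer: "parse" a sentence, reorder the tree by a
# target-language rule, then substitute word for word.
# The dictionary and the SVO -> SOV rule are made-up examples.

DICTIONARY = {"I": "watashi", "read": "yomu", "books": "hon"}

def translate(sentence):
    # Trivial "parse" into a flat subject-verb-object tree.
    subject, verb, obj = sentence.split()
    # Structural rule: the target language puts the verb last (S-O-V).
    reordered = [subject, obj, verb]
    # Word-for-word substitution, with no handling of slang or missing words.
    return " ".join(DICTIONARY[word] for word in reordered)

print(translate("I read books"))  # -> watashi hon yomu
```

The brittleness is visible even in the sketch: any word outside the dictionary, or any sentence that doesn't fit the expected structure, breaks the pipeline.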

The statistical approach basically uses an algorithm to weigh the probability of the part of speech and/or meaning of a word. The statistics can be refined with the help of volunteers marking up a sentence or providing a more accurate translation. Given enough corrections and a large enough corpus, the system can improve. Google appears to be using its index of web pages as the corpus and the users of the service as the volunteers, instead of the usual college student looking for beer money.

Google's approach reminds me of a few journal articles on using web pages as an inexpensive means of developing a corpus. Most corpora are rather expensive, proprietary collections of text of language in everyday use. The statistical approach seems like a no-brainer for Google: they already have a corpus lying around, and by harnessing users even a poor algorithm is bound to get better. The linguistic-rules approach only gets better with the development of more elaborate syntactic and transformation rules. The only question is: what took Google so long to figure this out?