Monday, July 24, 2006

It's all about semantics

Peter Norvig, Director of Search Quality at Google, posed a few tough questions to Tim Berners-Lee, inventor of the World Wide Web (WWW), at the recent conference of the American Association for Artificial Intelligence. As anyone who is deeply involved in information retrieval or metadata research knows, Berners-Lee is pushing the Semantic Web as the biggest thing since the Interweb. The basic idea of the Semantic Web (SW) is that web pages will be written in a way that lets computers extract meaning from them in order to perform various operations. For example, your computer could automatically buy tickets for the movie you've been reading about on the Internet, at your favorite theater, at a time that fits your schedule. Berners-Lee first described it in his landmark article for Scientific American.

Norvig’s criticism of the Semantic Web is that its proponents need to take into account incompetence, unwillingness to comply, and dishonesty. The incompetence he mentions is a problem on the World Wide Web right now, with web designers using non-standard markup techniques. The unwillingness to comply, or the competition problem, has already been seen with HTML, and I doubt the SW will be any different: if one were the web-wide leader in selling widgets, why would one rewrite one's entire website so that competitors could search that data? The deception aspect is obvious to anyone who has clicked on a link high in their search results and been directed to a porn site when that wasn’t what they were searching for. Norvig raised more criticisms in a 2005 paper, but I suppose you don’t berate the father of the WWW during the Q & A at a conference.

Berners-Lee took on the criticisms one by one. On compliance, he suggested that powerful search engine companies can pressure others into exposing their data in Resource Description Framework (RDF) form. RDF is part of the backbone of the SW, and I’ll get to it shortly. On security and deception, he pointed out that the Semantic Web requires explicit digital signatures on files. This would allow Semantic Web engines to index only trusted signatures and ignore unsigned Semantic Web pages.
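To give a flavor of what RDF data looks like, here is a minimal sketch of a movie listing in Turtle notation. The `dc:` prefix is the real Dublin Core vocabulary, but the `ex:` vocabulary, the URIs, and all the values are made up for illustration:

```turtle
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://example.org/schema/> .    # hypothetical vocabulary

# Hypothetical listing: one screening described as machine-readable triples
<http://example.org/movies/1234>
    dc:title    "Some Example Movie" ;
    ex:showtime "2006-07-24T19:30:00" ;
    ex:theater  <http://example.org/theaters/42> .
```

The point is that each statement is an unambiguous subject–predicate–object triple, which is what would let a program (rather than a human) answer "what's showing tonight at my theater?"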

The astute reader will notice that Berners-Lee did not touch incompetence. Some of the people who have commented on this story, and some of the Semantic Web true believers I have talked to, say that advanced authoring tools will solve this. To which I say: B.S. If advanced tools could solve this problem, why aren’t all web pages valid? Sure, some pages are still hand-coded, but most webmasters are using Dreamweaver or some other WYSIWYG web editor. The problem, of course, is that different browsers treat tags differently; in some cases the same browser will even treat the same page differently depending on its version.
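As a concrete (made-up) illustration of the kind of sloppy markup I mean, a fragment like the following is invalid HTML, yet every major browser renders it without complaint, each recovering from the errors in its own way:

```html
<html>
<body>
<p>First paragraph
<p>Second paragraph      <!-- unclosed <p> tags -->
<b><i>bold and italic</b></i>  <!-- improperly nested tags -->
</body>
</html>
```

Browsers tolerate this because their parsers silently repair errors; a Semantic Web agent trying to extract meaning has no such luxury, since a mis-nested or unclosed element can change what a statement actually says.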

There are problems dealing with semantics in the Semantic Web that anyone who has studied natural language processing would expect. I don't think the W3C has found a magic bullet for language translation problems, but the effort could still be useful. Pushing the boundaries of technology in this way could spur real developments in natural language processing and information retrieval. Eventually the SW may become a reality, but perhaps not for a few more decades.
