[antlr-interest] Parsing HTML pages

Vaclav Barta vbar at comp.cz
Thu Jul 3 10:42:13 PDT 2008


On Thursday 03 July 2008 11:44:53 Alexander Nicolaysen Sørnes wrote:
> On Thursday 03 July 2008, 11:32:43, Ana Nelson wrote:
> > If you are trying to parse web pages remember that lots of people
> > write very bad HTML which doesn't conform to the standard, so it can
> > be very difficult to parse.
> Yes, that's why I thought something like ANTLR would be a good idea.
I'd say quite the opposite - ANTLR is designed to parse documents written in
complicated languages (e.g. programming languages) _exactly_ according to
those languages' specifications. There are facilities for recovering from
errors in generated parsers, but most ANTLR-generated parsers do not ignore
errors in their input, while most HTML parsers do. I'd say forget ANTLR and
use an existing HTML parser in your favorite language.
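
As a rough illustration of what "ignoring errors" looks like in practice,
here is a minimal sketch using Python's standard html.parser module. The
LinkCollector class and the BAD_HTML sample are made up for this example;
any lenient HTML parser in another language should behave similarly.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Deliberately broken HTML (invented sample): unclosed <p> and <b>,
# an unquoted attribute value, an uppercase tag name, a stray </i>,
# an unclosed <a>, and a missing </html>.
BAD_HTML = """
<html><body>
<p>First paragraph <b>bold text
<p>Second <A HREF=page2.html>link</a></i>
<a href="http://www.antlr.org/">ANTLR
</body>
"""

collector = LinkCollector()
collector.feed(BAD_HTML)   # no exception despite the malformed markup
collector.close()
print(collector.links)
```

The parser just reports the tags it can recognize and carries on, which is
usually what you want for pages found in the wild. A grammar-driven parser
would typically need explicit error-recovery rules to get that far.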

	Bye
		Vasek
--
http://www.mangrove.cz/
Open Source integration

