[antlr-interest] error recovery

Tue Aug 18 16:38:20 PDT 2009

Adrian Ber wrote:
> Hi all,
> 
> I'm making another attempt in the hope that somebody knows the answer and is willing to share.
> I have an HTML grammar and I have a lexer rule for strings:
> 
> STRING  : '"' (ESC|~('"'|'\\'|'\n'|'\r'))* '"'
>         | '\'' (ESC|~('\''|'\\'|'\n'|'\r'))* '\''
>         ;
> 
> But if I have text in the HTML like
>     I'm just a text
> then the lexer tries to match a STRING at the apostrophe and reports an error.
> 
> How can I solve this?

It depends whether you just want the answer to this specific question, or
whether you want to parse HTML ;-)

In your example the single-quote character is ordinary text, because it
appears between tags rather than inside one, so it should not start a
STRING. One fix is to make STRING a fragment rule that is only invoked
from a rule that matches a tag. Alternatively, it could be guarded by a
gated semantic predicate that is only true within a tag. Or you could
have separate "island grammars" for tag and non-tag contexts.

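To illustrate the idea behind the predicate option, here is a rough
hand-written sketch in Java. It is not ANTLR code and not generated by
ANTLR; the names TagAwareLexer, tokenize and inTag are made up for this
example. The inTag flag plays the role of the predicate's guard: a quote
only starts a STRING while the flag is set. The sketch deliberately
ignores escapes, comments, entities and everything else real HTML needs.

```java
import java.util.ArrayList;
import java.util.List;

public class TagAwareLexer {

    /** Splits input into tokens; a quote starts a STRING only inside a tag. */
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        boolean inTag = false; // the "guard": true between '<' and '>'
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (!inTag) {
                if (c == '<') {
                    tokens.add("TAG_OPEN");
                    inTag = true;
                    i++;
                } else {
                    // Outside a tag, everything up to the next '<' is text,
                    // quotes and apostrophes included.
                    int end = input.indexOf('<', i);
                    if (end < 0) end = input.length();
                    tokens.add("TEXT(" + input.substring(i, end) + ")");
                    i = end;
                }
            } else if (c == '>') {
                tokens.add("TAG_CLOSE");
                inTag = false;
                i++;
            } else if (c == '"' || c == '\'') {
                // Inside a tag, a quote opens a STRING that runs to the
                // next matching quote (no escape handling in this sketch).
                int end = input.indexOf(c, i + 1);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated string at " + i);
                }
                tokens.add("STRING(" + input.substring(i + 1, end) + ")");
                i = end + 1;
            } else if (Character.isWhitespace(c)) {
                i++; // skip whitespace between attributes
            } else {
                // Tag names, attribute names and punctuation such as '=' or '/'.
                int start = i;
                while (i < input.length()
                        && ">\"'".indexOf(input.charAt(i)) < 0
                        && !Character.isWhitespace(input.charAt(i))) {
                    i++;
                }
                tokens.add("WORD(" + input.substring(start, i) + ")");
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<a href='x.html'>I'm just a text</a>"));
    }
}
```

With that input, the apostrophe in "I'm" ends up inside a single TEXT
token, while 'x.html' inside the tag becomes a STRING.
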
However, this answer is just scratching the surface if you want to parse
full HTML -- even only correct HTML, never mind real-world "tag soup".

Personally, I wouldn't use ANTLR to parse HTML. There have been attempts
to do that, but they are rather limited and incomplete, and it would be a
*lot* of work to do better. I would use an existing HTML parser library,
after doing a good deal of research into what users of the various
libraries have said about them and whether they satisfy your requirements
for validation and so on.

For Java, see <http://java-source.net/open-source/html-parsers> as a
starting point.

-- 
David-Sarah Hopwood  ⚥  http://davidsarah.livejournal.com