[antlr-interest] error recovery
David-Sarah Hopwood
david-sarah at jacaranda.org
Tue Aug 18 16:38:20 PDT 2009
Adrian Ber wrote:
> Hi all,
>
> I'm making another attempt in the hope that somebody knows the answer and is willing to share.
> I have an HTML grammar and I have a lexer rule for strings:
>
> STRING : '"' (ESC|~('"'|'\\'|'\n'|'\r'))* '"'
> | '\'' (ESC|~('\''|'\\'|'\n'|'\r'))* '\''
> ;
>
> But if I have text in HTML like
> I'm just a text
> then the lexer will try to match a string starting at the apostrophe and it will generate an error.
>
> How can I solve this?
It depends whether you just want the answer to this specific question, or
whether you want to parse HTML ;-)
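The single-quote character in your example is valid because it appears in
the text between tags, not inside a tag, so the lexer should not try to
start a STRING there.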
STRING could be made a fragment rule that is only invoked within a
tag. Alternatively it could be guarded by a gated semantic predicate
that is only active within a tag. Or you could have separate
"island grammars" for tag and non-tag contexts.
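To illustrate the idea behind all three options (quotes only start a string
while the lexer is inside a tag), here is a rough hand-written sketch in Java.
It is not ANTLR-generated code and not a real HTML tokenizer; the class name
TagAwareLexer, the token names, and the inTag flag are all made up for
illustration, and ESC sequences are left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a tiny context-sensitive tokenizer in which quote
// characters only start a STRING token while we are inside a tag.
public class TagAwareLexer {

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        boolean inTag = false; // plays the role of the predicate's guard
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (!inTag) {
                if (c == '<') {
                    inTag = true;
                    tokens.add("TAG_OPEN");
                    i++;
                } else {
                    // Outside a tag, everything up to the next '<' is plain
                    // text, including any quote characters.
                    int end = input.indexOf('<', i);
                    if (end < 0) end = input.length();
                    tokens.add("TEXT(" + input.substring(i, end) + ")");
                    i = end;
                }
            } else if (c == '>') {
                inTag = false;
                tokens.add("TAG_CLOSE");
                i++;
            } else if (c == '"' || c == '\'') {
                // Inside a tag, a quote starts a string that runs to the
                // matching quote (escape sequences are not handled here).
                int end = input.indexOf(c, i + 1);
                if (end < 0) {
                    throw new IllegalArgumentException(
                        "unterminated string at offset " + i);
                }
                tokens.add("STRING(" + input.substring(i + 1, end) + ")");
                i = end + 1;
            } else if (Character.isWhitespace(c)) {
                i++; // whitespace inside a tag is skipped
            } else if (Character.isLetterOrDigit(c)) {
                int end = i;
                while (end < input.length()
                        && Character.isLetterOrDigit(input.charAt(end))) {
                    end++;
                }
                tokens.add("NAME(" + input.substring(i, end) + ")");
                i = end;
            } else {
                tokens.add("SYMBOL(" + c + ")"); // '=', '/', and so on
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<a href='x'>I'm just a text</a>"));
    }
}
```

With this input the apostrophe in "I'm" ends up inside a TEXT token, while
'x' inside the tag becomes a STRING. In an ANTLR lexer the same flag would
typically be kept as a lexer member and tested in the predicate.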
However, this answer is just scratching the surface if you want to parse
full HTML -- even only correct HTML, never mind real-world "tag soup".
Personally, I wouldn't use ANTLR to parse HTML. There have been attempts
to do that, but they are rather limited and incomplete, and it would be a
*lot* of work to do better. I would use an existing HTML parser library
(and do a good deal of research into what users of various libraries have
said about them, and whether they satisfy your requirements for validation
etc.).
For Java, see <http://java-source.net/open-source/html-parsers> as a
starting point.
--
David-Sarah Hopwood ⚥ http://davidsarah.livejournal.com