[antlr-interest] Trouble with grammar
Gavin Lambert
antlr at mirality.co.nz
Wed May 9 13:23:14 PDT 2007
At 04:56 10/05/2007, Alex Shneyderman wrote:
>gspTag
>:
> '<' NS ':' TAG '>' TEXT '</' NS ':' TAG '>'
>;
>
>TEXT: (~'<')+;
>NS: (~':')+ ;
>TAG: ('a'..'z'|'A'..'Z')+;
>
>WS: (' '|'\t'|'\r'|'\n')+ {skip();} ;
>
>On this input:
>
><g:test>adfas</g:test>
Lexing is performed before any parsing, so the lexer has no idea
which parser rule is waiting for its output. The three tokens
TEXT, NS, and TAG overlap almost completely (you should have
gotten ambiguity warnings about that), so letters such as 'g' can
end up as part of any of the three tokens, not necessarily NS. In
fact the entire sequence 'g:test>adfas' could be made into a
single TEXT token. (What's most likely happening, though, is that
'<g' is turned into an NS and then ':test>adfas' into a TEXT, and
so forth.)
You need to do your lexing in a context-free manner: e.g. make
separate tokens for groups of alphanumeric text and for each of
the symbols you're trying to recognise, and let the parser decide
which group is a namespace, a tag name, or body text. (You can
make separate tokens for letters and numbers if you want to, and
combine them later, but at least for the grammar above that
doesn't seem necessary.) You should try to avoid overlaps between
tokens wherever possible.
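As a rough illustration of that advice, here is an untested sketch
of how the grammar above might be reorganised. The grammar name and
the token names (LANGLE, LANGLE_SLASH, RANGLE, COLON, NAME) and the
body rule are made up for this example, and the body is assumed to
contain only letters, which is enough for the sample input:

```
grammar Gsp;   // hypothetical grammar name

// The parser, not the lexer, decides which NAME is the namespace,
// which is the tag name, and which is body text.
gspTag
    : LANGLE NAME COLON NAME RANGLE
      body
      LANGLE_SLASH NAME COLON NAME RANGLE
    ;

// Assumption: body text is just runs of letters. Extend this rule
// if the body may contain other tokens.
body
    : NAME*
    ;

// One token per symbol; none of these overlap with NAME.
LANGLE_SLASH : '</' ;
LANGLE       : '<' ;
RANGLE       : '>' ;
COLON        : ':' ;

// A single token for any run of letters.
NAME : ('a'..'z'|'A'..'Z')+ ;

WS : (' '|'\t'|'\r'|'\n')+ {skip();} ;
```

With this layout the input <g:test>adfas</g:test> should lex as
LANGLE NAME COLON NAME RANGLE NAME LANGLE_SLASH NAME COLON NAME
RANGLE, and each token can only be matched one way. Note that
skipping WS also throws away whitespace inside the body, so if
that matters you would need to keep it as a real token instead.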
Have a look at the examples supplied with ANTLR.