[antlr-interest] Trouble with grammar

Wed May 9 13:23:14 PDT 2007

At 04:56 10/05/2007, Alex Shneyderman wrote:
 >gspTag
 >:
 >     '<' NS ':' TAG '>' TEXT '</' NS ':' TAG '>'
 >;
 >
 >TEXT: (~'<')+;
 >NS:  (~':')+ ;
 >TAG: ('a'..'z'|'A'..'Z')+;
 >
 >WS: (' '|'\t'|'\r'|'\n')+ {skip();} ;
 >
 >On this input:
 >
 ><g:test>adfas</g:test>

Lexing is performed prior to any parsing, and the three tokens 
TEXT, NS, and TAG are nearly indistinguishable from each other 
(you should have gotten ambiguity warnings about that).  So 
letters such as 'g' can end up part of any of the three tokens, not 
necessarily NS.  And in fact the entire sequence 'g:test>adfas' 
could be made into a single TEXT token.  (What's most likely 
happening though is that '<g' is turned into an NS and then 
':test>adfas' into a TEXT, and so forth.)

You need to do your lexing in a context-free manner (the lexer 
cannot see what the parser expects next) -- e.g. make 
separate tokens for groups of alphanumeric text and the various 
symbols you're trying to recognise.  (You can make separate tokens 
for letters and numbers if you want to, and combine them later, 
but at least for the grammar above that doesn't seem 
necessary.)  You should try to avoid overlaps between tokens 
wherever possible.
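
As a rough sketch of that advice (not a tested grammar), the rules 
could be rearranged along these lines.  The grammar name Gsp, the 
single NAME token, and the 'text' parser rule are names made up for 
illustration; the body of the tag is also simplified to plain words, 
so whitespace, digits and punctuation inside it are not handled here.

    grammar Gsp;

    // The parser decides which NAME is the namespace and which is
    // the tag, based purely on position.
    gspTag
        :   '<' NAME ':' NAME '>' text '</' NAME ':' NAME '>'
        ;

    // Simplifying assumption: body text is just a run of words.
    text
        :   NAME*
        ;

    // One token for any run of letters; nothing else overlaps it.
    NAME:   ('a'..'z'|'A'..'Z')+ ;

    WS  :   (' '|'\t'|'\r'|'\n')+ {skip();} ;

With rules like these, the input <g:test>adfas</g:test> should lex as 
'<' NAME ':' NAME '>' NAME '</' NAME ':' NAME '>', since no token 
can swallow the punctuation.  Checking that the closing names match 
the opening ones would still need an action or a later pass.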

Have a look at the examples supplied with ANTLR.