[antlr-interest] Need help parsing text format

Mon Apr 30 05:00:53 PDT 2007

At 23:46 30/04/2007, Isabelle Muszynski wrote:
 >My grammar is shown at the end of this mail.
 >The problem is that it won't parse the cases where alphanumeric
 >fields only contain for ex. letters :
 >
 >BCSABC/12.CHARLIE/HAM-BRE.1/bla&&^^%%$$bla.3
[...]
 >fragment DIGIT 	: '0'..'9'	;
 >	
 >fragment LETTER :	'a'..'z' | 'A'..'Z' ;	
 >
 >fragment ALPHA 	:	LETTER | DIGIT;
 >
 >fragment ANY_CHAR :	ALPHA | SPECIAL_CHAR ;
 >
 >fragment WS_CHAR  : (' ' | '\t' | '\u000C' ) ;
 >
 >WS      :       WS_CHAR+;
 >
 >NEWLINE  : '\r'? '\n' ;
 >
 >LETTER_WORD :	LETTER+ ;
 >NUMBER_WORD  :	DIGIT+ ;
 >ALPHA_WORD  : ALPHA+ ;
 >FREE_WORD : ANY_CHAR+ ;

I think this is the problem: ALPHA_WORD is ambiguous with both 
LETTER_WORD and NUMBER_WORD, since each of those matches a subset 
of what ALPHA_WORD matches.  They're all token rules, and the lexer 
can't change its mind and turn a token into a different one later 
on, so it has to guess which one to use.  I suspect (given the 
order here) it'll pick LETTER_WORD if the text consists only of 
letters, and NUMBER_WORD if it consists only of digits.  It might 
pick ALPHA_WORD for mixed cases, but it might also generate a 
sequence of LETTER_WORDs and NUMBER_WORDs instead, which is 
another ambiguity.

The same applies to FREE_WORD, since it's a superset of all three 
of the other WORD rules.

There are dodges you can use to get around this sort of thing, but 
since your grammar is so interwoven I think you'd be better off 
making a minimalist lexer and doing most of the heavy lifting in 
the parser instead.  In other words, remove the ALPHA, ALPHA_WORD, 
ANY_CHAR and FREE_WORD rules, and make SPECIAL_CHAR an output rule 
(i.e. a real token rule, not a fragment).  Then write the 
equivalents of ALPHA_WORD and FREE_WORD as parser rules instead of 
lexer rules.
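
Something along these lines, as a rough and untested sketch in 
ANTLR 3 syntax.  The grammar name Fields and the parser rule names 
alphaWord and freeWord are made up for illustration.  The definition 
of SPECIAL_CHAR isn't in the part of the grammar quoted above, so the 
character set shown for it is only a placeholder guess taken from 
the sample line; substitute the real one.

```antlr
grammar Fields;   // hypothetical name

// Parser rules: the overlapping "word" categories are now built
// out of non-overlapping tokens, so the lexer never has to guess.
alphaWord : (LETTER_WORD | NUMBER_WORD)+ ;
freeWord  : (LETTER_WORD | NUMBER_WORD | SPECIAL_CHAR)+ ;

// Lexer rules: no token can match the same text as another one.
fragment DIGIT   : '0'..'9' ;
fragment LETTER  : 'a'..'z' | 'A'..'Z' ;
fragment WS_CHAR : (' ' | '\t' | '\u000C') ;

LETTER_WORD  : LETTER+ ;
NUMBER_WORD  : DIGIT+ ;

// Placeholder set only (assumption): use your original definition.
SPECIAL_CHAR : '&' | '^' | '%' | '$' ;

WS      : WS_CHAR+ ;
NEWLINE : '\r'? '\n' ;
```

With this, a mixed field such as A1B2 should reach the parser as 
LETTER_WORD NUMBER_WORD LETTER_WORD NUMBER_WORD, and alphaWord 
glues the pieces back together.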

You could even remove all the WORD rules and make each lexer token 
only contain a single character (except possibly for 
whitespace).  Probably wouldn't make a lot of difference in this 
case :)