[antlr-interest] Need help parsing text format
Gavin Lambert
antlr at mirality.co.nz
Mon Apr 30 05:00:53 PDT 2007
At 23:46 30/04/2007, Isabelle Muszynski wrote:
>My grammar is shown at the end of this mail.
>The problem is that it won't parse the cases where alphanumeric
>fields only contain for ex. letters :
>
>BCSABC/12.CHARLIE/HAM-BRE.1/bla&&^^%%$$bla.3
[...]
>fragment DIGIT : '0'..'9' ;
>
>fragment LETTER : 'a'..'z' | 'A'..'Z' ;
>
>fragment ALPHA : LETTER | DIGIT;
>
>fragment ANY_CHAR : ALPHA | SPECIAL_CHAR ;
>
>fragment WS_CHAR : (' ' | '\t' | '\u000C' ) ;
>
>WS : WS_CHAR+;
>
>NEWLINE : '\r'? '\n' ;
>
>LETTER_WORD : LETTER+ ;
>NUMBER_WORD : DIGIT+ ;
>ALPHA_WORD : ALPHA+ ;
>FREE_WORD : ANY_CHAR+ ;
I think this is the problem: ALPHA_WORD is ambiguous with both
LETTER_WORD and NUMBER_WORD, since each of those matches a subset of
what ALPHA_WORD matches. They're all real (non-fragment) token rules,
and the lexer can't change its mind and turn a token into a different
one later on, so it has to guess which one to use. I suspect (given
the order here) it'll pick LETTER_WORD if the text consists only of
letters, and NUMBER_WORD if it consists only of digits. It might pick
ALPHA_WORD for mixed cases, but then it also might generate a
combination of LETTER_WORDs and NUMBER_WORDs instead, which is another
ambiguity. The same applies to FREE_WORD, since that's a superset of
all three.
There are dodges you can use to get around this sort of thing, but
since your grammar is so interwoven I think you'd be better off
making a minimalist lexer and doing most of the heavy lifting in the
parser instead. In other words, remove the ALPHA, ALPHA_WORD,
ANY_CHAR, and FREE_WORD rules, and make SPECIAL_CHAR a real token
rule (i.e. drop the "fragment" keyword). Then write equivalents of
ALPHA_WORD and FREE_WORD as parser rules instead of lexer rules.
You could even remove all the WORD rules and make each lexer token
only contain a single character (except possibly for
whitespace). Probably wouldn't make a lot of difference in this
case :)
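
A minimal sketch of the lexer/parser split described above, assuming
ANTLR 3 combined-grammar syntax. The parser rule names alphaWord and
freeWord are hypothetical (they are not in the original grammar), and
SPECIAL_CHAR is left as a comment because its definition was elided
from the quoted post; the sketch assumes it does not overlap with
letters or digits.

```antlr
// Sketch only, not a complete grammar.
// alphaWord and freeWord are hypothetical names for illustration.

// Lexer: only tokens that do not overlap each other.
fragment DIGIT  : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;

LETTER_WORD : LETTER+ ;
NUMBER_WORD : DIGIT+ ;
// SPECIAL_CHAR : keep the original definition (elided in the quoted
//                post), but without the 'fragment' keyword.

// Parser: the former ALPHA_WORD and FREE_WORD, rebuilt from tokens.
alphaWord : (LETTER_WORD | NUMBER_WORD)+ ;
freeWord  : (LETTER_WORD | NUMBER_WORD | SPECIAL_CHAR)+ ;
```

With this shape the lexer never has to choose between a token and its
superset. Whether the parser can then tell alphaWord from freeWord
depends on where the enclosing rules use them, so some further
refactoring may still be needed there.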