[antlr-interest] Parsing simple file

Sat Nov 29 16:16:59 PST 2008

At 12:33 30/11/2008, Guido Amabili wrote:
 >I want to parse the string TEST 00125 . The result should be
 >tokenized like this name=TEST jobId=00 and mailpieceId=125.
 >The problem is that for token jobId, the lexer discards the first
 >three digits and matches the piece 25 (with an
 >UnwantedTokenException for 001) and for mailpieceId I get a
 >MissingTokenException
[...]
 >  THREE_DIGIT_CODE
 >   : DIGIT DIGIT DIGIT
 >   ;
 >  TWO_DIGIT_CODE
 >   :  DIGIT DIGIT
 >   ;
 >
 >ONE_DIGIT_CODE
 >   : DIGIT;

The key point about ANTLR that you seem unaware of is that all 
lexing is done up front, with no influence from the parser 
whatsoever.

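So for the input "00125" and the lexer rules above, ANTLR will 
create a THREE_DIGIT_CODE token ("001") followed by a 
TWO_DIGIT_CODE token ("25"), because at each position the lexer 
matches the rule that consumes the most input. Your parser 
presumably expects the two-digit code first, so it sees the tokens 
in the wrong order, hence the exceptions.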
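
To make that concrete, here is a rough simulation in plain Python 
(not ANTLR). The rule names are taken from the quoted grammar; the 
function lex_digits and the "always take the longest rule" loop are 
my own simplification of what the generated lexer does for this 
input, not ANTLR's actual code.

```python
# Toy simulation of lexing with no parser feedback (plain Python, not ANTLR).
# Rule names mirror the quoted grammar; lex_digits is a hypothetical helper.
DIGIT_RULES = {3: "THREE_DIGIT_CODE", 2: "TWO_DIGIT_CODE", 1: "ONE_DIGIT_CODE"}


def lex_digits(text):
    """Split a digit string into tokens, taking the longest rule each time."""
    tokens = []
    pos = 0
    while pos < len(text):
        if not text[pos].isdigit():
            raise ValueError(f"unexpected character {text[pos]!r}")
        # Try the longest rule first; the lexer never asks the parser
        # which token type it would prefer at this point.
        for length in (3, 2, 1):
            chunk = text[pos:pos + length]
            if len(chunk) == length and chunk.isdigit():
                tokens.append((DIGIT_RULES[length], chunk))
                pos += length
                break
    return tokens


print(lex_digits("00125"))
# [('THREE_DIGIT_CODE', '001'), ('TWO_DIGIT_CODE', '25')]
```

The split is 001 / 25 rather than the 00 / 125 you wanted, which 
matches the errors you are seeing.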

In this sort of situation, you should probably make DIGIT a 
top-level (non-fragment) lexer rule and turn each of 
THREE_DIGIT_CODE, TWO_DIGIT_CODE, and ONE_DIGIT_CODE into parser 
rules instead.

This does mean that each rule will match multiple DIGIT tokens, of 
course, but you can combine these later (e.g. when generating an 
AST). Since nothing lexical marks the division between the two 
codes (it is determined by parser structure), the distinction 
belongs in the parser.
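
Continuing in plain Python rather than ANTLR, the suggested 
arrangement might be sketched as follows. The lexer emits one DIGIT 
token per digit, and the "parser" decides how many belong to each 
field. The NAME token, the helper take, and the function parse_line 
are hypothetical, since the rest of your grammar was elided; the 
field widths (two digits, then three) come from your description.

```python
# Sketch of "one DIGIT token per digit, grouping done by the parser".
# Plain Python, not ANTLR; NAME, take and parse_line are hypothetical.


def lex(text):
    """Emit one NAME token per run of letters and one DIGIT token per digit."""
    tokens = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            pos += 1  # whitespace is skipped
        elif ch.isdigit():
            tokens.append(("DIGIT", ch))
            pos += 1
        elif ch.isalpha():
            end = pos
            while end < len(text) and text[end].isalpha():
                end += 1
            tokens.append(("NAME", text[pos:end]))
            pos = end
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return tokens


def take(tokens, pos, kind, count):
    """Consume `count` tokens of type `kind`; return joined text and new index."""
    chunk = tokens[pos:pos + count]
    if len(chunk) != count or any(k != kind for k, _ in chunk):
        raise ValueError(f"expected {count} {kind} token(s) at index {pos}")
    return "".join(text for _, text in chunk), pos + count


def parse_line(text):
    """Parser-level structure: a NAME, a two-digit jobId, a three-digit mailpieceId."""
    tokens = lex(text)
    name, pos = take(tokens, 0, "NAME", 1)
    job_id, pos = take(tokens, pos, "DIGIT", 2)
    mailpiece_id, pos = take(tokens, pos, "DIGIT", 3)
    if pos != len(tokens):
        raise ValueError("trailing input after mailpieceId")
    return {"name": name, "jobId": job_id, "mailpieceId": mailpiece_id}


print(parse_line("TEST 00125"))
# {'name': 'TEST', 'jobId': '00', 'mailpieceId': '125'}
```

The lexer no longer has to guess where one code ends and the next 
begins; joining the digits back into a string is the "combine these 
later" step.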