[antlr-interest] Parsing simple file
Gavin Lambert
antlr at mirality.co.nz
Sat Nov 29 16:16:59 PST 2008
At 12:33 30/11/2008, Guido Amabili wrote:
>I want to parse the string TEST 00125 . The result should be
>tokenized like this name=TEST jobId=00 and mailpieceId=125.
>The problem is that for token jobId, the lexer discards the first
>three digits and matches the piece 25 (with an
>UnwantedTokenException for 001) and for mailpieceId I get a
>MissingTokenException
[...]
> THREE_DIGIT_CODE
> : DIGIT DIGIT DIGIT
> ;
> TWO_DIGIT_CODE
> : DIGIT DIGIT
> ;
>
>ONE_DIGIT_CODE
> : DIGIT;
The key point about ANTLR that you seem unaware of is that all
lexing is done up front, with no influence from the parser
whatsoever.
So for the input "00125" and the lexer rules above, ANTLR will
create a THREE_DIGIT_CODE token ("001") followed by a
TWO_DIGIT_CODE token ("25"), since at each point the lexer matches
as much input as it can.
For this sort of situation, probably what you ought to do is to
change DIGIT to be a top-level (non-fragment) rule and each of
THREE_DIGIT_CODE, TWO_DIGIT_CODE, and ONE_DIGIT_CODE to be parser
rules instead.
This does mean that you'll get multiple DIGIT tokens matched in
each rule, of course, but you can combine these later (e.g. if
you're generating an AST). Since there's nothing lexically
obvious about the division between them (it's influenced by parser
structure), the distinction belongs in the parser.
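As a rough, untested sketch of that restructuring in ANTLR 3 syntax:
the rest of your grammar was elided, so the grammar name, the
`record` rule, and the NAME and WS rules below are placeholders I've
assumed rather than anything taken from your file.

```antlr
grammar MailPiece;   // hypothetical grammar name

// Assumed top-level rule: a name, then a two-digit job id,
// then a three-digit mailpiece id.
record
    : name=NAME jobId=twoDigitCode mailpieceId=threeDigitCode
    ;

// The former lexer rules, now parser rules built from DIGIT tokens.
threeDigitCode : DIGIT DIGIT DIGIT ;
twoDigitCode   : DIGIT DIGIT ;
oneDigitCode   : DIGIT ;

// DIGIT is now a real token (no 'fragment' keyword).
DIGIT : '0'..'9' ;

// Assumed rules for the name and for skipping whitespace.
NAME  : ('A'..'Z')+ ;
WS    : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;
```

With something like this, "TEST 00125" should lex as NAME followed
by five separate DIGIT tokens, and the parser decides that the first
two belong to jobId and the remaining three to mailpieceId.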