[antlr-interest] Newbie conundrums

Tue Aug 21 20:48:14 PDT 2007

Jan-Willem van den Broek wrote:

> The grammar below describes this text:
> 
> grammar Silly;
> 
> line : CODE TIME NUMBER ;
> 
> CODE : D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D ;
> TIME : D D C D D C D D ;
> NUMBER : D+ ;
> 
> fragment D : '0'..'9' ;
> fragment C : ':' ;
 >
 > The second awkward thing is that this grammar doesn't work :-)
 >

In this case, you probably want to handle everything in the lexer. 
Replace "line" with "LINE" and you will find that ANTLR does what you 
want it to do.

 > If I test it (in ANTLRWorks) on
 > "01234567890123456789012345678912:59:300123456789012345678901234567",
 > then everything before the ":" is matched by the NUMBER rule. Of
 > course this is completely logical for the lexer to do, but it still
 > sucks. The filter option (and giving precedence to CODE over NUMBER)
 > doesn't really solve the problem either, since it's perfectly valid to
 > have the number be of length 30 too, and in that scenario the number
 > would result in a CODE token.

In ANTLR, the lexer and the parser use the same kind of recognition 
mechanism, but they run as two separate programs.  To understand the 
behavior above, think of your grammar that way.  You have the lexer 
trying to match CODE, TIME, and NUMBER.  The lexer doesn't know the 
parser exists.  It reads a long string of digits, hits a ":", and thinks 
"ok, everything up to the ':' fits nothing but NUMBER, so make that 
string a NUMBER."  It has no idea that you are interested in putting a 
TIME after a CODE.  It just matches the longest string that it can.

If you put LINE in the lexer, the lexer now knows the context in which 
you want CODE, TIME, and NUMBER to appear.
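
To make that concrete, here is a rough, untested sketch of a lexer-only 
version of your grammar.  Turning CODE, TIME, and NUMBER into fragment 
rules and keeping a one-token "line" parser rule are my own assumptions 
for illustration; they are not part of your original grammar.

```
grammar Silly;

// The parser now sees one token per line.
line : LINE ;

// The lexer rule spells out the order of the three pieces.
LINE : CODE TIME NUMBER ;

// Fragments never become tokens on their own, so CODE and
// NUMBER no longer compete for the same run of digits.
fragment CODE
    : D D D D D D D D D D
      D D D D D D D D D D
      D D D D D D D D D D
    ;
fragment TIME   : D D C D D C D D ;
fragment NUMBER : D+ ;

fragment D : '0'..'9' ;
fragment C : ':' ;
```

With this layout the lexer hands back a single LINE token, so you would 
still have to pull the three pieces out of the token text yourself.  I 
would also expect any trailing newline or whitespace to need its own 
rule.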

 > Finally, the surprising thing is that if I try this parser (again in
 > ANTLRWorks) on the following input below (notice the "fff" between the
 > code and the time), then it parses code, time, and number, without
 > complain, ignoring the "ffff" in the process. Is this automatic error
 > correction? If so, can I turn on warnings somewhere?
 >
 > The input is:
 > "012345678901234567890123456789fff12:59:300123456789012345678901234567"

In this case the exact same thing is happening that happened above.  The 
lexer does not match the "fff" to a token because "fff" doesn't match 
anything in your grammar, but it does serve to break up your string. 
ANTLR hits the first "f" and thinks, "geez, now nothing matches, but 
before the 'f' the first match I could make is a CODE, so let's tokenize 
it as a CODE."  Note that if you placed NUMBER before CODE in the 
grammar, this string would match as a NUMBER and nothing would match as 
a CODE.  ANTLR then skips over the "fff" and starts matching the TIME 
and NUMBER.  The token sequence CODE TIME NUMBER is returned to the 
parser, and your parser finds a match.

When you are trying to break complex strings into parts, I have found 
that it is usually best to deal with it in the lexer.  Depending on your 
app, you may not even need a parser.

-Chris