[antlr-interest] on "crap" grammars

Thu Jul 21 08:42:09 PDT 2011

This test grammar was called "crap" by Jim Idle. I am willing to eat humble pie and admit that I am an ANTLR novice or that I don't know something about grammars, but I am just not seeing the problem in this simple case:

grammar testerrors;

options
{
    language='C';
}

NAME    :   ( 'a'..'z' | 'A'..'Z' | '0'..'9' )+ ;
WS      :   ( ' ' | '\t' | '\r' | '\n' )+ { $channel = HIDDEN; } ;

parse:
    decl ( options { greedy = true; }: ',' decl )* ','? EOF
    ;

decl:
    NAME ':' type
    ;

type:
    'int' | 'float'
    ; 

The start symbol is a comma-delimited list of simple '<name> : <type>' declarations, and it allows the list to optionally end in a comma, as some languages do (Python, etc.). This is a pretty common way to structure it. In JavaCC, for example, you'd use a local LOOKAHEAD(2) inside the ()* to disambiguate the choice between matching one more decl and ending the list. Without it, and with the default k=1, JavaCC emits an ambiguity warning at parser generation time. In ANTLR's case, the ambiguity can be dealt with similarly, either with a local k=2 option or with the greedy option used above (which I borrowed from http://www.antlr.org/grammar/1200715779785/Python.g). Without either, ANTLR also emits a warning at parser generation time. All of this seems to work as expected.
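
To make the two-token decision concrete, here is a minimal hand-written sketch in C of the same list rule. It is not ANTLR- or JavaCC-generated code; the names (Tok, next_token, peek, parse_decl, parse_list) are made up for illustration, and the keyword handling is only my approximation of how 'int' and 'float' take priority over NAME. The point is the loop condition: one token (',') is not enough to choose, and the second token settles it.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Token kinds for the toy declaration-list language (hypothetical names). */
typedef enum { T_NAME, T_INT, T_FLOAT, T_COLON, T_COMMA, T_EOF, T_BAD } Tok;

/* Scan one token starting at *p, skipping whitespace, and advance *p. */
static Tok next_token(const char **p) {
    while (**p == ' ' || **p == '\t' || **p == '\r' || **p == '\n') (*p)++;
    if (**p == '\0') return T_EOF;
    if (**p == ':') { (*p)++; return T_COLON; }
    if (**p == ',') { (*p)++; return T_COMMA; }
    if (isalnum((unsigned char)**p)) {
        const char *start = *p;
        while (isalnum((unsigned char)**p)) (*p)++;
        size_t n = (size_t)(*p - start);
        /* Assumption: keywords win over NAME, as with literal tokens. */
        if (n == 3 && strncmp(start, "int", 3) == 0) return T_INT;
        if (n == 5 && strncmp(start, "float", 5) == 0) return T_FLOAT;
        return T_NAME;
    }
    (*p)++;
    return T_BAD;
}

/* Peek at the k-th upcoming token (k = 1 is the next one) without consuming.
   The pointer is passed by value, so the caller's position is unchanged. */
static Tok peek(const char *p, int k) {
    Tok t = T_EOF;
    while (k-- > 0) t = next_token(&p);
    return t;
}

/* decl : NAME ':' type ;   type : 'int' | 'float' ; */
static int parse_decl(const char **p) {
    if (next_token(p) != T_NAME) return 0;
    if (next_token(p) != T_COLON) return 0;
    Tok t = next_token(p);
    return t == T_INT || t == T_FLOAT;
}

/* parse : decl ( ',' decl )* ','? EOF ;  returns 1 on success, 0 on error. */
int parse_list(const char *src) {
    assert(src != NULL);
    const char *p = src;
    if (!parse_decl(&p)) return 0;
    /* A lone ',' cannot separate "another decl" from "trailing comma";
       the second token can: NAME continues the loop, anything else exits. */
    while (peek(p, 1) == T_COMMA && peek(p, 2) == T_NAME) {
        next_token(&p);                          /* consume ',' */
        if (!parse_decl(&p)) return 0;
    }
    if (peek(p, 1) == T_COMMA) next_token(&p);   /* optional trailing ',' */
    return next_token(&p) == T_EOF;
}
```

With this sketch, parse_list accepts "a : int, b : float," and rejects "name : bad", which is the behaviour I expect from the grammar above.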

So, what is so obviously wrong with the grammar snippet that it deserves the "crap" moniker? I am learning ANTLR because I want to add a multi-target parser generator tool to my skill set. For Java work, JavaCC is still out there: it generates fast parsers, has good error handling, and can build ASTs/visitors. In C++, I would normally handle a simple case like this with Boost.Spirit, but it's a bit of a template metaprogramming monster. With ANTLR I am successfully compiling my C parser within a larger C++ codebase, and the only learning-curve issues are odd error messages on relatively trivial input errors, where ANTLR can't seem to identify the token it is expecting. E.g., the input "name : bad" results in

-memory-(1)  : error 10 : Unexpected token, at offset 6
    near [Index: 0 (Start: 0-Stop: 0) ='<missing <invalid>>', type<0> Line: 1 LinePos:6]
     : Missing <invalid>

I would be happy to get specific pointers to docs and articles on how to improve error handling in ANTLR *C* parsers. At the very least, being able to modify the stock error display function to handle the common case of a misspelled token name would be great.

Thank you,
Vlad