[antlr-interest] on "crap" grammars

Justin Murray jmurray at aerotech.com
Thu Jul 21 09:42:20 PDT 2011


As Jim pointed out, tokens show up in your error messages as <invalid> 
because you inlined the literals 'int' and 'float' in your "type" rule 
without giving them names. Try making two real lexer rules with the 
names you would like to see (put them above NAME so that NAME does not 
shadow them):

INT : 'int';
FLOAT : 'float';
type : INT | FLOAT;

If you look at the generated C code, you will see how it derives the 
string it prints from this name. It is also fairly simple to override 
the printed string on a case-by-case basis if that seems appropriate for 
your errors. This may be necessary if you discover that the #defines 
generated for INT and FLOAT conflict with other defines used in your 
code and libraries. You can solve this generically by adding an 
underscore to the end of each name (INT_ and FLOAT_) and then stripping 
off the last character in your error handler.
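
For the underscore trick, something along these lines should do. This is 
only a rough sketch in plain C: display_token_name is a made-up helper, 
not part of the ANTLR runtime, and it assumes the names reach your error 
handler as ordinary NUL-terminated strings (in the C runtime I believe 
the hook to replace is the recognizer's displayRecognitionError pointer, 
but check the generated code and headers for your version).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an ANTLR runtime function): copy a token name
 * into buf for display, dropping one trailing underscore if there is one,
 * so that a rule named INT_ is reported to the user as INT.
 * The result is truncated to fit buf and is always NUL-terminated. */
static const char *display_token_name(const char *name, char *buf, size_t buflen)
{
    size_t len;

    assert(name != NULL && buf != NULL);
    if (buflen == 0) {
        return "";
    }

    len = strlen(name);
    if (len > 1 && name[len - 1] == '_') {
        len--;                      /* strip the renaming suffix */
    }
    if (len >= buflen) {
        len = buflen - 1;           /* leave room for the terminator */
    }
    memcpy(buf, name, len);
    buf[len] = '\0';
    return buf;
}
```

In your error handler you would pass each expected token's name through 
this before printing it. Names without a trailing underscore (NAME, WS) 
come back unchanged, so it is safe to apply to every token.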

- Justin

On 7/21/2011 11:42 AM, Vlad wrote:
> This test grammar was called "crap" by Jim Idle. I am willing to eat the humble pie and admit where I am an ANTLR novice or don't know something about grammars, but I am just not seeing it in this simple case:
>
> grammar testerrors;
>
> options
> {
>      language='C';
> }
>
> NAME    :   ( 'a'..'z' | 'A'..'Z' | '0'..'9' )+ ;
> WS      :   ( ' ' | '\t' | '\r' | '\n' )+ { $channel = HIDDEN; } ;
>
> parse:
>      decl ( options { greedy = true; }: ',' decl )* ','? EOF
>      ;
>
> decl:
>      NAME ':' type
>      ;
>
> type:
>      'int' | 'float'
>      ;
>
> The start symbol is a comma-delimited list of simple '<name> : <type>' declarations and allows the list to optionally end in a comma, as is done in some languages (Python, etc.). This is a pretty common way to structure it. In JavaCC, for example, you'd use a local LOOKAHEAD(2) inside the ()* to disambiguate the choice between matching one more decl or ending the list. Without it and with the default k=1, JavaCC emits an ambiguity warning at parser generation time. In ANTLR's case, the ambiguity can be dealt with similarly, with a local k=2 option or the way it is done above (which I borrowed from http://www.antlr.org/grammar/1200715779785/Python.g). Without either, ANTLR also emits a warning at parser generation time. All of this seems to work as expected.
>
> So, what is so obviously wrong with the grammar snippet that deserves the "crap" moniker? I am learning ANTLR because I want to add a multi-target parser generator tool to my skill set. For Java work, JavaCC is still out there and generates fast parsers, has good error handling, and can build ASTs/visitors. In C++, I would normally do a simple case like this via boost.spirit but it's a bit of a template metaprogramming monster. With ANTLR I am successfully compiling my C parser within a larger C++ codebase and the only learning curve issues are odd error messages on relatively trivial input errors, where ANTLR can't seem to identify the token it is expecting. E.g., input "name : bad" results in
>
> -memory-(1)  : error 10 : Unexpected token, at offset 6
>      near [Index: 0 (Start: 0-Stop: 0) ='<missing<invalid>>', type<0>  Line: 1 LinePos:6]
>       : Missing<invalid>
>
> I would be happy to get specific pointers to docs and articles on how to improve error handling by ANTLR *C* parsers. At least being able to modify the stock error display function to tackle the common case of mis-spelling a token name would be great.
>
> Thank you,
> Vlad

