[antlr-interest] IDENTifier rule not working for some tokens

Tue Oct 21 09:04:47 PDT 2008

On Tue, 2008-10-21 at 11:28 +0200, brainstorm wrote:

> grammar CL;
> 
> options
> {
>         output = AST;
> //      backtrack = true;

Don't use this unless there is no other readable way.

> //      caseSensitive = false;  // no case insensitivity implemented
> yet on ANTLRv3, see: 
>                                 //
> http://www.antlr.org/wiki/pages/viewpage.action?pageId=1782
>                                 // and:
>                                 //
> http://www.antlr.org/pipermail/antlr-interest/2007-January/019008.html
>                                 //
>                                 // Update: There's no need for the
> above howto's, defining the WS special token (below) is sufficient

I am not sure how your WS token definition helps you with case
insensitivity. Calling the builtin method in the C runtime to ask for
case insensitive matching is a lot easier than specifying all
combinations of tokens, and doing the latter may be why your 'bb' isn't
working. The override for Java works the same way.

If you are using ANTLRWorks, are you using the interpreter or the
debugger? Use the debugger pretty much always; the interpreter is only
handy for a quick verification of a sub rule and so on.

>                                 
> //      memoize = true;
> }
> 
> tokens {
>         IDENT;

Take this IDENT def out too, you are defining it in the lexer.

> //      PROGRAM; ENDPROGRAM; //...
>         // waste of time, better define tokens on the grammar using ''
>         // ANTLR will generate this list for us (CL.tokens).
> }

Especially when you are new to ANTLR 3, it is better not to use literals
in the grammar but to define tokens in the lexer. Ambiguities between
lexer tokens are obscured when you use grammar level literals, and typos
change things in ways that are difficult to discover.
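
As a rough sketch of what that looks like for the first rules of this
grammar (the token names PROGRAM, ENDPROGRAM, VARS and ENDVARS are
assumptions made for illustration; the original grammar only has the
quoted literals):

```antlr
// Parser rules refer to named tokens rather than quoted literals.
program    : PROGRAM^ l_dec_vars l_dec_blocs l_instrs ENDPROGRAM! ;
l_dec_vars : ( VARS! dec_var* ENDVARS! | ) ;

// Hypothetical lexer rules, one per keyword. Keeping them above IDENT
// means the keyword rule is preferred if both could match the same text.
PROGRAM    : 'PROGRAM' ;
ENDPROGRAM : 'ENDPROGRAM' ;
VARS       : 'VARS' ;
ENDVARS    : 'ENDVARS' ;
```

With named tokens, a misspelled token name in a parser rule has no
matching lexer rule, which ANTLR can warn about, whereas a misspelled
literal quietly becomes a new token.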

> 
> program:        'PROGRAM'^ l_dec_vars l_dec_blocs l_instrs
> 'ENDPROGRAM'!;
>         
> l_dec_vars:     ('VARS'! (dec_var)* 'ENDVARS'!|);
>         
> dec_var:        IDENT^ constr_type;
> 
> l_dec_blocs:    (dec_bloc)*;
> 
> dec_bloc:       'PROCEDURE'^ header_proc l_dec_vars l_dec_blocs
> l_instrs 'ENDPROCEDURE'! 

This is where literals start to get confusing: 'PROCEDURE'^ vs PROCEDURE
and a lexer definition.

> 
> constr_type:    ('INT' | 'Int' | 'int')  |  ('BOOL'|'bool')

And this is probably where your bb goes wrong. I would bet that the
lexer is starting to match 'bool' and gives up. Also, your case
insensitivity is inconsistent: you accept 'Int' but not 'Bool'. These
really need to be ('I'|'i')('N'|'n')('T'|'t') and so on.
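
Spelled out in full, the per-letter alternation could look something
like this. It is only a sketch: the token names INT and BOOL are made up
here, and the rules are assumed to sit above IDENT in the lexer so that
they take precedence for input such as 'int'.

```antlr
// Hypothetical lexer rules: every letter is matched in either case,
// so INT, Int, int, BOOL, Bool and bool are all accepted.
INT  : ('I'|'i') ('N'|'n') ('T'|'t') ;
BOOL : ('B'|'b') ('O'|'o') ('O'|'o') ('L'|'l') ;

// The parser rule then only names the two token types.
constr_type : INT | BOOL ;
```

Writing every keyword this way gets tedious quickly, which is the
argument for asking the runtime for case insensitive matching instead.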

> expsimple:      IDENT^ (func_call|) ('['^ expression ']'!)* ('.'^
> IDENT ('['^ expression ']'!|))*       //XXX: match bb as identifier

You probably are not getting here for input 'bb'.

> 
> IDENT   :       'a'..'z' ('a'..'z'|'A'..'Z'|'0'..'9')* ;
> INTCONST:       ('0'..'9')+ ;
> 
> WS      :       (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN;};
> COMMA   :       ',';
> STRING  :       '~[]*'; // XXX

Not sure what STRING is meant to match, but you know it matches the
literal string ~[]*, right?

So:

1) Use the case insensitive match routine and specify your lexer tokens
in UPPERCASE;
2) Take the literals out and make real lexer tokens;

Then you will at least be able to see what is going wrong (but my
feeling is that it won't go wrong when you do that, unless your STRING
rule is really ~('['|']')*). If you are just trying to pick up any
characters that are not matched by anything else, just use:

unknown : ANY+;
ANY : . ;
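
Putting points 1) and 2) together, the type keywords and the identifier
rule might look like the following. This is only a sketch under stated
assumptions: it assumes the input stream has been set up to upper-case
characters during lookahead (the case insensitive matching mentioned
earlier), and the token names INT and BOOL are invented for
illustration.

```antlr
// Assumes the runtime upper-cases lookahead, so each keyword is listed
// once, in uppercase only.
INT      : 'INT' ;
BOOL     : 'BOOL' ;

// IDENT is written in uppercase too, since that is all the lexer sees.
IDENT    : 'A'..'Z' ('A'..'Z'|'0'..'9')* ;
INTCONST : ('0'..'9')+ ;

// Parser rule using the real tokens instead of quoted literals.
constr_type : INT | BOOL ;
```

Under those assumptions, bool, Bool and BOOL would all be expected to
come out as BOOL, and bb as IDENT. Note that this version of IDENT also
admits identifiers that begin with an uppercase letter, which the
original rule did not.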

Jim
