[antlr-interest] IDENTifier rule not working for some tokens

brainstorm braincode at gmail.com
Wed Oct 22 15:27:02 PDT 2008


Thanks a lot for your suggestions! :) I'm starting a new svn branch
based on your advice, but I have some questions (replies and questions
inline).

On Tue, Oct 21, 2008 at 6:04 PM, Jim Idle <jimi at temporal-wave.com> wrote:
> On Tue, 2008-10-21 at 11:28 +0200, brainstorm wrote:
>
> grammar CL;
>
> options
> {
>         output = AST;
> //      backtrack = true;
>
> Don't use this unless there is no other readable way.


What do you mean by that? By the way, when output is not defined,
ANTLR seems to fall back to output=AST on its own:

warning(149): CL.g:0:0: rewrite syntax or operator with no output option; setting output=AST



>
> If you are using ANTLRWorks, are you using the interpreter or the debugger -
> use the debugger pretty much always - the interpreter is handy for a quick
> verification of a sub rule and so on.
>


Yes, I do use ANTLRWorks with the jp{2,4} files I attached to my
previous email :)


>
> Especially when you are new to ANTLR 3 it is better not to use literals in
> the grammar but to define tokens in the lexer. The ambiguities between lexer
> tokens are obfuscated when using grammar level literals and typos change
> things that are difficult to discover.
>


I really see your point (and I'm applying it in my grammar), but have
a look at public grammars on:

http://www.antlr.org/grammar/list

Take the first, Java 1.5, for instance:

http://www.antlr.org/grammar/1152141644268/Java.g

They use literals directly in the grammar instead of declaring them
first in tokens {}. I think it's better to:

1) Let the tool (ANTLR) do the job for you (grammar.tokens).
2) Avoid duplicating code (token declarations vs. plain literals in
the grammar itself).


In fact, I hit a problem when defining those tokens:

tokens {
(... other tokens defined...)
INT = 'INT';
}

If I just declare "INT" (only LHS), ANTLR complains:

warning(105): CL.g:120:14: no lexer rule corresponding to token: INT

I have to keep writing redundant statements like INT = 'INT'; why is that?
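
To make the question concrete, these are the three forms I'm comparing.
They are alternatives, not one grammar. Form C is only my assumption of
what a "real lexer token" would look like; I have not confirmed it is the
recommended spelling.

```
// Form A: token name bound to a literal inside tokens {} (what I have now)
tokens {
    INT = 'INT';
}

// Form B: bare declaration (only LHS); this is the one that gives me
// warning(105): no lexer rule corresponding to token: INT
tokens {
    INT;
}

// Form C (assumption): no tokens {} entry at all, an explicit lexer rule instead
INT : 'INT' ;
```

My guess is that form B only reserves a token type and gives the lexer
nothing to match, which would explain the warning, but I'd like
confirmation.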


>
> program:        'PROGRAM'^ l_dec_vars l_dec_blocs l_instrs 'ENDPROGRAM'!;
>
> l_dec_vars:     ('VARS'! (dec_var)* 'ENDVARS'!|);
>
> dec_var:        IDENT^ constr_type;
>
> l_dec_blocs:    (dec_bloc)*;
>
> dec_bloc:       'PROCEDURE'^ header_proc l_dec_vars l_dec_blocs l_instrs
> 'ENDPROCEDURE'!
>
> This is where literals start to get confusing 'PROCEDURE'^ vs PROCEDURE and
> a lexer definition.


I don't see what's confusing here: there's no PROCEDURE, only
'PROCEDURE', no imaginary tokens, just the token itself in the right
place, isn't it?


>
> constr_type:    ('INT' | 'Int' | 'int')  |  ('BOOL'|'bool')
>
> And this is probably where your bb goes wrong. I would bet that the lexer is
> starting to match 'bool' and gives up. Also, your case insensitivity is
> inconsistent with Int and Bool. These really need to be ( ('I'|'i')('N'|'n')
> and so on.


Ahhh, ok, now I see why it's that way on the "nocase" wiki page... thanks ! :)
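
As far as I understand the suggestion, a case-insensitive keyword would be
spelled letter by letter, roughly as below. This is only a sketch: the
single-letter fragment rule names are my own invention, and I'm assuming
that keyword rules listed before IDENT win when both could match the same
text.

```
// Keywords built from per-letter fragments, so INT, Int and int all match.
INT   : I N T ;
BOOL  : B O O L ;

// IDENT unchanged from my grammar; placed after the keyword rules.
IDENT : 'a'..'z' ('a'..'z'|'A'..'Z'|'0'..'9')* ;

// Hypothetical helper fragments: they never produce tokens themselves.
fragment B : 'B' | 'b' ;
fragment I : 'I' | 'i' ;
fragment L : 'L' | 'l' ;
fragment N : 'N' | 'n' ;
fragment O : 'O' | 'o' ;
fragment T : 'T' | 't' ;
```

The parser rule would then refer to the token names instead of the
literals, e.g. constr_type: INT | BOOL;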


> expsimple:      IDENT^ (func_call|) ('['^ expression ']'!)* ('.'^ IDENT
> ('['^ expression ']'!|))*       //XXX: match bb as identifier
>
> You probably are not getting here for input 'bb'.
>
>
> IDENT   :       'a'..'z' ('a'..'z'|'A'..'Z'|'0'..'9')* ;
> INTCONST:       ('0'..'9')+ ;
>
> WS      :       (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN;};
> COMMA   :       ',';
> STRING  :       '~[]*'; // XXX
>
> Not sure what STRING is meant to match, but you know it matches the literal
> string ~[]* right?


Ok, that was just the simple definition from my C++-based version; I'll
try to adjust it to the new syntax...
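
A first attempt in ANTLR 3 syntax might look like the rule below. Note
that the double-quote delimiters and the single-line restriction are
assumptions made for this sketch, not something the current grammar
already states.

```
// Assumption: a string is enclosed in double quotes and stays on one line.
// ~( ... ) matches any single character except the ones listed.
STRING : '"' ( ~('"' | '\r' | '\n') )* '"' ;
```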


> So:
>
> 1) Use the case insensitive match routine and specify our lexer tokens in
> UPPERCASE;


OK, done :)


> 2) Take the literals out and make real lexer tokens;


I would like to see a much cleaner (non-redundant) way to do that, if
possible. Thanks again for your attention and patience! :)


> Then you will at least be able to see what is going wrong (but my feeling is
> that it won't go wrong when you do that, unless your STRING rule is really
> ~('['| ']')* ?). If you are just trying to pick up any characters that are not
> matched by anything else, just use:
>
> unknown : ANY+;
> ANY : . ;
>
> Jim