[antlr-interest] IDENTifier rule not working for some tokens
brainstorm
braincode at gmail.com
Wed Oct 22 15:27:02 PDT 2008
Thanks a lot for your suggestions! :) I'm starting a new svn branch
following your advice, but I have some questions (replies and
questions inline).
On Tue, Oct 21, 2008 at 6:04 PM, Jim Idle <jimi at temporal-wave.com> wrote:
> On Tue, 2008-10-21 at 11:28 +0200, brainstorm wrote:
>
> grammar CL;
>
> options
> {
> output = AST;
> // backtrack = true;
>
> Don't use this unless there is no other readable way.
What do you mean by that? By the way, it looks like it's what ANTLR
prefers when output is not defined:
warning(149): CL.g:0:0: rewrite syntax or operator with no output option; setting output=AST
>
> If you are using ANTLRWorks, are you using the interpreter or the debugger -
> use the debugger pretty much always - the interpreter is handy for a quick
> verification of a sub rule and so on.
>
Yes, I do use ANTLRWorks, with the jp{2,4} files I attached to my
previous email :)
>
> Especially when you are new to ANTLR 3 it is better not to use literals in
> the grammar but to define tokens in the lexer. The ambiguities between lexer
> tokens are obfuscated when using grammar level literals and typos change
> things that are difficult to discover.
>
I really see your point (and I'm applying it in my grammar), but have
a look at the public grammars at:
http://www.antlr.org/grammar/list
Take the first one, Java 1.5, for instance:
http://www.antlr.org/grammar/1152141644268/Java.g
They use literals directly in the grammar instead of declaring them
first in tokens {}. I think it's better to:
1) Let the tool (ANTLR) do the job for you (grammar.tokens).
2) Not duplicate code (token declarations vs. simple literal
definitions in the grammar itself).
In fact, I hit a problem when defining those tokens:
tokens {
(... other tokens defined...)
INT = 'INT';
}
If I just declare "INT" (only LHS), ANTLR complains:
warning(105): CL.g:120:14: no lexer rule corresponding to token: INT
I have to keep writing redundant statements like INT = 'INT'; why is that?
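
A minimal sketch of the difference, assuming ANTLR 3 semantics. As far as I can tell, a bare name in tokens {} only declares a token type (typically an imaginary node for AST rewrites), so the lexer has no text to match for it and ANTLR warns when a parser rule references it. Either the NAME = 'literal' form or an explicit lexer rule supplies that text. The grammar name TokensSketch and the token BLOCK are hypothetical, used only for illustration:

```antlr
grammar TokensSketch;   // hypothetical grammar name, for illustration only

tokens {
    INT = 'INT';        // token type plus the literal text the lexer should match
    BLOCK;              // bare name (hypothetical): a token type only, no text to match
}

// Alternative to INT = 'INT' above: remove it from tokens {} and
// write an explicit lexer rule instead:
// INT : 'INT' ;

dec_var : IDENT INT ;   // the parser refers to the token by name in either case

IDENT : 'a'..'z' ('a'..'z'|'A'..'Z'|'0'..'9')* ;
WS    : (' '|'\t'|'\r'|'\n')+ {$channel=HIDDEN;} ;
```

Seen this way, INT = 'INT' is not quite redundant: the left side names the token type and the right side is the only place the lexer learns what input produces it.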
>
> program: 'PROGRAM'^ l_dec_vars l_dec_blocs l_instrs 'ENDPROGRAM'!;
>
> l_dec_vars: ('VARS'! (dec_var)* 'ENDVARS'!|);
>
> dec_var: IDENT^ constr_type;
>
> l_dec_blocs: (dec_bloc)*;
>
> dec_bloc: 'PROCEDURE'^ header_proc l_dec_vars l_dec_blocs l_instrs
> 'ENDPROCEDURE'!
>
> This is where literals start to get confusing 'PROCEDURE'^ vs PROCEDURE and
> a lexer definition.
I don't see what's confusing here: there's no PROCEDURE, only
'PROCEDURE'; no imaginary tokens, just the token itself in the right
place, right?
>
> constr_type: ('INT' | 'Int' | 'int') | ('BOOL'|'bool')
>
> And this is probably where your bb goes wrong. I would bet that the lexer is
> starting to match 'bool' and gives up. Also, your case insensitivity is
> inconsistent with Int and Bool. These really need to be ( ('I'|'i')('N'|'n')
> and so on.
Ahhh, ok, now I see why it's that way on the "nocase" wiki page... thanks ! :)
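
A sketch of the per-letter approach Jim describes, using fragment rules so each keyword is spelled out once. This is an illustration under assumptions, not the grammar from this thread: the keyword rules are placed before IDENT on the assumption that ANTLR 3 prefers the earlier rule when both match the same text (e.g. "int"), which is worth confirming against the warnings the tool prints.

```antlr
// Parser side: refer to the tokens by name instead of listing each spelling.
constr_type : INT | BOOL ;

// Lexer side: keywords first, so they are preferred over IDENT for keyword-like
// input (assumption about ANTLR 3 rule ordering, see note above).
INT   : I N T ;
BOOL  : B O O L ;

IDENT : 'a'..'z' ('a'..'z'|'A'..'Z'|'0'..'9')* ;

// Fragment rules never produce tokens themselves; they are only building blocks.
fragment B : 'b' | 'B' ;
fragment I : 'i' | 'I' ;
fragment L : 'l' | 'L' ;
fragment N : 'n' | 'N' ;
fragment O : 'o' | 'O' ;
fragment T : 't' | 'T' ;
```

With this, "INT", "Int", "int" and any other mix of cases all lex as INT, which removes the inconsistency between the Int and Bool spellings.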
> expsimple: IDENT^ (func_call|) ('['^ expression ']'!)* ('.'^ IDENT
> ('['^ expression ']'!|))* //XXX: match bb as identifier
>
> You probably are not getting here for input 'bb'.
>
>
> IDENT : 'a'..'z' ('a'..'z'|'A'..'Z'|'0'..'9')* ;
> INTCONST: ('0'..'9')+ ;
>
> WS : (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN;};
> COMMA : ',';
> STRING : '~[]*'; // XXX
>
> Not sure what STRING is meant to match, but you know it matches the literal
> string ~[]* right?
OK, that was just the simple definition from my C++-based grammar; I'll
try to adjust it to the new syntax...
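
For comparison, a sketch of the two readings of that rule. In ANTLR 3 anything inside single quotes is a literal, whereas ~ applied to a set outside quotes means "any character not in the set". The name NOT_BRACKETS and the use of + instead of * are assumptions for illustration, not taken from the original grammar:

```antlr
// As posted: matches exactly the four characters ~ [ ] *
STRING : '~[]*' ;

// Possible intent (assumption): one or more characters other than '[' or ']'.
// '+' is used instead of '*' so the rule can never match the empty string.
// NOT_BRACKETS is a hypothetical name.
NOT_BRACKETS : ~('['|']')+ ;
```

A rule as broad as the second one would overlap with nearly every other lexer rule (IDENT, INTCONST, WS, ...), so it would probably need delimiters or a narrower character set before being usable.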
> So:
>
> 1) Use the case insensitive match routine and specify your lexer tokens in
> UPPERCASE;
OK, done :)
> 2) Take the literals out and make real lexer tokens;
I would like to see a cleaner (non-redundant) way to do that, if
possible. Thanks again for your attention and patience! :)
> Then you will at least be able to see what is going wrong (but my feeling is
> that it won't go wrong when you do that, unless your STRING rule is really
> ~('['|']')*). If you are just trying to pick up any characters that are not
> matched by anything else, just use:
>
> unknown : ANY+;
> ANY : . ;
>
> Jim