[antlr-interest] Understanding priorities in lexing (newbie)
Gavin Lambert
antlr at mirality.co.nz
Fri Jul 13 01:16:43 PDT 2007
At 11:18 13/07/2007, Terence Parr wrote:
>> Case 1:
>> grammar lex;
>> KEYWORD : 'a' 'b' 'c';
>> OTHER : 'a' | 'b' | 'c';
>> program : ( KEYWORD | OTHER )*
>>
>> Input: "aba" chokes on second a
>>
>> Case 2:
>> grammar lex;
>> kEYWORD : 'a' 'b' 'c';
>> oTHER : 'a' | 'b' | 'c';
>> program : ( kEYWORD | oTHER )*
>>
>> Input: "aba" outputs oTHER oTHER oTHER
>>
>> Same grammar, two different state machines.
>
>Well, more that the first grammar has 2 lexer rules and the
>second has three: one for each char.
Yes, which simply means that the lexer is different. I don't
think that was the point :) For the first example, you should
consider the lexer *only*, and in the second example, the parser
*only*. The other halves are irrelevant.
In the first case, the input to the lexer is the character
sequence 'a','b','a'. In the second case, the input to the
*parser* is the token sequence A,B,A. There is no fundamental
difference here -- the rules for dealing with that input appear
the same and therefore should result in the same output. But they
don't.
As you said, the fundamental DFA engine behind the lexer and
parser is the same, so the difference must lie elsewhere. And
there is one difference: the lexer has one additional implicit
rule, the Tokens rule. The parser has no equivalent; or rather
the closest equivalent it does have (the 'program' rule) contains
a cycle (*), so generates a very different DFA.
Restating the examples a bit (and including the implicit rule)
makes this more obvious:
lexer grammar Lex;
KEYWORD : 'a' 'b' 'c';
OTHER : 'a' | 'b' | 'c';
Tokens : ( KEYWORD | OTHER );
parser grammar Parse;
kEYWORD : A B C;
oTHER : A | B | C;
program : ( kEYWORD | oTHER )*;
The parser "works"; the lexer doesn't. It has to be the Tokens
rule.
(Since the lexer example is explicitly showing an implicit rule,
it's not valid ANTLR -- if you actually want to run it in ANTLR,
you'd need to do one of the following:
1. remove the Tokens rule
2. make the KEYWORD and OTHER rules fragments, and possibly rename
the Tokens rule
Similarly the parser would need to import its token vocab from
somewhere.)
More information about the antlr-interest
mailing list