[antlr-interest] [newbie] Lexer Confusion

UW Student uw.anon at gmail.com
Fri Jul 4 15:26:25 PDT 2008


Chris Rebert wrote:
> On Fri, Jul 4, 2008 at 2:46 PM, UW Student <uw.anon at gmail.com> wrote:
>> Johannes Luber wrote:
>>> UW Student schrieb:
>>>> Hello,
>>>>
>>>> I'm having some trouble understanding the behaviour of Antlr's lexer.  I
>>>> am quite new to Antlr (having previously focussed on JFlex) so please excuse
>>>> me if this is a naive question.
>>>>
>>>> My grammar is as follows
>>>>
>>>> grammar Test;
>>>>
>>>> nonTerm : TERM1 TERM2;
>>>>
>>>> TERM1 : '..'+;
>>>> TERM2 : '.';
>>>>
>>>> However, when I try to recognize the string '...' (without the quotes),
>>>> AntlrWorks indicates a MismatchedTokenException.  (Looking at the generated
>>>> code, I believe this is because TERM1 is consuming the third DOT and then
>>>> failing to find a fourth.)  I do not understand why this is happening.
>>>>
>>>> The above example is a toy language that I created to try to isolate the
>>>> problem I was having.  My actual lexer looks more like this:
>>>>
>>>> TERM1 : (' ' | '...')+
>>>> TERM2 : '.'
>>>>
>>>> And I would like ' .' to be lexed as [TERM1, TERM2].
>>>>
>>>> Any suggestions would be greatly appreciated.
>>>>
>>>> Thanks,
>>>> Andrew
>>>>
>>> ANTLR doesn't try TERM2 once it decides to try TERM1. This is a limitation
>>> of the analysis algorithm. To get your result, you have to try something
>>> like:
>>>
>>> grammar Test2;
>>>
>>> tokens{
>>> TERM2;
>>> }
>>>
>>> nonTerm : TERM1 TERM2;
>>>
>>>
>>> TERM1: '.' ( ('.')=> '.' {$type = TERM2;} ) ;
>>>
>>> Johannes
>>>
>> Hi Johannes,
>>
>> Thank you for your prompt response.
>>
>> I still have a couple of questions:
>>
>> 1) In my original grammar, how did the lexer decide which rule to attempt
>> first?  Did it just pick the one that would result in the longer match?
> 
> It chooses the one that comes first in the grammar file, IIRC.
> - Chris
> 
>> 2) Can you please confirm my understanding of your use of a syntactic
>> predicate?  On a single DOT, the lexer will return a TERM1 token.  On a
>> double DOT, the lexer will return a TERM2 token.  If this is the case, won't
>> a triple DOT be lexed as TERM2 TERM1 (rather than the reverse)?
>>
>> Thanks,
>> Andrew
>>
> 

Hi Chris,

As far as I know, that is not the case in ANTLR, unlike in many other
lexer tools; I recall reading as much somewhere on antlr.org.  More to
the point, reversing the order of TERM1 and TERM2 in my original grammar
(and/or in the rule nonTerm) produces precisely the same error message.
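
To illustrate why I think rule order is irrelevant here, below is a rough Python model of my hypothesis (it is not the code ANTLR generates, and all of the function names are made up for this sketch). It assumes the lexer picks TERM1 whenever two DOTs are ahead, and that the '..'+ loop then decides to iterate again after seeing only one DOT. A second variant checks two characters before iterating, which would give the lexing I expected.

```python
class MismatchedToken(Exception):
    """Stands in for ANTLR's MismatchedTokenException in this model."""


def lex_term1_one_char_lookahead(text, pos):
    # Hypothesised behaviour: a single '.' is enough to commit to
    # another iteration of the ('..')+ loop.
    start = pos
    while text[pos:pos + 1] == '.':
        if text[pos:pos + 2] != '..':
            raise MismatchedToken("expected '.' at offset %d" % (pos + 1))
        pos += 2
    if pos == start:
        raise MismatchedToken("TERM1 needs at least one '..'")
    return pos


def lex_term1_two_char_lookahead(text, pos):
    # Alternative: only iterate when a full '..' is visible.
    start = pos
    while text[pos:pos + 2] == '..':
        pos += 2
    if pos == start:
        raise MismatchedToken("TERM1 needs at least one '..'")
    return pos


def tokenize(text, lex_term1):
    # Token prediction depends on the input only, not on rule order:
    # two DOTs ahead selects TERM1, a lone DOT selects TERM2.
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos:pos + 2] == '..':
            end = lex_term1(text, pos)
            tokens.append(('TERM1', text[pos:end]))
        elif text[pos] == '.':
            end = pos + 1
            tokens.append(('TERM2', '.'))
        else:
            raise MismatchedToken("unexpected %r at offset %d" % (text[pos], pos))
        pos = end
    return tokens


if __name__ == '__main__':
    print(tokenize('...', lex_term1_two_char_lookahead))
    try:
        tokenize('...', lex_term1_one_char_lookahead)
    except MismatchedToken as exc:
        print('one-character lookahead fails:', exc)
```

In this model the failure happens inside TERM1's loop, after the token type has already been chosen, so swapping the two rules in the grammar file changes nothing.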

-Andrew

