[antlr-interest] [newbie] Lexer Confusion

Chris Rebert cvrebert at gmail.com
Fri Jul 4 15:04:02 PDT 2008


On Fri, Jul 4, 2008 at 2:46 PM, UW Student <uw.anon at gmail.com> wrote:
> Johannes Luber wrote:
>>
>> UW Student schrieb:
>>>
>>> Hello,
>>>
>>> I'm having some trouble understanding the behaviour of Antlr's lexer.  I
>>> am quite new to Antlr (having previously focussed on JFlex) so please excuse
>>> me if this is a naive question.
>>>
>>> My grammar is as follows
>>>
>>> grammar Test;
>>>
>>> nonTerm : TERM1 TERM2;
>>>
>>> TERM1 : '..'+;
>>> TERM2 : '.';
>>>
>>> However, when I try to recognize the string '...' (without the quotes),
>>> AntlrWorks indicates a MismatchedTokenException.  (Looking at the generated
>>> code, I believe this is because TERM1 is consuming the third DOT and then
>>> failing to find a fourth.)  I do not understand why this is happening.
>>>
>>> The above example is a toy language that I created to try to isolate the
>>> problem I was having.  My actual lexer looks more like this:
>>>
>>> TERM1 : (' ' | '...')+
>>> TERM2 : '.'
>>>
>>> And I would like ' .' to be lexed as [TERM1, TERM2].
>>>
>>> Any suggestions would be greatly appreciated.
>>>
>>> Thanks,
>>> Andrew
>>>
>>
>> ANTLR doesn't try TERM2 once it decides to try TERM1. This is a limitation
>> of the analysis algorithm. To get your result, you have to try something
>> like:
>>
>> grammar Test2;
>>
>> tokens{
>> TERM2;
>> }
>>
>> nonTerm : TERM1 TERM2;
>>
>>
>> TERM1: '.' ( ('.')=> '.' {$type = TERM2;} ) ;
>>
>> Johannes
>>
>
> Hi Johannes,
>
> Thank you for your prompt response.
>
> I still have a couple of questions:
>
> 1) In my original grammar, how did the lexer decide which rule to attempt
> first?  Did it just pick the one that would result in the longer match?

It chooses the one that comes first in the grammar file, IIRC.
- Chris

>
> 2) Can you please confirm my understanding of your use of a syntactic
> predicate?  On a single DOT, the lexer will return a TERM1 token.  On a
> double DOT, the lexer will return a TERM2 token.  If this is the case, won't
> a triple DOT be lexed as TERM2 TERM1 (rather than the reverse)?
>
> Thanks,
> Andrew
>


More information about the antlr-interest mailing list