[antlr-interest] [newbie] Lexer Confusion

Fri Jul 4 15:30:09 PDT 2008

Pay no attention to the newbie (me) in the corner... :)
- Chris

On Fri, Jul 4, 2008 at 3:26 PM, UW Student <uw.anon at gmail.com> wrote:
> Chris Rebert wrote:
>>
>> On Fri, Jul 4, 2008 at 2:46 PM, UW Student <uw.anon at gmail.com> wrote:
>>>
>>> Johannes Luber wrote:
>>>>
>>>> UW Student wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I'm having some trouble understanding the behaviour of Antlr's lexer.
>>>>> I am quite new to Antlr (having previously focussed on JFlex) so please
>>>>> excuse me if this is a naive question.
>>>>>
>>>>> My grammar is as follows
>>>>>
>>>>> grammar Test;
>>>>>
>>>>> nonTerm : TERM1 TERM2;
>>>>>
>>>>> TERM1 : '..'+;
>>>>> TERM2 : '.';
>>>>>
>>>>> However, when I try to recognize the string '...' (without the quotes),
>>>>> AntlrWorks indicates a MismatchedTokenException.  (Looking at the
>>>>> generated code, I believe this is because TERM1 is consuming the third
>>>>> DOT and then failing to find a fourth.)  I do not understand why this is
>>>>> happening.
>>>>>
>>>>> The above example is a toy language that I created to try to isolate
>>>>> the problem I was having.  My actual lexer looks more like this:
>>>>>
>>>>> TERM1 : (' ' | '...')+;
>>>>> TERM2 : '.';
>>>>>
>>>>> And I would like ' .' to be lexed as [TERM1, TERM2].
>>>>>
>>>>> Any suggestions would be greatly appreciated.
>>>>>
>>>>> Thanks,
>>>>> Andrew
>>>>>
>>>> ANTLR doesn't try TERM2 once it decides to try TERM1. This is a
>>>> limitation of the analysis algorithm. To get your result, you have to
>>>> try something like:
>>>>
>>>> grammar Test2;
>>>>
>>>> tokens{
>>>> TERM2;
>>>> }
>>>>
>>>> nonTerm : TERM1 TERM2;
>>>>
>>>>
>>>> TERM1: '.' ( ('.')=> '.' {$type = TERM2;} )? ;
>>>>
>>>> Johannes
>>>>
>>> Hi Johannes,
>>>
>>> Thank you for your prompt response.
>>>
>>> I still have a couple of questions:
>>>
>>> 1) In my original grammar, how did the lexer decide which rule to attempt
>>> first?  Did it just pick the one that would result in the longer match?
>>
>> It chooses the one that comes first in the grammar file, IIRC.
>> - Chris
>>
>>> 2) Can you please confirm my understanding of your use of a syntactic
>>> predicate?  On a single DOT, the lexer will return a TERM1 token.  On a
>>> double DOT, the lexer will return a TERM2 token.  If this is the case,
>>> won't a triple DOT be lexed as TERM2 TERM1 (rather than the reverse)?
>>>
>>> Thanks,
>>> Andrew
>>>
>>
>
> Hi Chris,
>
> As far as I know, unlike in many lexer tools, that is not the case in Antlr.
> I recall reading that somewhere on antlr.org.  More to the point, reversing
> the order of TERM1 and TERM2 in my original grammar (and/or in the rule
> nonTerm) results in precisely the same error message.
>
> -Andrew
>
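
The failure Andrew describes in his first message (TERM1 consuming the
third DOT and then failing to find a fourth) can be reproduced with a small
model of the loop in TERM1 : '..'+. What follows is only a sketch in Python,
not ANTLR's generated code: the function names and the MismatchedToken class
are hypothetical, and it assumes the decision to start another iteration
looks at a single character, which is what the reported behaviour suggests.

```python
class MismatchedToken(Exception):
    """Raised when the model lexer commits to a match it cannot finish."""


def lex_term1_one_char_lookahead(text, pos=0):
    """Model of TERM1 : '..'+ where the loop decision peeks at one character.

    Returns the offset just past the matched text, or raises
    MismatchedToken if a committed iteration cannot be completed.
    """
    iterations = 0
    while True:
        # Loop decision: only the next character is inspected.
        if text[pos:pos + 1] != '.':
            break
        # Committed to another '..': both characters must now be present.
        for _ in range(2):
            if text[pos:pos + 1] != '.':
                raise MismatchedToken(f"expected '.' at offset {pos}")
            pos += 1
        iterations += 1
    if iterations == 0:
        raise MismatchedToken(f"expected '..' at offset {pos}")
    return pos


def lex_term1_two_char_lookahead(text, pos=0):
    """Same rule, but the loop decision peeks at two characters."""
    start = pos
    while text[pos:pos + 2] == '..':
        pos += 2
    if pos == start:
        raise MismatchedToken(f"expected '..' at offset {pos}")
    return pos
```

On the input '...', the one-character version matches the first two dots,
sees a third, commits to another iteration, and raises at offset 3 because
there is no fourth dot. The two-character version stops at offset 2 and
leaves the last '.' available for TERM2. The order of the rules in the
grammar plays no part in either model, which is consistent with Andrew's
observation that swapping TERM1 and TERM2 does not change the error.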