[antlr-interest] 4.0 daily builds

Sun Jan 1 13:53:57 PST 2012

On 01/01/2012 21:18, Terence Parr wrote:
>
> On Jan 1, 2012, at 1:05 PM, Eric wrote:
>
>> Ter,
>>
>>
>>> The most obvious differences with v4 are:
>>>
>>> * (directly) left recursive grammars that work great for expressions
>>> * modes in the lexer
>>>
>>
>> What are modes and how do they work? I noticed them looking at the antlr
>> parser grammar, but could not find any *.g4 file that gave me any help.
>
> Hi Eric, I'm working on an XML grammar that will demonstrate things nicely. Modes work well when you have multiple languages within the same file, such as inside and outside XML tags. One can also consider the stuff inside of strings to be a different language than the outside. So here is a simple example that treats the two differently:
>
> lexer grammar L;
>
> STRING_START : '"' {pushMode(STRING_MODE); more();} ;
> WS : ' '|'\n' {skip();} ;
>
> mode STRING_MODE;
> STRING : '"' {popMode();} ;
> ANY : . {more();} ;
>
> We start out in default mode, and when the lexer sees a double quote it switches to the string mode and asks for more input. Because we asked for more, the lexer looks for more matches, matches a bunch of stuff to ANY, and keeps looking. It's only when we see the final double quote that we pop the mode and return an actual token.
>
> This mode stuff is ancient as far as I can tell. For example, I see another tool doing it

Thinking back to when I was trying to implement ASN.1, such a nicety 
would have been useful in the parser as well. I guess the complication 
there is that looking ahead is handled in a more complicated way (it's 
not considering every possibility on a regular basis), but I'm not sure 
if that would be a deal breaker.

Then again, it might be possible to do something like that with lexer 
modes and multiple parsers.

What I immediately wonder is, in the special rules for inside the mode, 
can you generate tokens within it, or does a mode always have to return 
a single token? Can the special rules modify the text they're matching 
in terms of the text the eventual token gets? That would be particularly 
useful in the string example, to actually process escape sequences right 
there in the lexer.
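
To make the question concrete, here is a rough sketch of the mechanics described above, written as a small hand-rolled Python simulation rather than anything ANTLR generates. It keeps a mode stack, accumulates text the way more() is described, and emits one STRING token when the closing quote pops the mode. The escape handling is purely my assumption about what "processing escape sequences in the lexer" could look like; the names tokenize and ESCAPES and the ValueError behaviour are hypothetical and not part of ANTLR.

```python
# Hypothetical simulation of a two-mode lexer; NOT ANTLR's implementation.
DEFAULT, STRING_MODE = "DEFAULT", "STRING_MODE"

# Assumed escape table, for illustration only.
ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def tokenize(text):
    """Return a list of (type, text) tokens for quoted strings in text."""
    mode_stack = [DEFAULT]  # top of the stack is the current mode
    pending = []            # text accumulated while no token has been emitted
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if mode_stack[-1] == DEFAULT:
            if ch == '"':
                mode_stack.append(STRING_MODE)  # like pushMode(STRING_MODE)
                pending.append(ch)              # like more(): keep text, no token
            elif ch in " \n":
                pass                            # like skip(): discard whitespace
            else:
                raise ValueError(f"unexpected character {ch!r} at {i}")
            i += 1
        elif ch == '"':
            mode_stack.pop()                    # like popMode()
            pending.append(ch)
            tokens.append(("STRING", "".join(pending)))
            pending = []
            i += 1
        elif ch == "\\" and i + 1 < len(text):
            # Rewrite the matched text: replace the two-character escape
            # with the character it stands for (unknown escapes pass through).
            pending.append(ESCAPES.get(text[i + 1], text[i + 1]))
            i += 2
        else:
            pending.append(ch)                  # like ANY with more()
            i += 1
    if len(mode_stack) > 1:
        raise ValueError("unterminated string")
    return tokens
```

For example, tokenize('"hi there" "x"') gives two STRING tokens with the whitespace between them dropped, and a backslash-n inside the quotes ends up as a real newline in the token text. Whether the real lexer lets an action rewrite the pending text this way is exactly what I'm asking.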

Sam