[antlr-interest] Lexer switching

Sat Apr 9 18:31:22 PDT 2005

Hi Dean, sorry for the delay.  Comments below.

On Mar 27, 2005, at 9:38 AM, Dean Tribble wrote:

> Summary: I'm rebuilding the E grammar in antlr 
> (http://www.erights.org/e-impls/e-on-e/egrammar/).  It contains a few 
> occurrences of *recursively* nesting grammars, for which the current 
> lexer switching is inadequate.  I finally figured out a different and 
> simpler way to manage switching lexers that addresses this problem.

So, I've been playing with shared input streams, etc., with ANTLR v3.
I looked at the javadoc problem again, which is non-nested and so
simpler than your problem.  But my new idea is probably very similar to
what you are doing.  What I do is simply treat every "island grammar
input chunk", such as a javadoc comment or a quasi-literal, as a
separate "file".  So, when I hit the last token that says to bail out,
I just return EOF. :)  Then there is no explicit stack of input
streams, etc.  For example, in my Java lexer I do this:

JAVADOC : "/**"
           {
             JavadocLexer j = new JavadocLexer(input);
             CommonTokenStream tokens = new CommonTokenStream(j);
             tokens.discardTokenType(JavadocLexer.WS);
             Javadoc p = new Javadoc(tokens);
             p.comment();
             channel = Simple.JAVADOC_CHANNEL;
           }
         ;

which creates a new javadoc lexer/parser duo that feeds off the same 
input stream.

Here is the complete combined javadoc spec (well, sufficient for my 
braindead example app):

grammar Javadoc;

comment : ( author )* ;

author  : "@author" ID {System.out.println("author");} ;

ID      : ('a'..'z'|'A'..'Z')+ {System.out.println("id");};

END     : "*/" {token = Token.EOFToken;} ;

WS      : (' '|'\t'|'\n')+
         ;

Notice how, when it sees "*/", it just says "I'm done": it consumes 
the "*/" and forces the javadoc parser (currently feeding off the input 
stream) to exit.  Control returns from method comment(), called from 
the JAVADOC action, which then finishes up with:

channel = Simple.JAVADOC_CHANNEL;

That returns a JAVADOC token to the Java parser, but on a different 
channel than the normal token stream, so it doesn't get in the way.

All the lookahead in the Java token stream is cool as I create a *new* 
token stream to race off and do the javadoc island grammar.
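
A minimal plain-Java sketch of the same idea, with no ANTLR runtime
involved.  CharCursor, IslandSketch, the channel numbers and the
"channel:text" token strings are all hypothetical names made up for
illustration, not part of the v3 runtime.  It only shows the shape of
the trick: the island scanner shares the outer scanner's cursor and
simply stops, as if at EOF, when it sees "*/".

```java
import java.util.ArrayList;
import java.util.List;

/** Shared character cursor: the scanners below keep no state except a reference to it. */
class CharCursor {
    final String text;
    int pos = 0;
    CharCursor(String text) { this.text = text; }
    boolean atEnd() { return pos >= text.length(); }
    boolean lookingAt(String s) { return text.startsWith(s, pos); }
    char peek() { return text.charAt(pos); }
}

class IslandSketch {
    static final int DEFAULT_CHANNEL = 0;
    static final int JAVADOC_CHANNEL = 1; // hypothetical channel number

    /** Island scanner: reads words until it sees the terminator, then stops as if at EOF. */
    static List<String> scanJavadoc(CharCursor in) {
        List<String> words = new ArrayList<>();
        while (!in.atEnd()) {
            if (in.lookingAt("*/")) { in.pos += 2; break; }  // terminator plays the role of EOF
            if (in.peek() == '@' || Character.isLetter(in.peek())) {
                int start = in.pos++;
                while (!in.atEnd() && Character.isLetter(in.peek())) in.pos++;
                words.add(in.text.substring(start, in.pos));
            } else {
                in.pos++;                                    // skip whitespace and stray characters
            }
        }
        return words;
    }

    /** Outer scanner: hands the same cursor to the island scanner, then carries on after it. */
    static List<String> scanOuter(CharCursor in) {
        List<String> tokens = new ArrayList<>();
        while (!in.atEnd()) {
            if (in.lookingAt("/**")) {
                in.pos += 3;
                List<String> doc = scanJavadoc(in);          // no stack of input streams needed
                tokens.add(JAVADOC_CHANNEL + ":" + String.join(" ", doc));
            } else if (Character.isLetterOrDigit(in.peek())) {
                int start = in.pos;
                while (!in.atEnd() && Character.isLetterOrDigit(in.peek())) in.pos++;
                tokens.add(DEFAULT_CHANNEL + ":" + in.text.substring(start, in.pos));
            } else if (Character.isWhitespace(in.peek())) {
                in.pos++;
            } else {
                tokens.add(DEFAULT_CHANNEL + ":" + in.peek());
                in.pos++;
            }
        }
        return tokens;
    }
}
```

For the input "/** @author ter */ int x;", scanOuter yields one token on
the javadoc channel ("1:@author ter") followed by "0:int", "0:x" and
"0:;" on the default channel.  The outer scanner resumes exactly where
the island scanner left the cursor.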

Notes on my current runtime:

A Parser object is 9 bytes (well, at least my part): input stream ptr, 
following token stack, and error mode boolean.
A Lexer object is 8 bytes: input stream ptr and current token ptr.

Any number of recognizers can point at the same input stream and, for 
stuff like includes, you can create new input streams on the fly and 
start reading from them.

The lexers and parsers have NO STATE other than where to get input.  
Pretty sweet.

> Context: E is an expression language, in which one of the expressions 
> is a "quasi-literal".  A quasi-literal is similar to (but more general 
> than) a Perl string (or a Lisp quasi-list) in that it can contain 
> $-escaped expressions in E.  Because these are arbitrary E 
> expressions, they can recursively contain further quasi-literals.  A 
> couple of examples:
>
>    print(`The value of X is $x`)
>
>    print(`Name: ${if (title.isEmpty()) { `${name}'s book` } else 
> {title}} date: $date`)
>
> Backquote introduces a quasi-literal (and causes a switch to the 
> quasi-lexer).  Within a quasi-literal, '$' escapes a nested expression 
> (and switches to the E lexer).  If it is just an identifier, then no 
> braces are required.  If it is a more complex expression, then braces 
> are required.  Note, however, that since E also uses braces, simply 
> encountering a brace is not the right reason to switch back to the 
> quasi lexer from the E lexer (and thus the lexer by itself cannot 
> manage the transitions back).

Ack, this part is really nasty.  If we can't use the '}' as a signal to 
switch back (like the "*/" in javadoc), the main E lexer needs to know 
something like the nesting level, doesn't it?  If it sees a dangling 
'}', it needs to return EOF signaling the end of E parsing and a return 
to the quasi-literal.  Ok, I just made my simple Java-like grammar 
allow java inside javadoc comments like this:

/** @author foo {z=3;} {yy=33;}*/

Note that at the }*/ it must exit the Java component and reuse that 
char to terminate the javadoc comment.  I just needed to add this in my 
normal java-mode lexing:

LCURLY  : '{' {Simple.nesting++;}
         ;
RCURLY  : '}' {if ( Simple.nesting<=0 ) token=Token.EOFToken;}
         ;
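
The brace-counting idea can be sketched in plain Java as below.
BraceDepthSketch and findDanglingBrace are hypothetical names.  The
sketch also decrements the counter on a matched '}', which is an
assumption that the RCURLY rule above does not show but that nested
blocks inside the escape would seem to need.  It ignores braces inside
string and character literals.

```java
class BraceDepthSketch {
    /**
     * Returns the index of the first '}' at or after 'from' that has no
     * matching '{' in the scanned range, or -1 if there is none. That
     * index is where an embedded lexer would report EOF.
     */
    static int findDanglingBrace(String text, int from) {
        int nesting = 0;
        for (int i = from; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '{') {
                nesting++;
            } else if (c == '}') {
                if (nesting <= 0) return i; // dangling brace: end of the island
                nesting--;                  // assumption: matched brace closes one level
            }
        }
        return -1;
    }
}
```

Scanning starts just after the '{' that opened the escape, so that
opening brace is never counted and the brace that closes the escape is
the first one seen with the counter at zero.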

In my javadoc lexer I needed:

JAVA    :   '{'
             {
             System.out.println("enter java escape");
             SimpleLexer lex = new SimpleLexer(input);
             CommonTokenStream tokens = new CommonTokenStream(lex);
             //System.out.println("tokens="+tokens);
             Simple parser = new Simple(tokens);
             parser.statement();
             channel=EMBEDDED_JAVA_CHANNEL;
             }
         ;

Pretty slick, eh?  So the java lexer invokes the javadoc lexer which 
invokes the java lexer again.  The key seems to be returning an EOF 
token when you see the "final" token.  Then again, if you have an 
action that asks for LT(100) you might be in trouble.  In 3.0, the 
token stream sucks up all tokens before parsing so the EOF from the '}' 
will make it stop "sucking" tokens from the stream.  Any LT(100) action 
in the parser will simply return EOF.  The only weird thing is that the 
embedded java after the @author will be processed *before* the author 
is processed because the author is processed by the javadoc parser--the 
embedded java is handled during javadoc *lexing*.
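
A small plain-Java sketch of that mutual recursion and of the ordering
quirk.  It uses no ANTLR runtime; NestedIslandSketch, its methods and
the event strings are hypothetical, and the per-call nesting counter is
a simplification of the static Simple.nesting used above.  The javadoc
scan collects all of its tokens before the stand-in parser runs, so the
embedded Java is entered before any author is reported.

```java
import java.util.ArrayList;
import java.util.List;

class NestedIslandSketch {
    private final String text;                       // shared input
    private int pos = 0;                             // shared cursor
    final List<String> events = new ArrayList<>();   // order in which things happen

    NestedIslandSketch(String text) { this.text = text; }

    /** Java-mode scan: returns at a dangling '}' (its "EOF") or at the real end of input. */
    void lexJava() {
        int nesting = 0; // assumption: one counter per invocation
        while (pos < text.length()) {
            if (text.startsWith("/**", pos)) {
                pos += 3;
                parseJavadoc(lexJavadoc());          // lex the whole island, then parse it
            } else {
                char c = text.charAt(pos++);
                if (c == '{') {
                    nesting++;
                } else if (c == '}') {
                    if (nesting <= 0) return;        // dangling brace: leave Java mode
                    nesting--;
                }
            }
        }
    }

    /** Javadoc-mode scan: gathers every token up to the closing delimiter before parsing. */
    List<String> lexJavadoc() {
        List<String> tokens = new ArrayList<>();
        while (pos < text.length()) {
            if (text.startsWith("*/", pos)) { pos += 2; break; }  // END acts as EOF
            char c = text.charAt(pos);
            if (c == '{') {
                pos++;
                events.add("enter java escape");
                lexJava();                                        // back into Java mode
            } else if (c == '@' || Character.isLetter(c)) {
                int start = pos++;
                while (pos < text.length() && Character.isLetter(text.charAt(pos))) pos++;
                tokens.add(text.substring(start, pos));
            } else {
                pos++;                                            // whitespace and the rest
            }
        }
        return tokens;
    }

    /** Parser stand-in for: comment : ( author )* ;  author : "@author" ID ; */
    void parseJavadoc(List<String> tokens) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            if (tokens.get(i).equals("@author")) events.add("author " + tokens.get(++i));
        }
    }

    static List<String> run(String source) {
        NestedIslandSketch s = new NestedIslandSketch(source);
        s.lexJava();
        return s.events;
    }
}
```

Running it on "/** @author foo {z=3;} {yy=33;}*/" records "enter java
escape" twice and only then "author foo", which matches the ordering
described above.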

Whew!  I think I've convinced myself that v3, at least, will handle 
this nicely (since I have an existence proof).

Does this sound like it will solve your problem?

Ter
PS  It may be time to get you using the 3.0 version ;)  It's *almost* 
useful ;)
--
CS Professor & Grad Director, University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com