[antlr-interest] Lexer switching
Terence Parr
parrt at cs.usfca.edu
Sat Apr 9 18:31:22 PDT 2005
Hi Dean, sorry for the delay. Comments below.
On Mar 27, 2005, at 9:38 AM, Dean Tribble wrote:
> Summary: I'm rebuilding the E grammar in antlr
> (http://www.erights.org/e-impls/e-on-e/egrammar/). It contains a few
> occurrences of *recursively* nesting grammars, for which the current
> lexer switching is inadequate. I finally figured out a different and
> simpler way to manage switching lexers that addresses this problem.
So, I've been playing with shared input streams etc... with ANTLR v3.
I looked at the javadoc problem again, which is not nested and so is
simpler than your problem. But my new idea is probably very similar to
what you are doing. What I do is simply treat every "island grammar
input chunk", such as a javadoc comment or a quasi-literal, as a
separate "file". So, when I hit the last token that says to bail out, I
just return EOF. :) Then there is no explicit stack of input streams
etc... For example, in my Java lexer I do this:
JAVADOC : "/**"
    {
    JavadocLexer j = new JavadocLexer(input);
    CommonTokenStream tokens = new CommonTokenStream(j);
    tokens.discardTokenType(JavadocLexer.WS);
    Javadoc p = new Javadoc(tokens);
    p.comment();
    channel = Simple.JAVADOC_CHANNEL;
    }
    ;
which creates a new javadoc lexer/parser duo that feeds off the same
input stream.
Here is the complete combined javadoc spec (well, sufficient for my
braindead example app):
grammar Javadoc;
comment : ( author )* ;
author : "@author" ID {System.out.println("author");} ;
ID : ('a'..'z'|'A'..'Z')+ {System.out.println("id");};
END : "*/" {token = Token.EOFToken;} ;
WS : (' '|'\t'|'\n')+
;
Notice how when it sees "*/" it just says "I'm done": the END rule
consumes the "*/" and returns EOF, which forces the javadoc parser
(currently feeding off the input stream) to exit. Control returns from
method comment(), called from the JAVADOC action, which then finishes
up with
channel = Simple.JAVADOC_CHANNEL;
that returns a JAVADOC token to the java parser but on a different
channel than the normal token stream so it doesn't get in the way.
All the lookahead in the Java token stream is cool as I create a *new*
token stream to race off and do the javadoc island grammar.
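
A minimal sketch of the "island as a separate file" idea, written in
plain Java rather than as ANTLR grammar actions. Every name in it
(IslandSketch, SharedInput, lexJavadoc, lexOuter, onChannel, the channel
constants) is hypothetical and is not part of the ANTLR runtime. It only
illustrates two points from above: both lexers advance one shared
cursor, and the island lexer treats "*/" as its end of input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from ANTLR.
class IslandSketch {
    // One cursor over the text, shared by every lexer that reads it.
    static final class SharedInput {
        final String text;
        int pos = 0;
        SharedInput(String text) { this.text = text; }
        boolean atEnd() { return pos >= text.length(); }
        boolean lookingAt(String s) { return text.startsWith(s, pos); }
    }

    static final int DEFAULT_CHANNEL = 0;
    static final int JAVADOC_CHANNEL = 1;

    // Reads a run of letters/digits at the cursor; returns "" if there is none.
    static String readWord(SharedInput in) {
        int start = in.pos;
        while (!in.atEnd() && Character.isLetterOrDigit(in.text.charAt(in.pos))) in.pos++;
        return in.text.substring(start, in.pos);
    }

    // Island lexer: collects words until the terminator, which it consumes
    // and then reports as end of input by simply returning.
    static List<String> lexJavadoc(SharedInput in) {
        List<String> words = new ArrayList<>();
        while (!in.atEnd()) {
            if (in.lookingAt("*/")) { in.pos += 2; break; }
            String w = readWord(in);
            if (w.isEmpty()) in.pos++;   // skip whitespace, '@', and anything else
            else words.add(w);
        }
        return words;
    }

    // Outer lexer: on the opening delimiter it hands the same cursor to the
    // island lexer, then emits one token on a separate channel and carries on.
    static List<String> lexOuter(SharedInput in) {
        List<String> tokens = new ArrayList<>();
        while (!in.atEnd()) {
            if (in.lookingAt("/**")) {
                in.pos += 3;
                List<String> inner = lexJavadoc(in);
                tokens.add(JAVADOC_CHANNEL + ":JAVADOC" + inner);
            } else {
                String w = readWord(in);
                if (w.isEmpty()) in.pos++;
                else tokens.add(DEFAULT_CHANNEL + ":" + w);
            }
        }
        return tokens;
    }

    // What a parser tuned to one channel would see.
    static List<String> onChannel(List<String> tokens, int channel) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (t.startsWith(channel + ":")) out.add(t.substring(t.indexOf(':') + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = lexOuter(new SharedInput("int /** @author Ter */ x"));
        System.out.println(tokens);
        System.out.println(onChannel(tokens, DEFAULT_CHANNEL));
    }
}
```

With the input in main, the outer token list comes out as
[0:int, 1:JAVADOC[author, Ter], 0:x], and a consumer that only looks at
channel 0 sees [int, x]. No stack of input streams is kept; the call
stack does that job.
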
Notes on my current runtime:
A Parser object is 9 bytes (well, at least my part): input stream ptr,
following token stack, and error mode boolean.
A Lexer object is 8 bytes: input stream ptr and current token ptr.
Any number of recognizers can point at the same input stream and, for
stuff like includes, you can create new input streams on the fly and
start reading from them.
The lexers and parsers have NO STATE other than where to get input.
Pretty sweet.
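
To make the "no state other than where to get input" point concrete,
here is a small sketch in plain Java. The names (IncludeSketch, Input,
WordLexer, expand) and the "include" keyword are assumptions made for
illustration and do not reflect the v3 runtime. It shows two lexers
pointed at the same input, and an include handled by creating a new
input on the fly. It assumes well-formed input: every included name
exists and there are no include cycles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from ANTLR.
class IncludeSketch {
    // The input owns the position, so lexers stay stateless.
    static final class Input {
        final String text;
        int pos = 0;
        Input(String text) { this.text = text; }
    }

    // A lexer whose only field is a reference to its input.
    static final class WordLexer {
        private final Input in;
        WordLexer(Input in) { this.in = in; }

        // Returns the next whitespace-delimited word, or null at end of input.
        String next() {
            while (in.pos < in.text.length() && Character.isWhitespace(in.text.charAt(in.pos))) in.pos++;
            if (in.pos >= in.text.length()) return null;
            int start = in.pos;
            while (in.pos < in.text.length() && !Character.isWhitespace(in.text.charAt(in.pos))) in.pos++;
            return in.text.substring(start, in.pos);
        }
    }

    // Expands "include NAME" by reading NAME through a fresh input and lexer.
    static List<String> expand(String name, Map<String, String> files) {
        List<String> out = new ArrayList<>();
        WordLexer lexer = new WordLexer(new Input(files.get(name)));
        for (String w = lexer.next(); w != null; w = lexer.next()) {
            if (w.equals("include")) out.addAll(expand(lexer.next(), files));
            else out.add(w);
        }
        return out;
    }

    public static void main(String[] args) {
        Input shared = new Input("p q");
        System.out.println(new WordLexer(shared).next()); // first lexer reads from the shared input
        System.out.println(new WordLexer(shared).next()); // second lexer continues where the first stopped
        System.out.println(expand("main", Map.of("main", "a include lib b", "lib", "x y")));
    }
}
```

Because the position lives in the input object, throwing a lexer away
and creating another one over the same input loses nothing.
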
> Context: E is an expression language, in which one of the expressions
> is a "quasi-literal". A quasi-literal is similar to (but more general
> than) a Perl string (or a Lisp quasi-list) in that it can contain
> $-escaped expressions in E. Because these are arbitrary E
> expressions, they can recursively contain further quasi-literals. A
> couple of examples:
>
> print(`The value of X is $x`)
>
> print(`Name: ${if (title.isEmpty()) { `${name}'s book` } else
> {title}} date: $date`)
>
> Backquote introduces a quasi-literal (and causes a switch to the
> quasi-lexer). Within a quasi-literal, '$' escapes a nested expression
> (and switches to the E lexer). If it is just an identifier, then no
> braces are required. If it is a more complex expression, then braces
> are required. Note, however, that since E also uses braces, simply
> encountering a brace is not the right reason to switch back to the
> quasi lexer from the E lexer (and thus the lexer by itself cannot
> manage the transitions back).
Ack, this part is really nasty. If we can't use the '}' as a signal to
switch back (like the "*/" in javadoc), the main E lexer needs to know
something like the nesting level, doesn't it? If it sees a dangling
'}', it needs to return EOF signaling the end of E parsing and a return
to the quasi-literal. Ok, I just made my simple Java-like grammar
allow java inside javadoc comments like this:
/** @author foo {z=3;} {yy=33;}*/
Note that at the }*/ it must exit the Java component and reuse that
char to terminate the javadoc comment. I just needed to add this in my
normal java-mode lexing:
LCURLY : '{' {Simple.nesting++;}
    ;
RCURLY : '}' {if ( Simple.nesting<=0 ) token=Token.EOFToken; else Simple.nesting--;}
    ;
In my javadoc lexer I needed:
JAVA : '{'
    {
    System.out.println("enter java escape");
    SimpleLexer lex = new SimpleLexer(input);
    CommonTokenStream tokens = new CommonTokenStream(lex);
    //System.out.println("tokens="+tokens);
    Simple parser = new Simple(tokens);
    parser.statement();
    channel=EMBEDDED_JAVA_CHANNEL;
    }
    ;
Pretty slick, eh? So the java lexer invokes the javadoc lexer which
invokes the java lexer again. The key seems to be returning an EOF
token when you see the "final" token. Then again, if you have an
action that asks for LT(100) you might be in trouble. In 3.0, the
token stream sucks up all tokens before parsing so the EOF from the '}'
will make it stop "sucking" tokens from the stream. Any LT(100) action
in the parser will simply return EOF. The only weird thing is that the
embedded java after the @author will be processed *before* the author
is processed because the author is processed by the javadoc parser--the
embedded java is handled during javadoc *lexing*.
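
The brace counting and the ordering quirk can be sketched in plain Java
as follows. All names (NestingSketch, Input, lexJava, lexJavadoc, run)
are hypothetical, and the sketch keeps the brace depth as a local
variable per lexer invocation instead of the static Simple.nesting
counter used in the grammar above. It covers only the javadoc-to-java
direction, and it buffers every javadoc token before "parsing", which is
an assumption modelled on the description of the 3.0 token stream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from ANTLR.
class NestingSketch {
    static final class Input {
        final String text;
        int pos = 0;
        Input(String text) { this.text = text; }
        boolean atEnd() { return pos >= text.length(); }
    }

    // Java mode: reads up to and including the first '}' that has no
    // matching '{' in this invocation, and treats it as end of input.
    static String lexJava(Input in) {
        int depth = 0;
        int start = in.pos;
        while (!in.atEnd()) {
            char c = in.text.charAt(in.pos++);
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                if (depth == 0) return in.text.substring(start, in.pos - 1);
                depth--;
            }
        }
        return in.text.substring(start); // ran out of input without a dangling '}'
    }

    // Javadoc mode: buffers all tokens up to the terminator. A '{' hands the
    // shared input to Java mode right away, during lexing.
    static List<String> lexJavadoc(Input in, List<String> log) {
        List<String> tokens = new ArrayList<>();
        while (!in.atEnd() && !in.text.startsWith("*/", in.pos)) {
            char c = in.text.charAt(in.pos);
            if (c == '{') {
                in.pos++;
                log.add("java:" + lexJava(in));
                tokens.add("JAVA");
            } else if (Character.isLetter(c) || c == '@') {
                int start = in.pos++;
                while (!in.atEnd() && Character.isLetter(in.text.charAt(in.pos))) in.pos++;
                tokens.add(in.text.substring(start, in.pos));
            } else {
                in.pos++; // whitespace and anything else is skipped
            }
        }
        if (!in.atEnd()) in.pos += 2; // consume the terminator
        return tokens;
    }

    // Lexes the whole comment body first, then "parses" the author tags.
    static List<String> run(String commentBody) {
        List<String> log = new ArrayList<>();
        List<String> tokens = lexJavadoc(new Input(commentBody), log);
        for (int i = 0; i + 1 < tokens.size(); i++) {
            if (tokens.get(i).equals("@author")) log.add("author:" + tokens.get(i + 1));
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(" @author foo {z=3;} {yy=33;}*/"));
        System.out.println(run("{if(a){b;}}*/"));
    }
}
```

For the first input the log reads [java:z=3;, java:yy=33;, author:foo]:
both embedded Java chunks are handled before the author tag, even though
the tag comes first in the text. The second input shows that a balanced
inner pair of braces does not end Java mode; only the dangling '}' does.
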
Whew! I think I've convinced myself that v3, at least, will handle
this nicely (since I have an existence proof).
Does this sound like it will solve your problem?
Ter
PS It may be time to get you using the 3.0 version ;) It's *almost*
useful ;)
--
CS Professor & Grad Director, University of San Francisco
Creator, ANTLR Parser Generator, http://www.antlr.org
Cofounder, http://www.jguru.com