[antlr-interest] Lexer switching
Dean Tribble
tribble at e-dean.com
Sun Mar 27 09:38:54 PST 2005
Summary: I'm rebuilding the E grammar in antlr
(http://www.erights.org/e-impls/e-on-e/egrammar/). It contains a few
occurrences of *recursively* nesting grammars, for which the current
lexer switching is inadequate. I finally figured out a different and
simpler way to manage switching lexers that addresses this problem.
Context: E is an expression language, in which one of the expressions is
a "quasi-literal". A quasi-literal is similar to (but more general
than) a Perl string (or a Lisp quasi-list) in that it can contain
$-escaped expressions in E. Because these are arbitrary E expressions,
they can recursively contain further quasi-literals. A couple of examples:
print(`The value of X is $x`)
print(`Name: ${if (title.isEmpty()) { `${name}'s book` } else
{title}} date: $date`
Backquote introduces a quasi-literal (and causes a switch to the
quasi-lexer). Within a quasi-literal, '$' escapes a nested expression
(and switches to the E lexer). If it is just an identifier, then no
braces are required. If it is a more complex expression, then braces
are required. Note, however, that since E also uses braces, simply
encountering a brace is not the right reason to switch back to the quasi
lexer from the E lexer (and thus the lexer by itself cannot manage the
transitions back). BTW The more general aspects of quasi-literals are
outside the parsing realm, so I won't go into those, but they are really
cool :-)
I initially implemented all this with multiple parsers and corresponding
grammars, and blithely switched lexers in the parser productions. This
did not work at all because of lookahead and predicates: lexer actions
to switch the current lexer will happen during parser lookahead (whether
it was triggered by a parser predicate or not), whereas parser actions
by default do not. Even if you make them happen in predicates, they
still happen when the parser gets to the relevant token, which because
of lookahead, is already after the wrong (unswitched) lexer has already
consumed tokens past where you wanted to switch. Thus, all lexer
switching needs to happen purely within lexers. There is not sufficient
support to do that if the exit of the nested lexer must be driven by the
grammar (and thus the parser) rather than by simple lexing.
After several rounds, it seems that implementing lexer switching as a
TokenStream (via TokenStreamSelector) is not the right place. It is
less accessible to the lexer or parser, cannot participate in the
lookahead process, has less access to the line numbers, etc. I instead
merged that concept with the TokenBuffer, so a TokenMultiBuffer can
select from different streams of tokens. The class supports a stack of
nesting lexers. It also directly supports enter and exit (e.g., of
braces) so that a nested grammar can be exited when encountering a
close-"brace" that does not match an open brace in that grammar. I had
started support for participation in the parser predicates, but by then
had figured that brace management was enough to keep all lexer switching
under control of the lexers, so that semi-support is removed.
Even without the support for the parser changing the lexer, I found that
using a special TokenBuffer to handle lexer changing was simpler than
using a stream selector. This is mainly because it adds no layers in
the process of getting tokens, so debugging is easier. Because it does
not add per-token overhead, it could perfectly well be *always* wired
into the framework. Then, having multiple lexers would be special, but
would not be an awkward integration. In an informal sampling of
languages, I believe that multiple-lexers will be increasing, and
already would have been used more if it was more integral. Thus, I'd
like to see the token-buffer by default support it (fold the
TokenMultiBuffer support into TokenBuffer), and make it so that adding a
lexer will arrange for that lexer to have the shared input state
(instead of having to arrange that explicitly). This will also mean
that lexers will by default be able to switch lexers, and you won't need
to add language-specific code to do so (e.g., add a setSelector method,
etc.).
-------------- next part --------------
package antlr;
/**
* Combines and interleaves multiple streams of tokens together into a single stream.
* Actions during token recognition change the token stream to be provided. This
* simultaneously supports mark/rewind across changes in the token stream, recursive
* use of token streams (for languages that nest in each other), and tracking of
* such nesting (e.g., if a nested language is terminated by a bracket that is also
* used in an outer language).
*
* <p>
*
* @author Dean Tribble
*
* Software rights: http://www.antlr.org/license.html
*
* @see antlr.TokenBuffer
* @see antlr.Token
* @see antlr.TokenQueue
*/
public class TokenMultiBuffer extends TokenBuffer {
protected String[] myNames;
protected TokenStream[] myInputs;
protected MarkRecord myLayers = null;
protected int myEnterCount = 0;
/** Create a token buffer */
public TokenMultiBuffer(TokenStream input_) {
this(new String[]{"base"}, new TokenStream[]{input_});
}
/** Create a token buffer will multiple token streams. */
public TokenMultiBuffer(String[] names, TokenStream[] inputs) {
myNames = names;
myInputs = inputs;
input = myInputs[0];
}
/** Nest into a named lexer. */
public void push(String name) {
myLayers = new MarkRecord(myEnterCount, input, myLayers);
myEnterCount = 0;
select(findInput(name));
trace("push");
}
/** Exit the current lexer and continue lexing with the outer lexer. */
public void pop() {
trace("pop");
//TODO check for myNesting==0?
input = myLayers.stream;
myEnterCount = myLayers.enterCount;
myLayers = myLayers.next;
}
/** Record that we entered the kind of brace that will cause this lexer to exit.
* This could be called when parsing multiple brace-typed, or generalized to enforce
* specific matching (e.g., so that if both paren and bracket were supported, it
* would throw a parsing exception if the matching brace was not on the top of the
* brace stack). */
public void enterBrace() {
trace("enter");
myEnterCount++;
}
/** Exit from a nested brace. If the corresponding open brace was not lexed by this lexer
* (or more specifically, by the lexing of this layer if the grammars recursively nest), then
* this lexer is also finished lexing, so pop it and continue lexing in the outer lexer.
*/
public void exitBrace() {
trace("exit");
if (myEnterCount <= 0) {
pop();
} else {
myEnterCount--;
}
}
public void select(int streamNum) {
input = myInputs[streamNum];
}
public void select(String name) {
input = myInputs[findInput(name)];
trace("select");
}
private int findInput(String name) {
for (int i = 0, max = myInputs.length; i<max; i++) {
if (myNames[i].equals(name)) {
return i;
}
}
return -1; //not a great default, but ensures an error
}
private String findName(TokenStream ts) {
for (int i = 0, max = myInputs.length; i<max; i++) {
if (myInputs[i] == ts) {
return myNames[i];
}
}
return "unknown";
}
private class MarkRecord {
final int enterCount;
final TokenStream stream;
final MarkRecord next;
MarkRecord(int count, TokenStream s, MarkRecord n) {
enterCount = count;
stream = s;
next = n;
}
};
private void trace(String header) {
// System.err.print(header + " [" + findName(input));
// for (MarkRecord r = myLayers; r != null; r = r.next) {
// System.err.print(", " + findName(r.stream));
// }
// System.err.println("]");
// Thread.dumpStack();
}
}
More information about the antlr-interest
mailing list