[antlr-interest] idea: lexer "sync points"
J Chapman Flack
jflack at math.purdue.edu
Thu Feb 7 10:08:25 PST 2008
In some previous thread I glimpsed a statement passing by suggesting
that ANTLR 3 currently generates a lexer that tokenizes the entire
input before parsing begins.
1. Did I hear that right? (If not, can someone give a quick summary of
the right way to understand how an ANTLR 3 lexer really does manage
its input character stream and output token stream buffers? In that
case, some of the rest of this message may become moot.)
2. If I did, that raises a couple of concerns for me. The
first is already often mentioned, that it is hard to build a
stateful lexer taking cues from the parser, if the lexer runs
arbitrarily far ahead.
The second concern is one I haven't seen mentioned here lately, so
here it is: what if I want my program, with an ANTLR-parsed input
language, to work in a coprocess setting?
Here is a simple example, using the Korn shell on Solaris 9, doing a
calculation too big for ksh itself:
$ bc |&
$ print -p "sqrt(12345678987654321)"
$ read -p result
$ print $result
111111111
$ print -p "sqrt(12345679012345678987654320987654321)"
$ read -p result
$ print $result
111111111111111111
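The same one-subprocess, many-exchanges round trip can be sketched in Java. This is a hypothetical helper (the Coprocess class and ask method are my names, not any library's), and "cat" stands in for bc purely so the sketch stays portable; the discipline is the same either way:

```java
import java.io.*;

// Start one long-lived child process and reuse it for many
// request/response exchanges over its stdin/stdout, the way the
// ksh |& coprocess above reuses a single bc.
class Coprocess implements Closeable {
    private final Process proc;
    private final BufferedWriter toChild;
    private final BufferedReader fromChild;

    Coprocess(String... cmd) {
        try {
            proc = new ProcessBuilder(cmd).start();
            toChild = new BufferedWriter(
                new OutputStreamWriter(proc.getOutputStream()));
            fromChild = new BufferedReader(
                new InputStreamReader(proc.getInputStream()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // One round trip: write a request line, flush it (otherwise both
    // sides can sit waiting forever), then read the one-line reply.
    String ask(String line) {
        try {
            toChild.write(line);
            toChild.newLine();
            toChild.flush();
            return fromChild.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override public void close() {
        try { toChild.close(); } catch (IOException e) { /* best effort */ }
        proc.destroy();
    }
}
```

Note the flush() after each request: a helper like this only works against a child that, like bc or cat, answers each short input without hoarding it.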
With this technique I can write a shell script that does any amount
of arbitrary-precision arithmetic but only spends the cost of
starting a bc subprocess once (in a Unix-like environment that's a
significant cost, so reusing one process is a win). The
technique generalizes; any program can use it, but not every program
can be used as a subprocess--sometimes for good reasons, sometimes
not. Here's one that won't work, for no good reason:
$ tr '[a-z]' '[A-Z]' |&
$ print -p "i'm not shouting"
$ read -p result
[deadlock]
The version of tr on this system uses a braindamaged input buffering
strategy. Its read() call has completed and given it a perfectly
good short result to operate on--pipes and the I/O calls are
specifically designed to work that way--but the programmer has
insisted on blindly filling an arbitrarily-sized buffer before doing
anything useful. The result is a program that is unusable as a
coprocess.
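The two buffering strategies can be contrasted in Java with a hypothetical upcase filter (the class and method names are mine, for illustration only):

```java
import java.io.*;

class Upcase {
    // Coprocess-safe: act on whatever each read() delivers -- possibly
    // a short count -- and push it downstream immediately.
    static void filterIncremental(InputStream in, OutputStream out) {
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {   // short reads are normal
                for (int i = 0; i < n; i++)
                    buf[i] = (byte) Character.toUpperCase((char) buf[i]);
                out.write(buf, 0, n);
                out.flush();          // the partial result goes out now
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The tr-style mistake: refuse to do anything until the buffer is
    // full.  Fed from a pipe, this loop blocks in read() even though a
    // usable short result arrived long ago -- the deadlock shown above.
    static void filterFillFirst(InputStream in, OutputStream out) {
        try {
            byte[] buf = new byte[8192];
            int filled = 0, n;
            while ((n = in.read(buf, filled, buf.length - filled)) != -1) {
                filled += n;
                if (filled < buf.length) continue; // keep waiting for more
                for (int i = 0; i < filled; i++)
                    buf[i] = (byte) Character.toUpperCase((char) buf[i]);
                out.write(buf, 0, filled);
                filled = 0;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

InputStream.read() is specified to return as soon as at least one byte is available, so the incremental version gets the short result for free; the fill-first version throws that guarantee away.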
When developing a program with a generated lexer/parser, two things are
essential to ensuring that the program won't deadlock that way:
1. The language facilities (InputStream, Reader, etc.) used by the
program to supply input to the generated lexer have to be chosen and
used carefully to avoid blindly blocking for more data when a
partial buffer is available. Generally the language features are
designed so that this requirement can be met if the programmer pays
attention.
2. The lexer generator has to generate code that avoids unnecessary
blocking calls on its input source. (I once had to give up on an
ANTLR competitor for a given project, because there was NO WAY to
make it generate a lexer with this property.)
I had a thought on a lexer-grammar feature that could be useful for
both of these issues (stateful lexing, and coprocess usability).
Suppose you could mark, at chosen points in the lexer grammar, a SYNC
POINT. This is a point beyond which the generated lexer will buffer no
more tokens, and make no more blocking calls on input, until the parser
has caught up with the token buffer and requested another token. For
stateful lexing you would plant a sync point at the spots where the
parser might want to switch the lexing behavior. For deadlock avoidance
your sync points would be at newlines or semicolons or whatever could
terminate an independently executable statement in your language.
The generator should be able to analyze the parser and lexer to warn of
any decisions where lookahead would have to cross a sync point.
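A sketch of the proposed semantics (this is not real ANTLR 3 API; the lexer is modeled as a plain Supplier of token texts, ";" is the hypothetical sync point, and "<EOF>" ends the input):

```java
import java.util.*;
import java.util.function.Supplier;

// nextToken() refills its buffer at most up to the next sync point,
// so no further (possibly blocking) pull on the input happens until
// the parser has drained the buffered tokens and asks again.
class SyncPointSource {
    static final String SYNC = ";";
    static final String EOF = "<EOF>";
    private final Supplier<String> lexer;
    private final Deque<String> pending = new ArrayDeque<>();

    SyncPointSource(Supplier<String> lexer) { this.lexer = lexer; }

    String nextToken() {
        if (pending.isEmpty()) fillToSyncPoint();
        return pending.poll();
    }

    // Pull tokens (each pull may block on input) only until the first
    // sync point or end of file, then stop touching the input.
    private void fillToSyncPoint() {
        String t;
        do {
            t = lexer.get();
            pending.add(t);
        } while (!t.equals(SYNC) && !t.equals(EOF));
    }
}
```

The interesting property is visible from the pull count: after the parser takes the first token of a statement, the source has read exactly up to the terminating ";" and no further, so the bytes of the next statement need not even exist yet.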
I think this could be a quite useful feature. Comments?
Chapman Flack