[antlr-interest] Example code-generation target that outputs a state machine

David-Sarah Hopwood david-sarah at jacaranda.org
Fri Oct 16 22:38:20 PDT 2009


Mike Samuel wrote:
>> Mike Samuel wrote:
>>> For background, my end goal is to bolt string interpolation onto
>>> javascript, but in a way that doesn't introduce XSS problems as
>>> described at
>>> http://google-caja.googlecode.com/svn/changes/mikesamuel/string-interpolation-29-Jan-2008/trunk/src/js/com/google/caja/interp/index.html
> 
> David-Sarah Hopwood wrote:
>> Sounds like a fun project.
> 
>> ANTLR normally only generates explicit DFAs when there is a nontrivial
>> decision between alternatives of a rule. In the case of the ECMAScript 5
>> lexical grammar I'm using, it only generates two DFAs, one for the main
>> tokens rule, and one for DecimalLiteral.
> 
>> You can get ANTLRWorks to show a state machine diagram for any particular
>> decision, or for the tokens rule, but they're not particularly
>> enlightening. In principle, ANTLR computes all the information needed to
>> produce a pure FSM lexer, but I don't think it provides any way to actually
>> generate one, at least not without a lot of work. If that's definitely
>> what you need, ANTLR is probably not the right tool.
> 
> Thanks for the response, David-Sarah.  Fancy running into you here.  I
> guess you're using ANTLR for Jacaranda then?

Yes. Should be released fairly soon.

> How do you get a lexical grammar to distinguish division
> operators from regex literals using ES5 rules, or are you using the
> lexical grammar from Waldemar's JS 1.9 then?

I'm using feedback from the parser to the lexer. I looked carefully at
trying to use a finite state machine for this, but that approach can't
work if you want to parse full ES5 in all corner cases. The problematic
cases are things like "while (expr) /regexp/;", because it's impossible
to tell whether the '/' starts a regexp literal without tracking
whether the parentheses in 'expr' are balanced. Since expressions are
obviously not a regular language, that would require an infinite state
machine; parser -> lexer feedback turns out to be simpler.
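
To make that concrete, compare these two (illustrative) inputs:

    while (p) /foo/.test(s);   // '/' after the ')' begins a regexp literal
    x = (a + b) / c;           // '/' after the ')' is division

At the '/', the lexer would have to know whether the ')' it has just
seen closes the condition of a 'while' or merely a parenthesized
subexpression, and distinguishing those requires matching the
parentheses.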

Actually, if I had known how much complexity parsing regexp literals
correctly was going to add, I wouldn't have bothered; it's horrendous.
But that part is done now and seems to work.

Here's the comment I wrote about it:

====
ES5 says,

# There are two goal symbols for the lexical grammar. The InputElementDiv
# symbol is used in those syntactic grammar contexts where a leading
# division (/) or division-assignment (/=) operator is permitted. The
# InputElementRegExp symbol is used in other syntactic grammar contexts.

("Permitted" means syntactically valid in the *ES5* grammar.)

We don't quite follow this specification literally: we lex as a
RegularExpressionLiteral only in expression contexts, where a
regexp literal is permitted, rather than in all contexts where a
DivPunctuator is *not* permitted. This is equivalent for
syntactically valid input, since there are no contexts in which
both are permitted (the ES3 spec suggested otherwise, but see
<http://bugs.ecmascript.org/ticket/463>).

This has the advantage that, for error recovery, it is more sensible to
lex a misplaced '/' or '/=' as a misplaced DivPunctuator rather than
as part of a RegularExpressionLiteral, since the latter is less
common (and would result in code up to the next '/' being ignored).

Implementation:

The lexer first lexes '/' as part of a DIV or DIVASSIGN token
(see the DIV rule in Punctuators above). If the parser finds such a
token in an expression context, then it resets the lexer and its
input stream so that the '/' will be the next char to be consumed.
A flag is also set in the lexer so that the next invocation of the
DIV rule will expect a regular expression literal. Then, the token
stream is reset to before the DIV or DIVASSIGN, discarding that
token and any following tokens read due to lookahead. This ensures
that when rewriting, the rewritten token stream does not contain
spurious tokens.

To support this functionality we must use a custom LazyTokenRewriteStream.
If the usual CommonTokenStream or TokenRewriteStream were used, the
lexer would lex all tokens on the first call to it (see the fillBuffer()
method of CommonTokenStream). Even if that were not the case, the lexer
necessarily runs ahead of the parser, due to token lookahead in the
parser. If the lexer may be reset, then lexing too far ahead is
inefficient, and more importantly causes spurious errors to be reported.

To solve this problem, LazyTokenRewriteStream only lexes tokens that
have been requested (typically by consume() or LA(k)). Each JacarandaToken
object records the errors associated with that token, so that these errors
can be reported only if the token remains part of the final token stream.
This also has the advantage of reporting all errors in the order of their
associated token position, which is a deterministic order independent of
lookahead depth.
====
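
As a toy illustration of the reset protocol (plain Java, no ANTLR; every
name here -- ToyLexer, expectRegexp, resetTo -- is invented for the
sketch, and DIVASSIGN, regexp flags/classes and error handling are all
omitted):

    enum Kind { DIV, REGEXP, PUNCT, EOF }

    final class Tok {
        final Kind kind; final int start; final String text;
        Tok(Kind k, int s, String t) { kind = k; start = s; text = t; }
        public String toString() { return kind + "(" + text + ")"; }
    }

    final class ToyLexer {
        private final String src;
        private int pos = 0;
        // Set by the parser when the next '/' should start a regexp
        // literal instead of being a division operator.
        boolean expectRegexp = false;

        ToyLexer(String src) { this.src = src; }

        // Rewind so that the character at 'start' is the next consumed.
        void resetTo(int start) { pos = start; }

        Tok next() {
            while (pos < src.length()
                   && Character.isWhitespace(src.charAt(pos))) pos++;
            if (pos >= src.length()) return new Tok(Kind.EOF, pos, "");
            int start = pos;
            char c = src.charAt(pos++);
            if (c == '/') {
                if (!expectRegexp) return new Tok(Kind.DIV, start, "/");
                expectRegexp = false;
                while (pos < src.length() && src.charAt(pos) != '/') pos++;
                if (pos < src.length()) pos++;   // past the closing '/'
                return new Tok(Kind.REGEXP, start, src.substring(start, pos));
            }
            return new Tok(Kind.PUNCT, start, String.valueOf(c));
        }
    }

    public class FeedbackDemo {
        public static void main(String[] args) {
            ToyLexer lexer = new ToyLexer("= /abc/ ;");
            System.out.println(lexer.next());     // PUNCT(=)
            Tok t = lexer.next();                 // first lexed as DIV
            // The parser knows '/' cannot be division right after '=',
            // so it rewinds the lexer and asks again:
            if (t.kind == Kind.DIV) {
                lexer.resetTo(t.start);
                lexer.expectRegexp = true;
                t = lexer.next();                 // REGEXP(/abc/)
            }
            System.out.println(t);
            System.out.println(lexer.next());     // PUNCT(;)
        }
    }

The real thing additionally has to cooperate with ANTLR's token
buffering, which is where LazyTokenRewriteStream comes in.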

> I poked about a bit more, and when I generate a lexer, I get code that
> looks like the below, which seems to create instances of
> org.antlr.runtime.DFA as described at
> http://www.antlr.org/api/Java/classorg_1_1antlr_1_1runtime_1_1_d_f_a.html
> .  I think I can probably poke at those to produce both (1) an
> alphabet reduction, and (2) a transition table.

Probably -- but as I said, only two (of about 42) of the decisions in an
ES5 lexer (written straightforwardly following the spec) generate DFAs;
the rest generate 'if' and 'switch' statements. Both of these DFAs are
almost exactly what you would write by hand, so I don't think you're
gaining much by using ANTLR to generate them.
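
If you do go that route, here's a minimal sketch of walking such a
table, assuming you can get at the unpacked per-state
min/max/transition/accept arrays that the generated code builds from
its encoded strings (the names below are invented, and this ignores the
eot/eof/special arrays that handle fallback transitions and predicated
states):

    public class DumpDfaTables {
        // Walk a DFA stored the way ANTLR 3's generated lexers store it:
        // per-state min/max chars plus a transition row indexed by
        // (c - min[s]); accept[s] >= 0 marks an accepting state.
        static void dump(char[] min, char[] max,
                         short[][] transition, short[] accept) {
            for (int s = 0; s < transition.length; s++) {
                for (int c = min[s];
                     c <= max[s] && c - min[s] < transition[s].length; c++) {
                    int next = transition[s][c - min[s]];
                    if (next >= 0)
                        System.out.printf("state %d --'%c'--> state %d%n",
                                          s, (char) c, next);
                }
                if (accept[s] >= 0)
                    System.out.printf("state %d accepts alt %d%n",
                                      s, accept[s]);
            }
        }

        public static void main(String[] args) {
            // Tiny hand-built table: state 0 goes to state 1 on 'a'..'b';
            // state 1 accepts alternative 1.
            dump(new char[] { 'a', 0 }, new char[] { 'b', 0 },
                 new short[][] { { 1, 1 }, {} }, new short[] { -1, 1 });
        }
    }

An alphabet reduction would then be a separate pass grouping characters
that behave identically in every state.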

-- 
David-Sarah Hopwood  ⚥  http://davidsarah.livejournal.com


