[antlr-interest] Example code-generation target that outputs a state machine

Fri Oct 16 20:01:13 PDT 2009

> Mike Samuel wrote:
> > I'd like to derive a state machine that recognizes a combined lexical
> > grammar of JS/HTML/CSS (hacked so that JS has a regular lexical
> > grammar) and a mapping from states to the production they're part of.
> > I need to keep the number of states small.
> >
> > I saw something tantalizing about "dfaState() String Template" at
> > http://www.antlr.org/wiki/display/ANTLR3/How+to+build+an+ANTLR+code+generation+target
> > but am still unsure how to proceed.
> >
> > Is this possible with ANTLR, and if so, does anyone know of existing
> > code I could adapt?
> >
> > For background, my end goal is to bolt string interpolation onto
> > javascript, but in a way that doesn't introduce XSS problems as
> > described at
> > http://google-caja.googlecode.com/svn/changes/mikesamuel/string-interpolation-29-Jan-2008/trunk/src/js/com/google/caja/interp/index.html

David-Sarah Hopwood wrote:
> Sounds like a fun project.

> ANTLR normally only generates explicit DFAs when there is a nontrivial
> decision between alternatives of a rule. In the case of the ECMAScript 5
> lexical grammar I'm using, it only generates two DFAs, one for the main
> tokens rule, and one for DecimalLiteral.

> You can get ANTLRWorks to show a state machine diagram for any particular
> decision, or for the tokens rule, but they're not particularly
> enlightening. In principle, ANTLR computes all the information needed to
> produce a pure FSM lexer, but I don't think it provides any way to actually
> generate one, at least not without a lot of work. If that's definitely
> what you need, ANTLR is probably not the right tool.

Thanks for the response, David-Sarah.  Fancy running into you here.  I
guess you're using ANTLR for Jacaranda then?
How do you get a lexical grammar to distinguish division
operators/regex literal using ES5 rules, or are you using the lexical
grammar from Waldemar's JS 1.9 then?

I poked about a bit more, and when I generate a lexer, I get code that
looks like the below which sees to create instances of
org.antlr.runtime.DFA as described at
http://www.antlr.org/api/Java/classorg_1_1antlr_1_1runtime_1_1_d_f_a.html
.  I think I can probably poke at those to produce both (1) an
alphabet reduction, and (2) a transition table.
It won't integrate nicely into a build from grammar, and it's hacky,
but if the DFA just isn't exposed to code-generation targets, then
it'll be easier than trying to engineer a way through the front door.

======
    static final short[] DFA19_eot = DFA.unpackEncodedString(DFA19_eotS);
    // more unpacking elided

    static {
        int numStates = DFA19_transitionS.length;
        DFA19_transition = new short[numStates][];
        for (int i=0; i<numStates; i++) {
            DFA19_transition[i] = DFA.unpackEncodedString(DFA19_transitionS[i]);
        }
    }

    class DFA19 extends DFA {

        public DFA19(BaseRecognizer recognizer) {
            this.recognizer = recognizer;
            this.decisionNumber = 19;
            this.eot = DFA19_eot;
            // more grabbing of statics elided
        }
        public String getDescription() {
            return /* large chunks of my grammar elided */;
        }
    }