[antlr-interest] Example code-generation target that outputs a state machine

Fri Oct 16 00:50:37 PDT 2009

Mike Samuel wrote:
> I'd like to derive a state machine that recognizes a combined lexical
> grammar of JS/HTML/CSS (hacked so that JS has a regular lexical
> grammar) and a mapping from states to the production they're part of.
> I need to keep the number of states small.
> 
> I saw something tantalizing about "dfaState() String Template" at
> http://www.antlr.org/wiki/display/ANTLR3/How+to+build+an+ANTLR+code+generation+target
> but am still unsure how to proceed.
> 
> Is this possible with ANTLR, and if so, does anyone know of existing
> code I could adapt?
> 
> For background, my end goal is to bolt string interpolation onto
> javascript, but in a way that doesn't introduce XSS problems as
> described at
> http://google-caja.googlecode.com/svn/changes/mikesamuel/string-interpolation-29-Jan-2008/trunk/src/js/com/google/caja/interp/index.html

Sounds like a fun project.

ANTLR normally only generates explicit DFAs when there is a nontrivial
decision between alternatives of a rule. In the case of the ECMAScript 5
lexical grammar I'm using, it only generates two DFAs, one for the main
tokens rule, and one for DecimalLiteral.

You can get ANTLRWorks to show a state machine diagram for any particular
decision, or for the tokens rule, but they're not particularly
enlightening. In principle, ANTLR computes all the information needed to
produce a pure FSM lexer, but I don't think it provides any way to actually
generate one, at least not without a lot of work. If that's definitely
what you need, ANTLR is probably not the right tool.

-- 
David-Sarah Hopwood  ⚥  http://davidsarah.livejournal.com