[antlr-interest] Parsing CSS accurately and fast

Sun Sep 18 19:04:11 PDT 2011

We've been trying to build a high-performance yet accurate CSS parser using
Antlr for the last few months.

To date, our efforts have yielded accuracy, but not performance.

The main difficulty with CSS is its error-handling rules, the CSS parsing
conventions <http://www.w3.org/TR/CSS21/syndata.html#parsing-errors>, which
specify how to correctly handle parse errors.

There is a core syntax
<http://www.w3.org/TR/CSS21/syndata.html#tokenization> that all versions of
CSS use. Conceptually, to parse, say, CSS2.1, we first parse the file
according to the core syntax, and then flesh out the parse tree with the
CSS2.1 grammar. The core syntax determines how much input is skipped when
invalid tokens are seen, so error recovery comes out right.
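
To make the two-pass idea concrete, here is a minimal sketch in Python rather than Antlr. It is not our actual implementation: the property table and the function names are made up for illustration, and it only covers a single declaration block. The first pass follows the core syntax loosely by splitting on top-level semicolons; the second pass applies a (tiny, hypothetical) level-specific check and drops whatever fails.

```python
import re

# Hypothetical stand-in for the CSS2.1 property grammar (illustration only).
KNOWN_PROPERTIES = {
    "color": re.compile(r"^(red|green|blue|#[0-9a-fA-F]{3}|#[0-9a-fA-F]{6})$"),
    "margin": re.compile(r"^(0|auto|-?\d+(\.\d+)?(px|em|%))$"),
}


def split_declarations(block):
    """Pass 1 (core syntax, simplified): split on top-level ';'.

    Semicolons inside strings or inside (), [] and {} do not end a
    declaration, so a malformed declaration is skipped as one unit.
    """
    parts, current, depth, quote = [], [], 0, None
    i = 0
    while i < len(block):
        ch = block[i]
        if quote:
            current.append(ch)
            if ch == "\\" and i + 1 < len(block):
                # Keep the escaped character so it cannot close the string.
                current.append(block[i + 1])
                i += 1
            elif ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
            current.append(ch)
        elif ch in "([{":
            depth += 1
            current.append(ch)
        elif ch in ")]}":
            depth = max(0, depth - 1)
            current.append(ch)
        elif ch == ";" and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    return [p.strip() for p in parts if p.strip()]


def parse_declarations(block):
    """Pass 2: keep only declarations the level-specific check accepts."""
    accepted = []
    for decl in split_declarations(block):
        name, sep, value = decl.partition(":")
        name, value = name.strip().lower(), value.strip()
        pattern = KNOWN_PROPERTIES.get(name)
        if sep and pattern and pattern.match(value):
            accepted.append((name, value))
    return accepted
```

For example, parse_declarations("color: red; colour: blue; color: 12{;}bad; margin: auto") keeps only the first and last declarations. The unknown property and the malformed value are each dropped as a whole, without disturbing their neighbours.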

We implemented it this way; see this Stack Overflow question:
http://stackoverflow.com/questions/5437835/parsing-css-2-1-with-the-correct-css-parsing-conventions-in-antlr

However, this double parsing creates a new instance of the CSS2.1 parser for
each successfully parsed piece of the core grammar. This results in
extremely slow parse times.

We also tried rewriting the input stream and adding custom terminators
around each piece parsed by the CSS core grammar, and feeding the result in
its entirety to the CSS2.1 parser (augmented with rules for the custom
terminators), but this turned out to be even slower.

Is there a way to do better than this in Antlr?

At this point, we're considering writing a hand-coded recursive descent
parser, but hopefully there is a better way with Antlr :)

Regards,

Vivek