[antlr-interest] V3 grammar writing tactics

Fri Dec 15 08:49:23 PST 2006

Hi all,

Can I pick the collective brains of the group's experts with a rather
general question about approaches to writing grammars?

Background
I am writing a grammar to produce ASTs for a Pascal variant of similar
complexity to Java, except that case insensitivity is required. I have
worked up from lexer tokens via expression rules to rules accepting
blocks of statements and am approaching completion with rules which will
enable programme source files to be analysed. The grammars have no
parser/lexer errors as the lower stages give no warnings about
ambiguity, recursion etc. However ANTLR V3.0b5 runs slower than 3.0b4 or
earlier, taking minutes rather than seconds for my grammars, and uses
large amounts of Heap memory (I have seen about 500MB in a profiler). I
have now reached an impasse because the more developed grammars cannot
be analysed because of runtime Heap overflow, even on command line runs.
I think that this is because the upper level grammars greatly increase
the complexity of the resultant AST by linking disparate branches
together. The running time is not very important to me, but Heap errors
that terminate analysis are show stoppers.

Tactics
My tactic for this task was to keep the individual rules as simple as
possible to aid understanding their operation. As a result there are a
large number of rules, I estimate that slightly more than 300 will be
needed in the full, tree constructing, grammar. A large number of these
rules, 148, are lexer tokens introduced to overcome V3 case sensitivity.
To stop ambiguity I allow a full k* analysis to occur.

Observations
The step that induced the final Heap overflow increased the rules from
272 to 274 to take the scope from "Block" to "Programme Block".
Terence analyses Java 1.5 files in 148 rules, virtually none of which
are tokens. (Note- the V3 J1.5 java.g only produces a flat linked AST.
>From this stage my grammars needed about 20 new rules to give the tree
topology which I require.) The java grammar is also analysed very
quickly on 3.0b5 and probably also uses minimal Heap.
Inspection shows that Ter's tactic was to write fewer, more complex
rules as distinct from my own more but simpler approach. He has also the
restriction, k=2, and accepts that two warnings are permissible.

Questions
1) Is it possible to generalise that one of these approaches to rule
writing is likely to produce grammars which are less memory heavy and
run faster? (That is - fewer, more complex rules vs. more, simpler ones.)
2) If so, is the second approach inevitably the "bad" one?
3) If not, should the aim be to keep k as low as possible whilst
eliminating all ambiguities?
4) If not, should the aim be to keep k as low as possible at the expense
 of accepting limited ambiguities?

Apologies for the length, but thanks anyway,
Andrew Smith