[antlr-interest] V3 grammar writing tactics

Jim Idle jimi at intersystems.com
Fri Dec 15 12:30:04 PST 2006


Andrew,

A couple of points for you.

Rather than trying to use lexer tokens for case insensitivity, try the code
below this text. It will use upper case only to recognize tokens, but
will preserve case in the input stream, which is probably what most people
want.

Perhaps recent grammar analysis changes have bitten you on this one; I
know some bugs have been fixed and other things tweaked since the release
you are using, but I am not sure exactly what.

Assuming that the analysis will eventually finish and not just loop
forever (ANTLRWorks does this on certain strange inputs right now), use
the command line option -Xmx1000M on your java invocation (this assumes
Windows; you may need to consult the java command line documentation for
your actual system).
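
For example, a direct command line run of the ANTLR tool with a 1000MB
heap might look like this (the classpath and grammar file name are just
placeholders for your own setup):

    java -Xmx1000M -cp <antlr-and-runtime-jars> org.antlr.Tool MyGrammar.g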

Jim


Use the following class where you would normally use an ANTLRFileStream,
to create a case-insensitive lexer which preserves case in the input
stream:

import org.antlr.runtime.*;
import java.io.*;

/**
 * A case-insensitive ANTLRFileStream: LA() folds lookahead characters to
 * upper case so the lexer need only match upper case, while the buffer
 * (and hence token text) keeps the original case from the input file.
 *
 * @author jimi
 */
public class ANTLRNoCaseFileStream extends ANTLRFileStream
{
    public ANTLRNoCaseFileStream(String fileName) throws IOException {
        super(fileName, null);
    }

    public ANTLRNoCaseFileStream(String fileName, String encoding) throws IOException {
        super(fileName, encoding);
    }

    public int LA(int i) {
        if ( i == 0 ) {
            return 0; // undefined
        }
        if ( i < 0 ) {
            i++; // e.g., translate LA(-1) to use offset 0
        }
        if ( (p+i-1) >= n ) {
            //System.out.println("char LA("+i+")=EOF; p="+p);
            return CharStream.EOF;
        }
        //System.out.println("char LA("+i+")="+data[p+i-1]+"; p="+p);
        // Fold only the character handed to the lexer; the buffer itself is untouched
        return Character.toUpperCase(data[p+i-1]);
    }
}
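
For example, assuming classes PascalLexer and PascalParser generated from
a grammar with a start rule called program (all three names are just
placeholders for whatever your grammar generates), the stream is a
drop-in replacement for ANTLRFileStream:

import org.antlr.runtime.*;

public class RunPascal {
    public static void main(String[] args) throws Exception {
        // Same constructor arguments as ANTLRFileStream; only LA() behaves differently
        CharStream input = new ANTLRNoCaseFileStream(args[0]);
        PascalLexer lexer = new PascalLexer(input);       // placeholder generated lexer
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        PascalParser parser = new PascalParser(tokens);   // placeholder generated parser
        parser.program();                                  // placeholder start rule
    }
}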

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Andrew Smith
Sent: Friday, December 15, 2006 8:49 AM
To: antlr-interest at antlr.org
Subject: [antlr-interest] V3 grammar writing tactics

Hi all,

Can I pick the collective brains of the group's experts with a rather
general question about approaches to writing grammars?

Background
I am writing a grammar to produce ASTs for a Pascal variant of similar
complexity to Java, except that case insensitivity is required. I have
worked up from lexer tokens through expression rules to rules accepting
blocks of statements, and am approaching completion with rules which will
enable programme source files to be analysed. The grammars have no
parser/lexer errors, and the lower stages give no warnings about
ambiguity, recursion etc. However, ANTLR V3.0b5 runs more slowly than
3.0b4 or earlier, taking minutes rather than seconds for my grammars, and
uses large amounts of heap memory (I have seen about 500MB in a
profiler). I have now reached an impasse: the more developed grammars
cannot be analysed at all because of runtime heap overflow, even on
command line runs. I think this is because the upper level grammars
greatly increase the complexity of the resultant AST by linking disparate
branches together. The running time is not very important to me, but heap
errors that terminate analysis are show-stoppers.

Tactics
My tactic for this task was to keep the individual rules as simple as
possible, to aid understanding of their operation. As a result there are
a large number of rules; I estimate that slightly more than 300 will be
needed in the full, tree-constructing, grammar. A large number of these
rules, 148, are lexer tokens introduced to overcome V3 case sensitivity.
To avoid ambiguity I allow a full k=* analysis to occur.
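
For illustration, each of these case-insensitive keyword tokens looks
something like the following (BEGIN is just one example of the pattern):

BEGIN : ('b'|'B')('e'|'E')('g'|'G')('i'|'I')('n'|'N') ;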

Observations
The step that induced the final heap overflow increased the rule count
from 272 to 274, taking the scope from "Block" to "Programme Block".
Terence analyses Java 1.5 files in 148 rules, virtually none of which
are tokens. (Note: the V3 J1.5 java.g only produces a flat linked AST.
From this stage my grammars needed about 20 new rules to give the tree
topology which I require.) The java grammar is also analysed very
quickly on 3.0b5 and probably also uses minimal heap.
Inspection shows that Ter's tactic was to write fewer, more complex
rules, as distinct from my own approach of more, but simpler, rules. He
also uses the restriction k=2, and accepts that two warnings are
permissible.

Questions
1) Is it possible to generalise that one of these approaches to rule
writing is likely to produce grammars which are less memory-heavy and
run faster? (That is, fewer, more complex rules vs. more, simpler
ones.)
2) If so, is the second approach inevitably the "bad" one?
3) If not, should the aim be to keep k as low as possible whilst
eliminating all ambiguities?
4) If not, should the aim be to keep k as low as possible at the expense
 of accepting limited ambiguities?

Apologies for the length, but thanks anyway,
Andrew Smith

