[antlr-interest] Greedy matching to end of line

Robert J. Hansen rjh at sixdemonbag.org
Thu Jan 27 23:51:58 PST 2011


I haven't done any work with lexers and parsers in many years, and
figured a good way to go about getting re-acquainted would be to find a
big corpus of text and put together a translator.  The corpus I had
around was the ARIN WHOIS information, which is basically key-value
coding in a record-based format.  Newlines are significant, but other
whitespace generally isn't.

I'm now running into a brick wall, though, with trying to enable greedy
matching -- scarfing up everything to end-of-line and returning that
back as a string.  I can *almost* do it, but I'm getting killed on some
corner cases.

The following is an abbreviated version of the grammar.  It still
exhibits the bug, but all actions, etc., have been omitted.

=====
grammar foo;

file	: (block|NEWLINE)*;
block	: asblock
	| netblock;
asblock	: asbegin asline* NEWLINE;
netblock: netbegin netline* NEWLINE;
netline	: n_nh;
netbegin: 'NetHandle:' words;
n_nh	: 'NOCHandle:' words;
asline	: 'Comment:' words;
asbegin	: 'ASHandle:' words;
words 	: word (word)* NEWLINE
	| NEWLINE;
word	: WORD;
WORD    : ~(' '|'\t'|'\r'|'\n')+;
NEWLINE : '\r'?'\n';
WS	: (' '|'\t') { skip(); };
=====

... Now, consider the derivation of the line:

	Comment: NOCHandle John Q. Hacker

... starting from rule asline.  asline derives out to 'Comment:' on the
left, words on the right, and from there straight to NoViableAltException.

However, if I change it to:

	Comment: NCHandle John Q. Hacker

... then it derives successfully.

It appears that when trying to derive the words rule, it sees that rule
n_nh could also apply and can't decide what derivation to use.  But why?
n_nh is not listed as a child rule of words.  How can I fix this so
that the words rule will grab *everything* to the end of the line?
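
To be concrete about the behavior I am after, here is a minimal plain-Java
sketch of what I would like a key/value line to come back as.  This is not
ANTLR code and not a fix for the grammar; the class name WhoisLine and the
method split are made up purely for illustration.

```java
// Hypothetical helper, not part of ANTLR or of the grammar above.
final class WhoisLine {
    // Splits "Key: rest of line" at the first colon.  The value is everything
    // after that colon up to end-of-line, with surrounding whitespace trimmed.
    // Returns a two-element array: {key, value}.
    static String[] split(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("no key in line: " + line);
        }
        String key = line.substring(0, colon).trim();
        String value = line.substring(colon + 1).trim();
        return new String[] { key, value };
    }

    public static void main(String[] args) {
        String[] kv = split("\tComment: NOCHandle John Q. Hacker");
        // The value keeps "NOCHandle" as ordinary text, because nothing after
        // the first colon is treated as a keyword.
        System.out.println(kv[0] + " -> " + kv[1]);
    }
}
```

In other words, once the key has been recognized, nothing else on the line
should be treated as a keyword.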

My second concern: when trying to parse a multi-gig file using a grammar
much like the above, Java demands it be given absurdly huge heap sizes.
I am assuming that, like most compilers, ANTLR has to construct the
entire tree in memory before it can walk the tree doing various actions.
However, if there's some way to mitigate the heap memory problem, I
would be deeply appreciative.
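
For comparison, the memory profile I am hoping for is roughly that of a
hand-written reader that holds only one record at a time.  A minimal sketch,
assuming records are separated by blank lines as in the grammar above; the
class RecordStreamer, the method forEachRecord, and the sample data are
hypothetical and are not an ANTLR API.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: hands one record at a time to a callback, so only the
// current record's lines are held in memory.
final class RecordStreamer {
    // Returns the number of records passed to the handler.
    static int forEachRecord(Iterator<String> lines, Consumer<List<String>> handler) {
        List<String> record = new ArrayList<>();
        int count = 0;
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.trim().isEmpty()) {
                // Blank line: the current record (if any) is complete.
                if (!record.isEmpty()) {
                    handler.accept(record);
                    count++;
                    record = new ArrayList<>();
                }
            } else {
                record.add(line);
            }
        }
        if (!record.isEmpty()) { // last record with no trailing blank line
            handler.accept(record);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample input, for illustration only.
        String sample = "ASHandle: AS0\nComment: first\n\nNetHandle: NET-0\nNOCHandle: X\n\n";
        BufferedReader in = new BufferedReader(new StringReader(sample));
        // BufferedReader.lines() is lazy, so the input is read as it is consumed.
        int n = forEachRecord(in.lines().iterator(),
                r -> System.out.println(r.size() + " lines, starts with: " + r.get(0)));
        System.out.println(n + " records");
    }
}
```

If the generated parser could be driven one record at a time in a similar
way, the heap should only need to hold one block's worth of tree.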

Thank you all for your help!


