[antlr-interest] Greedy matching to end of line
Robert J. Hansen
rjh at sixdemonbag.org
Thu Jan 27 23:51:58 PST 2011
I haven't done any work with lexers and parsers in many years, and
figured a good way to go about getting re-acquainted would be to find a
big corpus of text and put together a translator. The corpus I had
around was the ARIN WHOIS information, which is basically key-value
coding in a record-based format. Newlines are significant, but other
whitespace generally isn't.
I'm now running into a brick wall, though, with trying to enable greedy
matching -- scarfing up everything to end-of-line and returning that
back as a string. I can *almost* do it, but I'm getting killed on some
corner cases.
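To make the goal concrete, here is a minimal plain-Java sketch (no ANTLR involved) of the result I am after for a single line: the key on the left of the first colon, and everything else up to end-of-line as one string. The class name WhoisLine and the method split are made up for illustration, and trimming only the surrounding whitespace is an assumption on my part.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper, for illustration only: not part of the grammar or of ANTLR.
final class WhoisLine {
    /**
     * Splits "Key: rest of line" at the first colon. The value is everything
     * after that colon, with leading and trailing whitespace trimmed.
     */
    static Map.Entry<String, String> split(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("no key separator in: " + line);
        }
        String key = line.substring(0, colon).trim();
        String value = line.substring(colon + 1).trim();
        return new SimpleEntry<>(key, value);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> kv = split("Comment: NOCHandle John Q. Hacker");
        // prints: Comment -> NOCHandle John Q. Hacker
        System.out.println(kv.getKey() + " -> " + kv.getValue());
    }
}
```

The value here is whatever follows the key, regardless of whether it happens to look like another key.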
The following is an abbreviated version of the grammar. The bug is
present in this, but all actions, etc., have been omitted.
=====
grammar foo;
file    : (block|NEWLINE)*;
block   : asblock
        | netblock;
asblock : asbegin asline* NEWLINE;
netblock: netbegin netline* NEWLINE;
netline : n_nh;
netbegin: 'NetHandle:' words;
n_nh    : 'NOCHandle:' words;
asline  : 'Comment:' words;
asbegin : 'ASHandle:' words;
words   : word (word)* NEWLINE
        | NEWLINE;
word    : WORD;
WORD    : ~(' '|'\t'|'\r'|'\n')+;
NEWLINE : '\r'?'\n';
WS      : (' '|'\t') { skip(); };
=====
Now consider the derivation of the line:
    Comment: NOCHandle John Q. Hacker
starting from rule asline. asline expands to 'Comment:' on the left and
words on the right, and from there goes straight to NoViableAltException.
However, if I change the line to:
    Comment: NCHandle John Q. Hacker
then it derives successfully.
It appears that when the parser tries to derive the words rule, it sees
that rule n_nh could also apply and can't decide which derivation to use.
But why? n_nh is not listed as a child rule of words. How can I fix this
so that the words rule will grab *everything* up to the end of the line?
My second concern: when trying to parse a multi-gigabyte file using a
grammar much like the above, Java demands an absurdly large heap. I am
assuming that, like most compilers, ANTLR has to construct the entire
tree in memory before it can walk the tree and run the various actions.
If there's some way to mitigate the heap memory problem, I would be
deeply appreciative.
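One mitigation I can imagine, sketched below in plain Java, is to cut the input into records before any parsing happens, so that only one record is held in memory at a time. This assumes records are separated by blank lines, which is how the block rules in the grammar above treat them. The class name WhoisRecordStream and the method forEachRecord are hypothetical, and the handler stands in for whatever would parse a single record.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, for illustration only: not an ANTLR API.
final class WhoisRecordStream {
    /**
     * Reads blank-line-separated records one at a time and hands each one
     * to the handler. Returns the number of records seen. Only the current
     * record is buffered, so memory use does not grow with file size.
     */
    static int forEachRecord(Reader in, Consumer<List<String>> handler) {
        BufferedReader reader = new BufferedReader(in);
        List<String> record = new ArrayList<>();
        int count = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    // A blank line closes the current record, if there is one.
                    if (!record.isEmpty()) {
                        handler.accept(record);
                        count++;
                        record = new ArrayList<>();
                    }
                } else {
                    record.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The last record may not be followed by a blank line.
        if (!record.isEmpty()) {
            handler.accept(record);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String sample = "ASHandle: AS1\nComment: NOCHandle John Q. Hacker\n\n"
                      + "NetHandle: NET-1\nNOCHandle: JQH1\n\n";
        int n = forEachRecord(new StringReader(sample),
                r -> System.out.println(r.size() + " lines, first: " + r.get(0)));
        System.out.println(n + " records");
    }
}
```

Each record could then be parsed on its own and discarded, though I do not know whether that is the recommended way to do it with ANTLR.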
Thank you all for your help!