[antlr-interest] Greedy matching to end of line

Fri Jan 28 01:32:57 PST 2011

Hi,

I cannot reproduce this using your supplied grammar: as long as the required
NEWLINE is in place, your example works just fine. If, however, I do not
provide a newline in the input, I'm hit by a NoViableAltException.

That is, for the input "Comment: NOCHandle John Q. Hacker" with no trailing
newline I get the exception you describe, while the same input followed by a
newline ("Comment: NOCHandle John Q. Hacker\n") parses cleanly, which seems
reasonable. I get the same result for NCHandle.

This is, of course, when starting from rule asline.
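
To illustrate why the trailing newline matters, here is a rough sketch in
Python that hand-codes the words rule over a simplified token stream. It is
not ANTLR output: tokenize and matches_words are made-up names, and every
run of non-whitespace is treated as a plain WORD, so it says nothing about
how the generated lexer handles literals such as 'NOCHandle:'. It only
checks the text that follows 'Comment:'.

```python
import re

# Simplified stand-ins for the grammar's lexer rules (an assumption, not ANTLR):
#   NEWLINE : '\r'? '\n' ;   WORD : ~(' '|'\t'|'\r'|'\n')+ ;   WS is skipped.
TOKEN = re.compile(r"(?P<NEWLINE>\r?\n)|(?P<WORD>[^ \t\r\n]+)|(?P<WS>[ \t]+)")


def tokenize(text):
    """Return the token type names for text, dropping whitespace tokens."""
    tokens = []
    for match in TOKEN.finditer(text):
        if match.lastgroup != "WS":
            tokens.append(match.lastgroup)
    return tokens


def matches_words(tokens):
    """Check tokens against:  words : word (word)* NEWLINE | NEWLINE ;"""
    i = 0
    # Consume zero or more WORD tokens.
    while i < len(tokens) and tokens[i] == "WORD":
        i += 1
    # Both alternatives must end with exactly one NEWLINE as the last token.
    return i == len(tokens) - 1 and tokens[i] == "NEWLINE"


line = "NOCHandle John Q. Hacker"
print(matches_words(tokenize(line)))         # False: no NEWLINE to finish the rule
print(matches_words(tokenize(line + "\n")))  # True
```

Both alternatives of words end in NEWLINE, so input that stops at the last
word can never complete the rule, whatever the words themselves are.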

Am I missing something?

Cheers,
Pop

On Fri, Jan 28, 2011 at 8:51 AM, Robert J. Hansen <rjh at sixdemonbag.org> wrote:

> I haven't done any work with lexers and parsers in many years, and
> figured a good way to go about getting re-acquainted would be to find a
> big corpus of text and put together a translator.  The corpus I had
> around was the ARIN WHOIS information, which is basically key-value
> coding in a record-based format.  Newlines are significant, but other
> whitespace generally isn't.
>
> I'm now running into a brick wall, though, with trying to enable greedy
> matching -- scarfing up everything to end-of-line and returning that
> back as a string.  I can *almost* do it, but I'm getting killed on some
> corner cases.
>
> The following is an abbreviated version of the grammar.  The bug is
> present in this, but all actions, etc., have been omitted.
>
> =====
> grammar foo;
>
> file    : (block|NEWLINE)*;
> block   : asblock
>        | netblock;
> asblock : asbegin asline* NEWLINE;
> netblock: netbegin netline* NEWLINE;
> netline : n_nh;
> netbegin: 'NetHandle:' words;
> n_nh    : 'NOCHandle:' words;
> asline  : 'Comment:' words;
> asbegin : 'ASHandle:' words;
> words   : word (word)* NEWLINE
>        | NEWLINE;
> word    : WORD;
> WORD    : ~(' '|'\t'|'\r'|'\n')+;
> NEWLINE : '\r'?'\n';
> WS      : (' '|'\t') { skip(); };
> =====
>
> ... Now, consider the derivation of the line:
>
>        Comment: NOCHandle John Q. Hacker
>
> ... starting from rule asline.  asline derives out to 'Comment:' on the
> left, words on the right, and from there straight to NoViableAltException.
>
> However, if I change it to:
>
>        Comment: NCHandle John Q. Hacker
>
> ... then it derives successfully.
>
> It appears that when trying to derive the words rule, it sees that rule
> n_nh could also apply and can't decide what derivation to use.  But why?
>  n_nh is not listed as a child rule of words.  How can I fix this so
> that the words rule will grab *everything* to the end of the line?
>
> My second concern: when trying to parse a multi-gig file using a grammar
> much like the above, Java demands it be given absurdly huge heap sizes.
>  I am assuming that like most compilers ANTLR has to construct the
> entire tree in memory before it can walk the tree doing various actions:
> however, if there's some way to mitigate the heap memory problem, I
> would be deeply appreciative.
>
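
On the memory question, one possible mitigation (a sketch only, not something
I have tried on the ARIN data) is to avoid handing the whole file to a single
parse and instead process one record at a time. The Python sketch below
assumes records are separated by blank lines, which is what the trailing
NEWLINE on asblock and netblock suggests. The name iter_blocks and the sample
data are made up for illustration; each yielded block could then be passed to
a parser on its own.

```python
import io


def iter_blocks(lines):
    """Lazily yield one record at a time as a list of its non-blank lines.

    Assumes blank lines separate records; only the current record is held
    in memory, regardless of how large the input is.
    """
    block = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip():
            block.append(line)
        elif block:
            # A blank line closes the record collected so far.
            yield block
            block = []
    if block:
        # Final record when the input does not end with a blank line.
        yield block


# Hypothetical sample data in the shape the grammar describes.
sample = io.StringIO(
    "ASHandle: AS1\n"
    "Comment: NOCHandle John Q. Hacker\n"
    "\n"
    "NetHandle: NET-1\n"
    "NOCHandle: ABC\n"
    "\n"
)

for record in iter_blocks(sample):
    print(len(record), record[0])
```

With a real file, passing the open file object instead of the StringIO keeps
the reading lazy, since file objects iterate line by line.
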
> Thank you all for your help!