[antlr-interest] java.lang.OutOfMemoryError: Java heap space
rerdavies at rogers.com
Wed Jun 6 13:01:53 PDT 2007
> fragment DEC_OCTET
> : DIGIT // 0-9
> | '1'..'9' DIGIT // 10-99
> | '1' DIGIT DIGIT // 100-199
> | '2' '0'..'4' DIGIT // 200-249
> | '25' '0'..'5' // 250-255
Have to wonder whether this is really a smart thing to do. You're using a
lexer to enforce a semantic restriction: namely that a DEC_OCTET must have a
value between of 0 to 255.
>From an efficiency point of view, wouldn't it make sense to go for
DEC_OCTET: DIGIT+; (not a fragment!)
and then build addresses at the parser level rather than at the lexer level,
and enforce semantic restrictions either with predicates, or (even better, I
think) in the processing code.
One of the downsides of this kind of semantic enforcement lexically is that
you end up with crazy error messages like :
Error: Expecting ".", "0", "1", "2", "3", "4", or "5".
(not a very helpful error message, in my opinion).
If handle this error in a semantic level then you can provide much more
semantically relevant error messages like:
"Malformed IPv4 address".
Not knowing the details of the ANTLR DFA implementation, I have to think
that the amount of state information that a DFA has to track is going to be
crazy by the time you get a few octets into an IPv4 address. It doesn't
surprise me that the size of the lexer DFA goes ballistic, though.
I'm struggling with this in some of the sample grammars. I can't help
thinking (for example) that it's a very bad idea to treat "\z" in a
C/C++/C# string as a lexical error ("not expecting 'z') rather than a
semantic error ("illegal escape sequence").
----- Original Message -----
From: "Wincent Colaiuta" <win at wincent.com>
To: "ANTLR mail-list" <antlr-interest at antlr.org>
Sent: Wednesday, June 06, 2007 6:03 AM
Subject: Re: [antlr-interest] java.lang.OutOfMemoryError: Java heap space
El 5/6/2007, a las 21:18, Wincent Colaiuta escribió:
> El 5/6/2007, a las 21:04, Terence Parr escribió:
>> Hmm...there is supposed to be a fail safe in their bad times out if it
>> takes too long to build a DFA. is the grammar very big?
>> perhaps you guys can e-mail me or grammars separately and I can try
>> them out.
> The grammar can be viewed here:
> Or directly downloaded here:
Ok, so I've done a bit more investigation with the following results:
- removing references to the IPV6_ADDRESS rule is enough to prevent
the out-of-memory errors; this is a nasty rule and I need to find
some better way of describing it: for the record the original ABNF
grammar in the RFC describes it as follows (not very pretty):
IPv6address = 6( h16 ":" ) ls32
/ "::" 5( h16 ":" ) ls32
/ [ h16 ] "::" 4( h16 ":" ) ls32
/ [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
/ [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
/ [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
/ [ *4( h16 ":" ) h16 ] "::" ls32
/ [ *5( h16 ":" ) h16 ] "::" h16
/ [ *6( h16 ":" ) h16 ] "::"
ls32 = ( h16 ":" h16 ) / IPv4address
h16 = 1*4HEXDIG
- even after removing the IPV6_ADDRESS rule the generated lexer is
still huge (290K lines of C, or 181K lines of Java)
- visually inspecting the generated file shows that most of the
methods are no more than a few hundred lines long and the vast
majority of them are much less
- the vast bulk of the file is occupied by the HOST rule (179K lines
| (IPV4_ADDRESS)=> IPV4_ADDRESS
- removing the IPV4_ADDRESS subrule causes the line count for the
lexer to drop down to a much more reasonable 2K lines of Java; for
reference, the IPV4_ADDRESS subrule is as follows:
fragment IPV4_ADDRESS: DEC_OCTET '.' DEC_OCTET '.' DEC_OCTET '.'
: DIGIT // 0-9
| '1'..'9' DIGIT // 10-99
| '1' DIGIT DIGIT // 100-199
| '2' '0'..'4' DIGIT // 200-249
| '25' '0'..'5' // 250-255
- it's not the IPV4_ADDRESS subrule in itself which is the problem,
rather the way it interacts with the REG_NAME subrule; I can confirm
this because removing the REG_NAME subrule instead of the
IPV4_ADDRESS has the same effect (lexer stays small)
- adding the IPV6_ADDRESS rule back in causes the out-of-memory error
to return immediately
So I've narrowed down the source of the problem quite a bit. I'm not
sure why the interaction between the IPV4_ADDRESS and REG_NAME
subrules would cause such a problem; the RFC acknowledges that they
are ambiguous but it also specifies that IPV4_ADDRESS should be
preferred, so the syntactic predicate should handle that. Given that
I am not interested in the internal structure of the URIs (I only
want to recognize them and move on) I can probably just drop the
reference to IPV4_ADDRESS in the HOST rule because REG_NAME will
match IPv4 addresses anyway.
Does anyone now how I could express the IPV6_ADDRESS rule in a way
that won't cause ANTLR to choke?
More information about the antlr-interest