[antlr-interest] java.lang.OutOfMemoryError: Java heap space
Wincent Colaiuta
win at wincent.com
Wed Jun 6 03:03:44 PDT 2007
El 5/6/2007, a las 21:18, Wincent Colaiuta escribió:
> El 5/6/2007, a las 21:04, Terence Parr escribió:
>
>> Hmm...there is supposed to be a fail safe in their bad times out
>> if it takes too long to build a DFA. is the grammar very big?
>>
>> perhaps you guys can e-mail me or grammars separately and I can
>> try them out.
>
> The grammar can be viewed here:
>
> <http://pastie.textmate.org/68006>
>
> Or directly downloaded here:
>
> <http://pastie.textmate.org/68006/download>
Ok, so I've done a bit more investigation with the following results:
- removing references to the IPV6_ADDRESS rule is enough to prevent
the out-of-memory errors; this is a nasty rule and I need to find
some better way of describing it: for the record the original ABNF
grammar in the RFC describes it as follows (not very pretty):
IPv6address = 6( h16 ":" ) ls32
/ "::" 5( h16 ":" ) ls32
/ [ h16 ] "::" 4( h16 ":" ) ls32
/ [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
/ [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
/ [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
/ [ *4( h16 ":" ) h16 ] "::" ls32
/ [ *5( h16 ":" ) h16 ] "::" h16
/ [ *6( h16 ":" ) h16 ] "::"
ls32 = ( h16 ":" h16 ) / IPv4address
h16 = 1*4HEXDIG
- even after removing the IPV6_ADDRESS rule the generated lexer is
still huge (290K lines of C, or 181K lines of Java)
- visually inspecting the generated file shows that most of the
methods are no more than a few hundred lines long and the vast
majority of them are much less
- the vast bulk of the file is occupied by the HOST rule (179K lines
of Java):
fragment HOST
: IP_LITERAL
| (IPV4_ADDRESS)=> IPV4_ADDRESS
| REG_NAME
;
- removing the IPV4_ADDRESS subrule causes the line count for the
lexer to drop down to a much more reasonable 2K lines of Java; for
reference, the IPV4_ADDRESS subrule is as follows:
fragment IPV4_ADDRESS: DEC_OCTET '.' DEC_OCTET '.' DEC_OCTET '.'
DEC_OCTET;
fragment DEC_OCTET
: DIGIT // 0-9
| '1'..'9' DIGIT // 10-99
| '1' DIGIT DIGIT // 100-199
| '2' '0'..'4' DIGIT // 200-249
| '25' '0'..'5' // 250-255
;
- it's not the IPV4_ADDRESS subrule in itself which is the problem,
rather the way it interacts with the REG_NAME subrule; I can confirm
this because removing the REG_NAME subrule instead of the
IPV4_ADDRESS has the same effect (lexer stays small)
- adding the IPV6_ADDRESS rule back in causes the out-of-memory error
to return immediately
So I've narrowed down the source of the problem quite a bit. I'm not
sure why the interaction between the IPV4_ADDRESS and REG_NAME
subrules would cause such a problem; the RFC acknowledges that they
are ambiguous but it also specifies that IPV4_ADDRESS should be
preferred, so the syntactic predicate should handle that. Given that
I am not interested in the internal structure of the URIs (I only
want to recognize them and move on) I can probably just drop the
reference to IPV4_ADDRESS in the HOST rule because REG_NAME will
match IPv4 addresses anyway.
Does anyone now how I could express the IPV6_ADDRESS rule in a way
that won't cause ANTLR to choke?
Cheers,
Wincent
More information about the antlr-interest
mailing list