[antlr-interest] java.lang.OutOfMemoryError: Java heap space

Wed Jun 6 13:01:53 PDT 2007

> fragment DEC_OCTET
>  : DIGIT               // 0-9
>   | '1'..'9' DIGIT      // 10-99
>   | '1' DIGIT DIGIT     // 100-199
>  | '2' '0'..'4' DIGIT  // 200-249
>   | '25' '0'..'5'       // 250-255
>   ;
Have to wonder whether this is really a smart thing to do. You're using a 
lexer to enforce a semantic restriction: namely that a DEC_OCTET must have a 
value between of 0 to 255.

>From an efficiency point of view, wouldn't it make sense to go for
   DEC_OCTET: DIGIT+;  (not a fragment!)
and then build addresses at the parser level rather than at the lexer level, 
and enforce semantic restrictions either with predicates, or (even better, I 
think) in the processing code.

One of the downsides of this kind of semantic enforcement lexically is that 
you end up with crazy error messages like :

   Input: 1.1.257.1
    Error: Expecting ".", "0", "1", "2", "3", "4", or "5".
(not a very helpful error message, in my opinion).

If handle this error in a semantic level then you can provide much more 
semantically relevant error messages like:
   "Malformed IPv4 address".

Not knowing the details of the ANTLR DFA implementation, I have to think 
that the amount of state information that a DFA has to track is going to be 
crazy by the time you get a few octets into an IPv4 address. It doesn't 
surprise me that the size of the lexer DFA goes ballistic, though.

I'm struggling with this in some of the sample grammars. I can't help 
thinking (for example) that it's a very bad idea to treat   "\z" in a 
C/C++/C# string as a lexical error ("not expecting 'z') rather than a 
semantic error ("illegal escape sequence").

----- Original Message ----- 
From: "Wincent Colaiuta" <win at wincent.com>
To: "ANTLR mail-list" <antlr-interest at antlr.org>
Sent: Wednesday, June 06, 2007 6:03 AM
Subject: Re: [antlr-interest] java.lang.OutOfMemoryError: Java heap space

El 5/6/2007, a las 21:18, Wincent Colaiuta escribió:

> El 5/6/2007, a las 21:04, Terence Parr escribió:
>
>> Hmm...there is supposed to be a fail safe in their bad times out  if it 
>> takes too long to build a DFA. is the grammar very big?
>>
>>  perhaps you guys can e-mail me or grammars separately and I can  try 
>> them out.
>
> The grammar can be viewed here:
>
>   <http://pastie.textmate.org/68006>
>
> Or directly downloaded here:
>
>   <http://pastie.textmate.org/68006/download>

Ok, so I've done a bit more investigation with the following results:

- removing references to the IPV6_ADDRESS rule is enough to prevent
the out-of-memory errors; this is a nasty rule and I need to find
some better way of describing it: for the record the original ABNF
grammar in the RFC describes it as follows (not very pretty):

       IPv6address =                            6( h16 ":" ) ls32
                   /                       "::" 5( h16 ":" ) ls32
                   / [               h16 ] "::" 4( h16 ":" ) ls32
                   / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
                   / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
                   / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
                   / [ *4( h16 ":" ) h16 ] "::"              ls32
                   / [ *5( h16 ":" ) h16 ] "::"              h16
                   / [ *6( h16 ":" ) h16 ] "::"

       ls32        = ( h16 ":" h16 ) / IPv4address

       h16         = 1*4HEXDIG

- even after removing the IPV6_ADDRESS rule the generated lexer is
still huge (290K lines of C, or 181K lines of Java)

- visually inspecting the generated file shows that most of the
methods are no more than a few hundred lines long and the vast
majority of them are much less

- the vast bulk of the file is occupied by the HOST rule (179K lines
of Java):

fragment HOST
   : IP_LITERAL
   | (IPV4_ADDRESS)=> IPV4_ADDRESS
   | REG_NAME
   ;

- removing the IPV4_ADDRESS subrule causes the line count for the
lexer to drop down to a much more reasonable 2K lines of Java; for
reference, the IPV4_ADDRESS subrule is as follows:

fragment IPV4_ADDRESS: DEC_OCTET '.' DEC_OCTET '.' DEC_OCTET '.'
DEC_OCTET;
fragment DEC_OCTET
   : DIGIT               // 0-9
   | '1'..'9' DIGIT      // 10-99
   | '1' DIGIT DIGIT     // 100-199
   | '2' '0'..'4' DIGIT  // 200-249
   | '25' '0'..'5'       // 250-255
   ;

- it's not the IPV4_ADDRESS subrule in itself which is the problem,
rather the way it interacts with the REG_NAME subrule; I can confirm
this because removing the REG_NAME subrule instead of the
IPV4_ADDRESS has the same effect (lexer stays small)

- adding the IPV6_ADDRESS rule back in causes the out-of-memory error
to return immediately

So I've narrowed down the source of the problem quite a bit. I'm not
sure why the interaction between the IPV4_ADDRESS and REG_NAME
subrules would cause such a problem; the RFC acknowledges that they
are ambiguous but it also specifies that IPV4_ADDRESS should be
preferred, so the syntactic predicate should handle that. Given that
I am not interested in the internal structure of the URIs (I only
want to recognize them and move on) I can probably just drop the
reference to IPV4_ADDRESS in the HOST rule because REG_NAME will
match IPv4 addresses anyway.

Does anyone now how I could express the IPV6_ADDRESS rule in a way
that won't cause ANTLR to choke?

Cheers,
Wincent