[antlr-interest] java.lang.OutOfMemoryError: Java heap space

Wed Jun 6 08:16:17 PDT 2007

I think you just want to move most of this to the parser and be done with it Wincent. As you are not trying to do anything with the URI, just recognize it, then complicating the lexer so you can have one URI token does not get you anywhere. Instead of using 'URI' in your parser, you just use 'uri'. I don't think it is analysis bugs, I think it is just that you have produced a massively complicated lexer.

On the number of lines generated, the C output contains a lot of whitespace, comments (especially in lexer rules) and of course formatting of '{' in C style and so will make you feel you are getting more code lines than you actually are, but you will still need lots of them for this lexer!

If you really want to do this in the lexer, then maybe you should take all of the lexer rules to do with the URI stuff and put them into a new parser/lexer combination that you can embed within your main one.

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Wincent Colaiuta
> Sent: Wednesday, June 06, 2007 3:04 AM
> To: ANTLR mail-list
> Subject: Re: [antlr-interest] java.lang.OutOfMemoryError: Java heap
> space
> 
> El 5/6/2007, a las 21:18, Wincent Colaiuta escribió:
> 
> > El 5/6/2007, a las 21:04, Terence Parr escribió:
> >
> >> Hmm...there is supposed to be a fail safe in their bad times out
> >> if it takes too long to build a DFA. is the grammar very big?
> >>
> >>  perhaps you guys can e-mail me or grammars separately and I can
> >> try them out.
> >
> > The grammar can be viewed here:
> >
> >   <http://pastie.textmate.org/68006>
> >
> > Or directly downloaded here:
> >
> >   <http://pastie.textmate.org/68006/download>
> 
> Ok, so I've done a bit more investigation with the following results:
> 
> - removing references to the IPV6_ADDRESS rule is enough to prevent
> the out-of-memory errors; this is a nasty rule and I need to find
> some better way of describing it: for the record the original ABNF
> grammar in the RFC describes it as follows (not very pretty):
> 
>        IPv6address =                            6( h16 ":" ) ls32
>                    /                       "::" 5( h16 ":" ) ls32
>                    / [               h16 ] "::" 4( h16 ":" ) ls32
>                    / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
>                    / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
>                    / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
>                    / [ *4( h16 ":" ) h16 ] "::"              ls32
>                    / [ *5( h16 ":" ) h16 ] "::"              h16
>                    / [ *6( h16 ":" ) h16 ] "::"
> 
>        ls32        = ( h16 ":" h16 ) / IPv4address
> 
>        h16         = 1*4HEXDIG
> 
> - even after removing the IPV6_ADDRESS rule the generated lexer is
> still huge (290K lines of C, or 181K lines of Java)
> 
> - visually inspecting the generated file shows that most of the
> methods are no more than a few hundred lines long and the vast
> majority of them are much less
> 
> - the vast bulk of the file is occupied by the HOST rule (179K lines
> of Java):
> 
> fragment HOST
>    : IP_LITERAL
>    | (IPV4_ADDRESS)=> IPV4_ADDRESS
>    | REG_NAME
>    ;
> 
> - removing the IPV4_ADDRESS subrule causes the line count for the
> lexer to drop down to a much more reasonable 2K lines of Java; for
> reference, the IPV4_ADDRESS subrule is as follows:
> 
> fragment IPV4_ADDRESS: DEC_OCTET '.' DEC_OCTET '.' DEC_OCTET '.'
> DEC_OCTET;
> fragment DEC_OCTET
>    : DIGIT               // 0-9
>    | '1'..'9' DIGIT      // 10-99
>    | '1' DIGIT DIGIT     // 100-199
>    | '2' '0'..'4' DIGIT  // 200-249
>    | '25' '0'..'5'       // 250-255
>    ;
> 
> - it's not the IPV4_ADDRESS subrule in itself which is the problem,
> rather the way it interacts with the REG_NAME subrule; I can confirm
> this because removing the REG_NAME subrule instead of the
> IPV4_ADDRESS has the same effect (lexer stays small)
> 
> - adding the IPV6_ADDRESS rule back in causes the out-of-memory error
> to return immediately
> 
> So I've narrowed down the source of the problem quite a bit. I'm not
> sure why the interaction between the IPV4_ADDRESS and REG_NAME
> subrules would cause such a problem; the RFC acknowledges that they
> are ambiguous but it also specifies that IPV4_ADDRESS should be
> preferred, so the syntactic predicate should handle that. Given that
> I am not interested in the internal structure of the URIs (I only
> want to recognize them and move on) I can probably just drop the
> reference to IPV4_ADDRESS in the HOST rule because REG_NAME will
> match IPv4 addresses anyway.
> 
> Does anyone now how I could express the IPV6_ADDRESS rule in a way
> that won't cause ANTLR to choke?
> 
> Cheers,
> Wincent