[antlr-interest] java.lang.OutOfMemoryError: Java heap space

Wincent Colaiuta win at wincent.com
Wed Jun 6 15:24:09 PDT 2007


On 6/6/2007, at 22:01, Robin Davies wrote:

>> fragment DEC_OCTET
>>   : DIGIT               // 0-9
>>   | '1'..'9' DIGIT      // 10-99
>>   | '1' DIGIT DIGIT     // 100-199
>>   | '2' '0'..'4' DIGIT  // 200-249
>>   | '25' '0'..'5'       // 250-255
>>   ;
> Have to wonder whether this is really a smart thing to do. You're  
> using a lexer to enforce a semantic restriction: namely that a  
> DEC_OCTET must have a value between 0 and 255.
>
> From an efficiency point of view, wouldn't it make sense to go for
>   DEC_OCTET: DIGIT+;  (not a fragment!)
> and then build addresses at the parser level rather than at the  
> lexer level, and enforce semantic restrictions either with  
> predicates, or (even better, I think) in the processing code.

I think you're probably right. I'm still trying to come to grips with  
all these boundaries (lexer/parser, terminal/non-terminal, syntactic/ 
semantic etc).

> One of the downsides of this kind of semantic enforcement lexically  
> is that you end up with crazy error messages like :
>
>   Input: 1.1.257.1
>   Error: Expecting ".", "0", "1", "2", "3", "4", or "5".
> (not a very helpful error message, in my opinion).
>
> If you handle this error at the semantic level then you can provide much  
> more semantically relevant error messages like:
>   "Malformed IPv4 address".

Yes, I again think you're right. Luckily there's a chapter on error  
handling in Ter's book... will have to study up on it! I also need to  
figure out how (and if) it can be done when using the C target...

> Not knowing the details of the ANTLR DFA implementation, I have to  
> think that the amount of state information that a DFA has to track  
> is going to be crazy by the time you get a few octets into an IPv4  
> address. It doesn't surprise me that the size of the lexer DFA goes  
> ballistic, though.

I think the IPv4 address isn't too crazy, but the IPv6 one definitely  
is... I think you're right that the only way to handle it will be to  
use much looser restrictions at the syntactic level and then check at  
the semantic level.
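
As a rough sketch of what that semantic check might look like (in Java, assuming the lexer only promises DIGIT+ for each octet; the Ipv4Check class and its validate method are made up for illustration and are not part of ANTLR):

```java
final class Ipv4Check {
    // Hypothetical helper, not an ANTLR API. Returns null when the text is a
    // well-formed dotted quad, otherwise a human-readable error message.
    static String validate(String text) {
        // A limit of -1 keeps trailing empty parts, so "1.2.3." is rejected.
        String[] parts = text.split("\\.", -1);
        if (parts.length != 4) {
            return "Malformed IPv4 address: expected 4 octets, found " + parts.length;
        }
        for (String part : parts) {
            // Digits only, at most three of them (which also keeps parseInt
            // from overflowing), and no leading zero, as in the strict rule.
            boolean digitsOnly = !part.isEmpty()
                    && part.chars().allMatch(c -> c >= '0' && c <= '9');
            boolean leadingZero = part.length() > 1 && part.charAt(0) == '0';
            if (!digitsOnly || part.length() > 3 || leadingZero) {
                return "Malformed IPv4 address: bad octet \"" + part + "\"";
            }
            int value = Integer.parseInt(part);
            if (value > 255) {
                return "Malformed IPv4 address: octet " + value
                        + " is out of range (0-255)";
            }
        }
        return null;
    }
}
```

With something like this the input 1.1.257.1 would be reported as an out-of-range octet rather than as an unexpected character.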

> I'm struggling with this in some of the sample grammars. I can't  
> help thinking (for example) that it's a very bad idea to treat  
> "\z" in a C/C++/C# string as a lexical error ("not expecting 'z'")  
> rather than a semantic error ("illegal escape sequence").

Most definitely...
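
Presumably the same approach would apply: let the lexer accept a backslash followed by any character, then check afterwards. A minimal sketch in Java, assuming the string body arrives without its surrounding quotes (EscapeCheck and firstError are hypothetical names, and octal, hex and universal-character escapes are only recognised by their first character, not validated further):

```java
final class EscapeCheck {
    // Characters allowed right after a backslash. The digits, 'x', 'u' and 'U'
    // start longer escapes that this sketch deliberately does not validate.
    static final String ACCEPTED = "abfnrtv\\'\"?01234567xuU";

    // Hypothetical helper, not an ANTLR API. Returns null if every escape in
    // the string body is acceptable, otherwise a message for the first bad one.
    static String firstError(String body) {
        for (int i = 0; i < body.length(); i++) {
            if (body.charAt(i) != '\\') {
                continue;
            }
            if (i + 1 >= body.length()) {
                return "unterminated escape sequence at offset " + i;
            }
            char c = body.charAt(i + 1);
            if (ACCEPTED.indexOf(c) < 0) {
                return "illegal escape sequence '\\" + c + "' at offset " + i;
            }
            i++; // skip the escaped character so "\\\\" counts as one escape
        }
        return null;
    }
}
```

That way "\z" produces an "illegal escape sequence" message instead of a complaint about an unexpected 'z'.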

Cheers,
Wincent




