[antlr-interest] big lexer problem

Jim Idle jimi at temporal-wave.com
Wed Aug 15 20:56:14 PDT 2012


That's ok, commercial projects must be careful. As much as anything, such grammars may generate some ideas for v4. 

Jim

On Aug 15, 2012, at 8:43 PM, Zhaohui Yang <yezonghui at gmail.com> wrote:

> And I'm sorry I can't provide the grammar source for the moment. I'm waiting for permission from my company.
> 
> 2012/8/16 Zhaohui Yang <yezonghui at gmail.com>
>> I admit that my grammar was not well designed in the first place, and I'm working on it.
>>  
>> However, lexer state is not that evil a thing anyway. At least it simplifies things conceptually. As for this example of a PICTURE string: if I use a parser rule pic_string to capture it, I'll have to imagine all the kinds of tokens/parser rules that may combine into a pic_string. For example, "$AX(9).99" would be a "$", an array(index) expression, and a decimal number starting with a dot. That could be frustrating enough.
>>  
>> Well, I'm still trying to modify the lexer so that the pic_string can be a combination of simple tokens. One question: how do I ensure these tokens do not have spaces between them?
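>>
>> One way to check that (a sketch with invented names, not code from this thread -- it only assumes the parser can see each token's character start/stop indices, which ANTLR 3's CommonToken exposes via getStartIndex()/getStopIndex()): accept the token sequence only when each token begins exactly one character after the previous one ends.

```java
import java.util.List;

// Hypothetical stand-in for ANTLR's CommonToken: just the character bounds.
final class Tok {
    final String text;
    final int start; // index of the token's first character in the input
    final int stop;  // index of the token's last character in the input
    Tok(String text, int start, int stop) {
        this.text = text; this.start = start; this.stop = stop;
    }
}

public class PicAdjacency {
    // True when every token starts exactly one character after the previous
    // token stops, i.e. no whitespace was skipped between them.
    static boolean adjacent(List<Tok> toks) {
        for (int i = 1; i < toks.size(); i++) {
            if (toks.get(i).start != toks.get(i - 1).stop + 1) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // "$AX(9).99" lexed as simple tokens with no gaps
        List<Tok> glued = List.of(new Tok("$", 0, 0),
                new Tok("AX(9)", 1, 5), new Tok(".99", 6, 8));
        // "$ AX(9).99" -- a space between '$' and the rest
        List<Tok> spaced = List.of(new Tok("$", 0, 0),
                new Tok("AX(9)", 2, 6), new Tok(".99", 7, 9));
        System.out.println(adjacent(glued));  // true
        System.out.println(adjacent(spaced)); // false
    }
}
```

>> In an ANTLR 3 grammar the same comparison could sit in a validating semantic predicate or action on the pic_string rule, since the token objects there already carry these indices.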
>>  
>> Back to the lexer state thing. I found that ANTLR 2.7 has a TokenStreamSelector for exactly this purpose. It can also result in smaller lexer classes, since each lexer carries only its own DFA instead of polluting the others.
>>  
>> I would really like to see this TokenStreamSelector in ANTLR 3. Really! :(
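>>
>> The selector idea itself is easy to mirror by hand (a sketch with made-up names, not ANTLR API -- a real version would delegate to generated ANTLR 3 lexers through their common TokenSource interface and return Token objects rather than strings):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal stand-in for a lexer: something that produces the next token.
// A real version would wrap generated ANTLR 3 lexers.
interface TokenProducer {
    String nextToken();
}

// Routes nextToken() to whichever lexer is currently active -- the same
// idea as ANTLR 2.7's TokenStreamSelector. All names here are invented.
public class Selector implements TokenProducer {
    private final Deque<TokenProducer> stack = new ArrayDeque<>();

    public Selector(TokenProducer initial) { stack.push(initial); }

    // Switch every subsequent nextToken() call to another lexer.
    public void push(TokenProducer lexer) { stack.push(lexer); }
    public void pop() { stack.pop(); }

    @Override public String nextToken() { return stack.peek().nextToken(); }

    public static void main(String[] args) {
        TokenProducer normal  = () -> "NORMAL_TOKEN";
        TokenProducer picture = () -> "PICTURE_STRING";
        Selector sel = new Selector(normal);
        System.out.println(sel.nextToken()); // NORMAL_TOKEN
        sel.push(picture);                   // e.g. after seeing the PIC keyword
        System.out.println(sel.nextToken()); // PICTURE_STRING
        sel.pop();                           // back to the normal lexer
        System.out.println(sel.nextToken()); // NORMAL_TOKEN
    }
}
```

>> Each mode-specific lexer then stays small, because its DFA only covers the tokens of its own mode.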
>> 
>> 2012/8/16 Jim Idle <jimi at temporal-wave.com>
>>> This really means that your lexer is too complicated, and I suspect that
>>> you are just trying to type in a grammar from a normative spec without
>>> thinking ahead a little (not trying to insult you here). The specs are
>>> usually designed to explain the language/syntax, not necessarily to be
>>> copied straight into a parser grammar.
>>> 
>>> You should really post your grammar files to get better help, but
>>> generally you are trying to introduce context/state into the lexer, which
>>> is unnecessary in all but a few cases. For instance, why do you care
>>> about the token type in the lexer if the same pattern is used for two
>>> token types? Take a token that matches a PIC pattern generally, then
>>> verify that the pattern is a good PIC spec when you are walking the tree,
>>> not in the lexer.
>>> 
>>> On top of this, if you are trying to drive the lexer state from the
>>> parser, then it is very unlikely it will work anyway.
>>> 
>>> Try to take a step back: reduce the number of tokens to a minimum,
>>> remove any state that you can, and move all the error checking and
>>> validation as far away from the lexer as you can (at the lexer level you
>>> have minimal context; at the tree-walk level you have much more
>>> information and can issue much better errors/warnings).
>>> 
>>> Next, you don't need a 'fix' for ANTLR. You will find that, as you
>>> simplify the grammar and spend time left-factoring the rules, most of
>>> your problems will go away. If you still have issues with generated code
>>> size at that point, then you need to start importing grammars and
>>> debugging remotely (do not use the interpreter in ANTLRWorks anyway), not
>>> trying to change the output of ANTLR. The only time I have had to use
>>> imports is for a full TSQL grammar, which is huge because SQL is so
>>> terrible. COBOL is pretty big, but nothing like SQL.
>>> 
>>> 
>>> Jim
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> > -----Original Message-----
>>> > From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
>>> > bounces at antlr.org] On Behalf Of Zhaohui Yang
>>> > Sent: Wednesday, August 15, 2012 8:18 AM
>>> > To: antlr-interest at antlr.org
>>> > Subject: [antlr-interest] big lexer problem
>>> >
>>> > Hi,
>>> >
>>> > I'm having big problem with big generated Lexer.java. Any help
>>> > appreciated.
>>> >
>>> > The language is COBOL, and I have found multiple reasons why the
>>> > generated lexer gets too big:
>>> >
>>> > 1. I'm adding semantic predicate into the lexer, to simulate "lexer
>>> > state"
>>> > as in YACC and JavaCC. It's like
>>> >
>>> >        PICTURE_STRING: {lexerState==PICTURE_STATE}?=> blah blah //
>>> > matching things like AXX(9).99 after a 'PIC' keyword
>>> >
>>> >    The lexer without semantic predicates is 18K lines.
>>> >    When I add predicates to one or two of the lexer rules, it grows to
>>> > more than 20K lines.
>>> >    When I add just one more, it explodes to more than 60K, and ANTLR
>>> > gives up generating the lexer with an error: code is too long.
>>> >
>>> > 2. COBOL has a LOT of keywords, which may explain the original 18K
>>> > lines.
>>> >
>>> > 3. I have tokens referencing other tokens.
>>> >    I've inlined most of them now, as others suggested, but the size
>>> > has not dropped much.
>>> >
>>> > So the questions are:
>>> > 1. How do I generate a smaller lexer without removing the semantic
>>> > predicates?
>>> > 2. If that's not possible, how do I simulate "lexer state" without
>>> > semantic predicates?
>>> > 3. Any other solution?
>>> >
>>> > Thanks.
>>> >
>>> > --
>>> > Regards,
>>> >
>>> > Yang, Zhaohui
>>> >
>>> > List: http://www.antlr.org/mailman/listinfo/antlr-interest
>>> > Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-
>>> > email-address
>>> 
>> 
>> 
>> 
>> -- 
>> Regards,
>> 
>> Yang, Zhaohui
> 
> 
> 
> -- 
> Regards,
> 
> Yang, Zhaohui
> 


More information about the antlr-interest mailing list