[antlr-interest] big lexer problem

Wed Aug 15 20:52:46 PDT 2012

You can use embedded lexers/parsers if you like. I have done that a bunch of times for similar issues. 

However, I think you are overcomplicating the PIC thing. Just read all the tokens and concatenate their contents until you hit whitespace, then verify the PIC string afterwards. Your users will love the error messages you can give them that way.
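
A minimal sketch of that "concatenate, then verify" idea in Java. Everything named here is hypothetical: Tok is a stand-in for a real lexer token (in ANTLR 3 the offsets would presumably come from CommonToken's getStartIndex()/getStopIndex()), and the PIC pattern is a deliberately simplified approximation of the PICTURE alphabet, not the full COBOL rules.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class PicStringCheck {

    /** Hypothetical stand-in for a lexer token: its text plus inclusive character offsets. */
    static final class Tok {
        final String text;
        final int start;
        final int stop;

        Tok(String text, int start, int stop) {
            this.text = text;
            this.start = start;
            this.stop = stop;
        }
    }

    // Assumed, simplified PICTURE alphabet: one symbol (or CR/DB), optionally
    // followed by a repeat count such as (9). Not the full COBOL rules.
    private static final Pattern PIC = Pattern.compile(
            "(?i)(?:(?:CR|DB|[ABEGNPSVXZ9$0/,.+*-])(?:\\(\\d+\\))?)+");

    /** Concatenates tokens starting at 'from' until a gap in offsets (skipped whitespace). */
    static String joinAdjacent(List<Tok> toks, int from) {
        StringBuilder sb = new StringBuilder(toks.get(from).text);
        for (int i = from + 1; i < toks.size(); i++) {
            if (toks.get(i).start != toks.get(i - 1).stop + 1) {
                break; // offsets are not contiguous, so whitespace separated the tokens
            }
            sb.append(toks.get(i).text);
        }
        return sb.toString();
    }

    /** Checks the joined text after lexing, where a clear error message can be produced. */
    static boolean isValidPic(String s) {
        return PIC.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Tokens a simple lexer might produce for the input "$AX(9).99 VALUE".
        List<Tok> toks = Arrays.asList(
                new Tok("$", 0, 0), new Tok("AX", 1, 2), new Tok("(", 3, 3),
                new Tok("9", 4, 4), new Tok(")", 5, 5), new Tok(".99", 6, 8),
                new Tok("VALUE", 10, 14));
        String pic = joinAdjacent(toks, 0);
        System.out.println(pic + " valid=" + isValidPic(pic));
    }
}
```

Comparing character offsets also covers the "no spaces between the tokens" question: the lexer can keep skipping whitespace as usual, and the adjacency test happens afterwards, where a failed check can name the offending PIC string.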

Jim 

On Aug 15, 2012, at 8:40 PM, Zhaohui Yang <yezonghui at gmail.com> wrote:

> I admit that my grammar was not well designed in the first place. And I'm working on it.
>  
> However, lexer state is not that evil a thing anyway. At least it simplifies things conceptually. As for this example of the PICTURE string, if I use a parser rule pic_string to capture it, I'll have to imagine all the kinds of tokens/parser rules that may combine into a pic_string. For example, "$AX(9).99" would be a "$", an array(index) expression, and a decimal number starting with a dot. This could be frustrating enough.
>  
> Well, I'm still trying to modify the lexer so that the pic_string could be a combination of simple tokens. One question is: how do I ensure these tokens do not have spaces between them?
>  
> Back to the lexer state thing: I found that ANTLR 2.7 has a TokenStreamSelector for exactly this purpose. It can also result in smaller lexer classes, since each lexer handles only its own DFA and they do not pollute each other.
>  
> I would really like to see this TokenStreamSelector in ANTLR 3. Really! :(
> 
> 2012/8/16 Jim Idle <jimi at temporal-wave.com>
>> This really means that your lexer is too complicated and I suspect that
>> you are just trying to type in a grammar from a normative spec without
>> thinking ahead a little (not trying to insult you here). The specs are
>> usually designed to explain the language/syntax, not necessarily to be
>> copied straight into a parser grammar.
>> 
>> You should really post your grammar files to get better help, but
>> generally you are trying to introduce context/state into the lexer, which
>> is not necessary in all but a few cases. For instance, why do you care
>> about the token type in the lexer if the same pattern is used for two
>> token types? Take a token that matches a PIC pattern generally, then
>> verify that the pattern is a good PIC spec when you are walking the tree,
>> not in the lexer.
>> 
>> On top of this, if you are trying to drive the lexer state from the
>> parser, then it is very unlikely it will work anyway.
>> 
>> Try to take a step back, and reduce the number of tokens to a minimum,
>> remove any state that you can, move all the error checking and validation
>> as far away from the lexer as you can (at the lexer level you have a
>> minimum context, at the tree walk level you have much more information and
>> can issue much better errors/warnings).
>> 
>> Next, you don't need a 'fix' for ANTLR. You will find that as you simplify
>> the grammar and spend time on left factoring the rules, that all/a lot of
>> your problems will go away. If you still have issues with generated code
>> size at that point, then you need to start importing grammars and
>> debugging remotely (do not use the interpreter in ANTLRWorks anyway), not
>> trying to change the output of ANTLR. The only time I have had to use
>> imports is for a full TSQL grammar, which is huge because SQL is so
>> terrible. COBOL is pretty big, but nothing like SQL.
>> 
>> 
>> Jim
>> 
>> 
>> 
>> 
>> 
>> 
>> > -----Original Message-----
>> > From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
>> > bounces at antlr.org] On Behalf Of Zhaohui Yang
>> > Sent: Wednesday, August 15, 2012 8:18 AM
>> > To: antlr-interest at antlr.org
>> > Subject: [antlr-interest] big lexer problem
>> >
>> > Hi,
>> >
>> > I'm having a big problem with a big generated Lexer.java. Any help is
>> > appreciated.
>> >
>> > The language is COBOL. And I found multiple reasons that the lexer
>> > gets too big:
>> >
>> > 1. I'm adding semantic predicates to the lexer, to simulate "lexer
>> > state" as in YACC and JavaCC. It's like
>> >
>> >        // matching things like AXX(9).99 after a 'PIC' key word
>> >        PICTURE_STRING: {lexerState==PICTURE_STATE}?=> blah blah
>> >
>> >    The lexer without semantic predicates is 18K lines.
>> >    When I add predicates to one or two of the lexer rules, it grows to
>> > more than 20K.
>> >    When I add one more, it explodes to more than 60K and ANTLR
>> > gives up generating the lexer with the error: code is too long.
>> >
>> > 2. COBOL has a LOT of key words, which may explain the original 18K
>> > lines.
>> >
>> > 3. I have tokens referencing other tokens.
>> >    I've inlined most of them now, as suggested by others. But the size
>> > has not reduced much.
>> >
>> > So the question could be:
>> > 1. How can I generate a smaller lexer without removing the semantic
>> > predicates?
>> > 2. If that's not possible, how can I simulate "lexer state" without
>> > semantic predicates?
>> > 3. Any other solution?
>> >
>> > Thanks.
>> >
>> > --
>> > Regards,
>> >
>> > Yang, Zhaohui
>> >
>> > List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> > Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-
>> > email-address
>> 
> 
> 
> 
> -- 
> Regards,
> 
> Yang, Zhaohui
>