[antlr-interest] big lexer problem

Wed Aug 15 20:40:53 PDT 2012

I admit that my grammar was not well designed in the first place. And I'm
working on it.

However, lexer state is not such an evil thing. At the very least it
simplifies things conceptually. Take the PICTURE string example: if I use a
parser rule pic_string to capture it, I have to anticipate every combination
of tokens and parser rules that could make up a pic_string. For example,
"$AX(9).99" would be a "$", an array(index) expression, and a decimal
number starting with a dot. That gets frustrating quickly.
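
For comparison, the alternative raised in the quoted reply below (match one
loose token in the lexer, then validate it later) might look roughly like
this in Java. It is only a sketch: PicCheck and looksLikePic are hypothetical
names, and the pattern is a simplified assumption about picture symbols, not
the full COBOL PICTURE rules (CR/DB and symbol ordering are ignored).

```java
import java.util.regex.Pattern;

final class PicCheck {
    // Simplified assumption: a picture string is a run of picture symbols,
    // each optionally followed by a repeat count such as (9).
    // Real COBOL PICTURE rules are stricter than this.
    private static final Pattern LOOSE_PIC = Pattern.compile(
        "(?:[ABEGNPSVXZ90/,.+\\-*$](?:\\([0-9]+\\))?)+",
        Pattern.CASE_INSENSITIVE);

    // Meant to run after lexing, e.g. in a tree walk, on the token's text.
    static boolean looksLikePic(String text) {
        return LOOSE_PIC.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikePic("$AX(9).99")); // well-formed under the simplified rule
        System.out.println(looksLikePic("AX((9)"));    // malformed repeat count
    }
}
```

The lexer would then only need a single permissive rule for the text that
follows PIC, and the detailed error reporting would move to this later check.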

Still, I'm trying to modify the lexer so that a pic_string can be a
combination of simple tokens. One question: how do I ensure that these
tokens have no spaces between them?
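
One possible answer, as a sketch only: compare the character offsets of
neighbouring tokens and require that each one starts right after the
previous one ends. The Tok class below is a hypothetical stand-in; in
ANTLR 3 the same offsets should be available from
CommonToken.getStartIndex() and getStopIndex(), and the check could be
called from a predicate or action in the pic_string rule.

```java
import java.util.Arrays;
import java.util.List;

final class PicAdjacency {
    // Hypothetical minimal token; only the character offsets matter here.
    static final class Tok {
        final String text;
        final int start; // offset of the first character, inclusive
        final int stop;  // offset of the last character, inclusive

        Tok(String text, int start) {
            this.text = text;
            this.start = start;
            this.stop = start + text.length() - 1;
        }
    }

    // True when every token begins exactly one character after the previous one ends.
    static boolean contiguous(List<Tok> toks) {
        for (int i = 1; i < toks.size(); i++) {
            if (toks.get(i).start != toks.get(i - 1).stop + 1) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "$AX(9).99" split into simple tokens with no gaps.
        List<Tok> tight = Arrays.asList(
            new Tok("$", 0), new Tok("AX", 1), new Tok("(", 3),
            new Tok("9", 4), new Tok(")", 5), new Tok(".99", 6));
        // "$AX (9)" has a space at offset 3, so "(" starts at offset 4.
        List<Tok> loose = Arrays.asList(
            new Tok("$", 0), new Tok("AX", 1), new Tok("(", 4),
            new Tok("9", 5), new Tok(")", 6));
        System.out.println(contiguous(tight));
        System.out.println(contiguous(loose));
    }
}
```

Because the check uses character offsets, it should not matter whether
whitespace is skipped or sent to a hidden channel.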

Back to lexer state. I found that ANTLR 2.7 has a TokenStreamSelector for
exactly this purpose. It can also result in smaller lexer classes, since
each lexer has its own DFA and they do not pollute each other.
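
To illustrate the idea (not the actual ANTLR 2.7 API), here is a
self-contained Java sketch of a selector that hands out tokens from
whichever sub-lexer is currently selected. All names (SelectorSketch,
Selector, SubLexer, Input) are made up for this example, and the
"lexers" just split on spaces.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SelectorSketch {
    // Shared cursor, so each sub-lexer continues where the other stopped.
    static final class Input {
        final String text;
        int pos;
        Input(String text) { this.text = text; }
    }

    // Returns "TYPE:text", or null at end of input.
    interface SubLexer { String next(Input src, Selector owner); }

    static final class Selector {
        private final Map<String, SubLexer> lexers = new HashMap<>();
        private SubLexer current;
        void add(String name, SubLexer lexer) { lexers.put(name, lexer); }
        void select(String name) { current = lexers.get(name); }
        String next(Input src) { return current.next(src, this); }
    }

    // Reads the next run of non-space characters, or null if none is left.
    static String word(Input src) {
        while (src.pos < src.text.length() && src.text.charAt(src.pos) == ' ') src.pos++;
        int start = src.pos;
        while (src.pos < src.text.length() && src.text.charAt(src.pos) != ' ') src.pos++;
        return start == src.pos ? null : src.text.substring(start, src.pos);
    }

    static List<String> tokenize(String text) {
        Selector sel = new Selector();
        // Main lexer: switches to the picture lexer after the PIC keyword.
        sel.add("main", (src, owner) -> {
            String w = word(src);
            if (w == null) return null;
            if (w.equals("PIC")) { owner.select("pic"); return "PIC:" + w; }
            return "WORD:" + w;
        });
        // Picture lexer: takes one picture string, then switches back.
        sel.add("pic", (src, owner) -> {
            String w = word(src);
            owner.select("main");
            return w == null ? null : "PICTURE_STRING:" + w;
        });
        sel.select("main");

        Input in = new Input(text);
        List<String> out = new ArrayList<>();
        for (String t = sel.next(in); t != null; t = sel.next(in)) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("PRICE PIC $AX(9).99 COMP"));
    }
}
```

Each sub-lexer only knows its own token patterns, which is why the
generated classes would stay small: neither one has to carry the other's
decision logic.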

I would really like to see TokenStreamSelector in ANTLR 3. Really! :(

2012/8/16 Jim Idle <jimi at temporal-wave.com>

> This really means that your lexer is too complicated and I suspect that
> you are just trying to type in a grammar from a normative spec without
> thinking ahead a little (not trying to insult you here). The specs are
> usually designed to explain the language/syntax, not necessarily to be
> copied straight in to a parser grammar.
>
> You should really post your grammar files to get better help, but
> generally you are trying to introduce context/state in to the lexer, which
> is not necessary in all but a few cases. For instance, why do you care
> about the token type in the lexer if the same pattern is used for two
> token types? Take a token that matches a PIC pattern generally, then
> verify that the pattern is a good PIC spec when you are walking the tree,
> not in the lexer.
>
> On top of this, if you are trying to drive the lexer state from the
> parser, then it is very unlikely it will work anyway.
>
> Try to take a step back, and reduce the number of tokens to a minimum,
> remove any state that you can, move all the error checking and validation
> as far away from the lexer as you can (at the lexer level you have a
> minimum context, at the tree walk level you have much more information and
> can issue much better errors/warnings).
>
> Next, you don't need a 'fix' for ANTLR. You will find that as you simplify
> the grammar and spend time on left factoring the rules, that all/a lot of
> your problems will go away. If you still have issues with generated code
> size at that point, then you need to start importing grammars and
> debugging remotely (do not use the interpreter in ANTLRWorks anyway), not
> trying to change the output of ANTLR. The only time I have had to use
> imports is for a full TSQL grammar, which is huge because SQL is so
> terrible. COBOL is pretty big, but nothing like SQL.
>
>
> Jim
>
> > -----Original Message-----
> > From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> > bounces at antlr.org] On Behalf Of Zhaohui Yang
> > Sent: Wednesday, August 15, 2012 8:18 AM
> > To: antlr-interest at antlr.org
> > Subject: [antlr-interest] big lexer problem
> >
> > Hi,
> >
> > I'm having a big problem with a big generated Lexer.java. Any help
> > appreciated.
> >
> > The language is COBOL. I found multiple reasons why the lexer
> > gets too big:
> >
> > 1. I'm adding semantic predicates to the lexer to simulate "lexer
> > state" as in YACC and JavaCC. It looks like this:
> >
> >        // matches things like AXX(9).99 after a 'PIC' keyword
> >        PICTURE_STRING: {lexerState==PICTURE_STATE}?=> blah blah
> >
> >    The lexer without semantic predicates is 18K lines.
> >    When I add predicates to one or two of the lexer rules, it grows to
> > more than 20K.
> >    When I add one more, it explodes to more than 60K and ANTLR
> > gives up generating the lexer with the error: code is too long.
> >
> > 2. COBOL has a LOT of keywords, which may explain the original 18K
> > lines.
> >
> > 3. I have tokens referencing other tokens.
> >    I've inlined most of them now, as suggested by others. But the size
> > has not reduced much.
> >
> > So the question could be:
> > 1. how to generate smaller lexer without removing semantic predicate?
> > 2. If that's not possible, how to simulate "lexer state" without
> > semantic predicate?
> > 3. Any other solution?
> >
> > Thanks.
> >
> > --
> > Regards,
> >
> > Yang, Zhaohui
> >
> > List: http://www.antlr.org/mailman/listinfo/antlr-interest
> > Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-
> > email-address

-- 
Regards,

Yang, Zhaohui