[antlr-interest] big lexer problem

Thu Aug 16 12:00:13 PDT 2012

Thanks for pointing me to island grammars :)

As for "call a function to consume tokens until a whitespace (might be
off channel)", I guess this can be achieved with the help of LA() and
consume(). I just don't have the experience to know how to make a token
from the captured PIC string while keeping the token stream and lexer
objects in a consistent state. (For example, will the token index leave
a gap? Backtracking to an earlier point could then be a problem.)
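
A minimal sketch of the adjacency check behind that idea, in plain Java with
no ANTLR dependency. The Tok class, the glueAdjacent helper, and the character
offsets are hypothetical stand-ins made up for illustration. In ANTLR 3 the
same test could presumably be done with CommonToken's getStartIndex() and
getStopIndex(), but that is an assumption and has not been checked against the
grammar in this thread. The sketch only shows how to detect that tokens touch
each other; it does not create a new token or answer the token index question.

```java
import java.util.ArrayList;
import java.util.List;

class PicGlue {
    /** Hypothetical stand-in for a lexer token: text plus inclusive start/stop offsets. */
    static final class Tok {
        final String text;
        final int start; // offset of the first character in the input
        final int stop;  // offset of the last character in the input (inclusive)

        Tok(String text, int start, int stop) {
            this.text = text;
            this.start = start;
            this.stop = stop;
        }
    }

    /**
     * Starting at index 'from', concatenates token texts for as long as each
     * token begins exactly one character after the previous one ended, i.e.
     * nothing (such as off-channel whitespace) lies between them.
     * consumed[0] receives the number of tokens that were glued together.
     */
    static String glueAdjacent(List<Tok> tokens, int from, int[] consumed) {
        StringBuilder sb = new StringBuilder();
        int i = from;
        int prevStop = -1;
        while (i < tokens.size()) {
            Tok t = tokens.get(i);
            // A gap in the character offsets means something was skipped: stop here.
            if (i > from && t.start != prevStop + 1) {
                break;
            }
            sb.append(t.text);
            prevStop = t.stop;
            i++;
        }
        consumed[0] = i - from;
        return sb.toString();
    }

    public static void main(String[] args) {
        // Tokens that follow PIC in the input "PIC $AX(9).99 VALUE"
        List<Tok> toks = new ArrayList<>();
        toks.add(new Tok("$", 4, 4));
        toks.add(new Tok("AX", 5, 6));
        toks.add(new Tok("(", 7, 7));
        toks.add(new Tok("9", 8, 8));
        toks.add(new Tok(")", 9, 9));
        toks.add(new Tok(".99", 10, 12));
        toks.add(new Tok("VALUE", 14, 18));

        int[] consumed = new int[1];
        String pic = glueAdjacent(toks, 0, consumed);
        System.out.println(pic + " (" + consumed[0] + " tokens)");
    }
}
```

Comparing offsets avoids having to look at the hidden channel at all: the gap
between ".99" and "VALUE" is enough to end the PIC string. The glued text can
then be validated as a PIC spec in a later pass.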

2012/8/17 Jim Idle <jimi at temporal-wave.com>

> Look for the “island grammar” example in the downloadable tar of example
> grammars. That search term should give you a few examples too. Island
> grammars are useful when the language change is detectable purely in the
> lexer.
>
>
>
> Creating a single PIC token is just fine, but you can also leave PIC on its
> own, then in the parser:
>
>
>
> … PIC { call a function to consume tokens until a whitespace (might be off
> channel) } …
>
>
>
> I think that your single token is probably the ‘correct’ way in this case,
> but sometimes the parser solution works better (when the lexer cannot
> handle such a token on its own).
>
>
>
> Jim
>
>
>
> *From:* Zhaohui Yang [mailto:yezonghui at gmail.com]
> *Sent:* Thursday, August 16, 2012 8:13 AM
> *To:* Jim Idle
> *Cc:* antlr-interest at antlr.org
> *Subject:* Re: [antlr-interest] big lexer problem
>
>
>
> Ah, I guess I got the idea of not doing semantic analysis in the lexer.
> We're now defining the sequence "PIC xxxx-any-thing-without-white-space-xxxx"
> as a single token. That totally removed the need for PICTURE_STATE.
>
> Would you please point me to some guide or reference on embedding
> lexers/parsers in ANTLR v3? I guess we still need that for embedded SQL,
> embedded CICS, etc.
>
> 2012/8/16 Jim Idle <jimi at temporal-wave.com>
>
> You can use embedded lexers/parsers if you like. I have done that a bunch
> of times for similar issues.
>
>
>
> However, I think you are overcomplicating the PIC thing. Just read all the
> tokens and concatenate their contents until you hit whitespace, then verify
> the PIC afterwards. Your users will love your error messages.
>
>
>
> Jim
>
> On Aug 15, 2012, at 8:40 PM, Zhaohui Yang <yezonghui at gmail.com> wrote:
>
> I admit that my grammar was not well designed in the first place. And I'm
> working on it.
>
>
>
> However, lexer state is not that evil a thing anyway. At least it
> simplifies things conceptually. Take this PICTURE string example: if I
> use a parser rule pic_string to capture it, I'll have to anticipate every
> combination of tokens/parser rules that could make up a pic_string. For
> example, "$AX(9).99" would be a "$", an array(index) expression, and a
> decimal number starting with a dot. That could be frustrating enough.
>
>
>
> Well, I'm still trying to modify the lexer so that the pic_string can be
> a combination of simple tokens. One question is: how do I ensure that
> these tokens have no spaces between them?
>
>
>
> Back to the lexer state thing. I found that ANTLR 2.7 has a
> TokenStreamSelector for exactly this purpose. And it can result in smaller
> lexer classes, since each lexer cares for its own DFA, without polluting
> the others.
>
> I would really like to see this TokenStreamSelector in ANTLR 3. Really! :(
>
> 2012/8/16 Jim Idle <jimi at temporal-wave.com>
>
> This really means that your lexer is too complicated and I suspect that
> you are just trying to type in a grammar from a normative spec without
> thinking ahead a little (not trying to insult you here). The specs are
> usually designed to explain the language/syntax, not necessarily to be
> copied straight into a parser grammar.
>
> You should really post your grammar files to get better help, but
> generally you are trying to introduce context/state into the lexer, which
> is not necessary in all but a few cases. For instance, why do you care
> about the token type in the lexer if the same pattern is used for two
> token types? Take a token that matches a PIC pattern generally, then
> verify that the pattern is a good PIC spec when you are walking the tree,
> not in the lexer.
>
> On top of this, if you are trying to drive the lexer state from the
> parser, then it is very unlikely it will work anyway.
>
> Try to take a step back, and reduce the number of tokens to a minimum,
> remove any state that you can, move all the error checking and validation
> as far away from the lexer as you can (at the lexer level you have a
> minimum context, at the tree walk level you have much more information and
> can issue much better errors/warnings).
>
> Next, you don't need a 'fix' for ANTLR. You will find that as you simplify
> the grammar and spend time on left factoring the rules, that all/a lot of
> your problems will go away. If you still have issues with generated code
> size at that point, then you need to start importing grammars and
> debugging remotely (do not use the interpreter in ANTLRWorks anyway), not
> trying to change the output of ANTLR. The only time I have had to use
> imports is for a full TSQL grammar, which is huge because SQL is so
> terrible. COBOL is pretty big, but nothing like SQL.
>
>
> Jim
>
> > -----Original Message-----
> > From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> > bounces at antlr.org] On Behalf Of Zhaohui Yang
> > Sent: Wednesday, August 15, 2012 8:18 AM
> > To: antlr-interest at antlr.org
> > Subject: [antlr-interest] big lexer problem
> >
>
> > Hi,
> >
> > I'm having a big problem with a big generated Lexer.java. Any help is
> > appreciated.
> >
> > The language is COBOL. And I found multiple reasons that the lexer
> > gets too big:
> >
> > 1. I'm adding semantic predicates to the lexer to simulate "lexer
> > state" as in YACC and JavaCC. It's like
> >
> >        PICTURE_STRING: {lexerState==PICTURE_STATE}?=> blah blah
> >        // matching things like AXX(9).99 after a 'PIC' keyword
> >
> >    The lexer without semantic predicates is 18K lines.
> >    When I add predicates to one or two of the lexer rules, it grows to
> > more than 20K.
> >    When I add one more, it explodes to more than 60K and ANTLR
> > gives up generating the lexer with the error: code is too long.
> >
> > 2. COBOL has a LOT of keywords, which may explain the original 18K
> > lines.
> >
> > 3. I have tokens referencing other tokens.
> >    I've inlined most of them now, as suggested by others. But the size
> > has not reduced much.
> >
> > So the question could be:
> > 1. How can I generate a smaller lexer without removing the semantic
> > predicates?
> > 2. If that's not possible, how can I simulate "lexer state" without
> > semantic predicates?
> > 3. Any other solution?
> >
> > Thanks.
> >
> > --
> > Regards,
> >
> > Yang, Zhaohui
> >
>
> > List: http://www.antlr.org/mailman/listinfo/antlr-interest
> > Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-
> > email-address

-- 
Regards,

Yang, Zhaohui