[antlr-interest] big lexer problem

Zhaohui Yang yezonghui at gmail.com
Wed Aug 15 20:43:26 PDT 2012


And I'm sorry I can't provide the grammar source at the moment. I'm
waiting for permission from my company.

2012/8/16 Zhaohui Yang <yezonghui at gmail.com>

> I admit that my grammar was not well designed in the first place. And I'm
> working on it.
>
> However, lexer state is not that evil a thing anyway. At least it simplifies
> things conceptually. As for this example of the PICTURE string, if I use a
> parser rule pic_string to capture it, I'll have to imagine all the kinds of
> tokens/parser rules that may combine into a pic_string. For example,
> "$AX(9).99" would be a "$", an array(index) expression, and a decimal
> number starting with a dot. That could be frustrating enough.
>
> Well, I'm still trying to modify the lexer so that the pic_string could be
> a combination of simple tokens. One question is: how do I ensure these
> tokens do not have spaces between them?
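
One possible way to check for "no spaces between tokens" is to compare
character offsets of neighbouring tokens after lexing. The sketch below is
only an illustration under assumptions: Tok is a hypothetical stand-in class,
not an ANTLR type. In ANTLR 3 the same offsets should be available from
CommonToken.getStartIndex() and getStopIndex(), provided whitespace is skipped
or sent to a hidden channel rather than emitted as ordinary tokens.

```java
import java.util.Arrays;
import java.util.List;

public class AdjacentTokens {
    // Hypothetical stand-in for a lexer token; only the offsets matter here.
    static final class Tok {
        final String text;
        final int start; // offset of the first character in the input
        final int stop;  // offset of the last character (inclusive)

        Tok(String text, int start, int stop) {
            this.text = text;
            this.start = start;
            this.stop = stop;
        }
    }

    // True when each token begins exactly one character after the previous
    // token ends, i.e. nothing (such as a space) was skipped in between.
    static boolean areContiguous(List<Tok> toks) {
        for (int i = 1; i < toks.size(); i++) {
            if (toks.get(i).start != toks.get(i - 1).stop + 1) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "$AX(9).99" split into simple tokens, no gaps.
        List<Tok> tight = Arrays.asList(
            new Tok("$", 0, 0), new Tok("AX", 1, 2), new Tok("(", 3, 3),
            new Tok("9", 4, 4), new Tok(")", 5, 5), new Tok(".99", 6, 8));
        // "$ AX": a space was skipped between the two tokens.
        List<Tok> spaced = Arrays.asList(
            new Tok("$", 0, 0), new Tok("AX", 2, 3));

        System.out.println(areContiguous(tight));
        System.out.println(areContiguous(spaced));
    }
}
```

A check like this could be called from a parser action or a later pass over
the token stream, so the lexer itself needs no extra state.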
>
> Back to the lexer state thing. I found that ANTLR 2.7 has a
> TokenStreamSelector for exactly this purpose. And it can result in smaller
> lexer classes since each lexer cares for its own DFA, without polluting
> the others.
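
To make the idea concrete, here is a rough, self-contained sketch of
switching between two small sub-lexers with a stack. It is not ANTLR's
TokenStreamSelector API; SubLexer, regexLexer and tokenize are hypothetical
names, and the regular expressions are simplified assumptions about COBOL
rather than a real token set.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SelectorSketch {
    // A sub-lexer tries to match one token at pos; returns its text or null.
    interface SubLexer {
        String match(String input, int pos);
    }

    static SubLexer regexLexer(String regex) {
        final Pattern p = Pattern.compile(regex);
        return (input, pos) -> {
            Matcher m = p.matcher(input);
            m.region(pos, input.length());
            return m.lookingAt() ? m.group() : null;
        };
    }

    static List<String> tokenize(String input) {
        // Main lexer: words, integers, and the sentence-ending period.
        SubLexer main = regexLexer("[A-Za-z][A-Za-z0-9-]*|[0-9]+|\\.");
        // Picture lexer: a run of non-space characters not ending in a period,
        // so the sentence terminator is left for the main lexer.
        SubLexer picture = regexLexer("\\S*[^\\s.]");

        Deque<SubLexer> stack = new ArrayDeque<>();
        stack.push(main);
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++; // whitespace is skipped by every sub-lexer
                continue;
            }
            SubLexer current = stack.peek();
            String text = current.match(input, pos);
            if (text == null) {
                throw new IllegalArgumentException("no token at offset " + pos);
            }
            out.add(text);
            pos += text.length();
            if (current == picture) {
                stack.pop(); // one picture string consumed, back to main
            } else if (text.equalsIgnoreCase("PIC") || text.equalsIgnoreCase("PICTURE")) {
                stack.push(picture); // the next token is a picture string
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [05, AMOUNT, PIC, $AX(9).99, .]
        System.out.println(tokenize("05 AMOUNT PIC $AX(9).99."));
    }
}
```

Each sub-lexer only knows its own patterns, which is the property that keeps
the individual lexers small.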
>
> I'd really like to see this TokenStreamSelector in ANTLR 3. Really! :(
>
>  2012/8/16 Jim Idle <jimi at temporal-wave.com>
>
>> This really means that your lexer is too complicated and I suspect that
>> you are just trying to type in a grammar from a normative spec without
>> thinking ahead a little (not trying to insult you here). The specs are
>> usually designed to explain the language/syntax, not necessarily to be
>> copied straight into a parser grammar.
>>
>> You should really post your grammar files to get better help, but
>> generally you are trying to introduce context/state into the lexer, which
>> is not necessary in all but a few cases. For instance, why do you care
>> about the token type in the lexer if the same pattern is used for two
>> token types? Take a token that matches a PIC pattern generally, then
>> verify that the pattern is a good PIC spec when you are walking the tree,
>> not in the lexer.
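
A minimal sketch of that "match loosely, validate later" idea, assuming the
lexer hands over the raw text of a PIC token and a later stage (a tree walker
or parser action) calls the check. PicCheck and validate are hypothetical
names, and the symbol set is a deliberately simplified assumption; real COBOL
PICTURE rules are stricter about ordering and combinations.

```java
import java.util.ArrayList;
import java.util.List;

public class PicCheck {
    // Simplified, illustrative symbol set. CR and DB are only checked
    // character by character here, not as two-letter units.
    private static final String SYMBOLS = "ABEGNPSVXZ90/,.+-*$CRD";

    // Returns a list of human-readable problems; empty means "looks fine".
    static List<String> validate(String pic) {
        List<String> errors = new ArrayList<>();
        int i = 0;
        while (i < pic.length()) {
            char c = Character.toUpperCase(pic.charAt(i));
            if (c == '(') {
                if (i == 0) {
                    errors.add("repeat count at offset 0 has no symbol to repeat");
                }
                int close = pic.indexOf(')', i);
                if (close < 0) {
                    errors.add("unclosed '(' at offset " + i);
                    break;
                }
                String count = pic.substring(i + 1, close);
                if (!count.matches("[0-9]+")) {
                    errors.add("bad repeat count '" + count + "' at offset " + i);
                }
                i = close + 1;
            } else {
                if (SYMBOLS.indexOf(c) < 0) {
                    errors.add("unexpected character '" + pic.charAt(i)
                            + "' at offset " + i);
                }
                i++;
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(validate("$AX(9).99")); // prints []
        System.out.println(validate("AX(9"));      // one error: unclosed '('
    }
}
```

Because the check runs after lexing, it can report the offset and the reason
for each problem instead of a generic "no viable alternative" from the lexer.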
>>
>> On top of this, if you are trying to drive the lexer state from the
>> parser, then it is very unlikely it will work anyway.
>>
>> Try to take a step back, and reduce the number of tokens to a minimum,
>> remove any state that you can, move all the error checking and validation
>> as far away from the lexer as you can (at the lexer level you have a
>> minimum context, at the tree walk level you have much more information and
>> can issue much better errors/warnings).
>>
>> Next, you don't need a 'fix' for ANTLR. You will find that as you simplify
>> the grammar and spend time on left factoring the rules, that all/a lot of
>> your problems will go away. If you still have issues with generated code
>> size at that point, then you need to start importing grammars and
>> debugging remotely (do not use the interpreter in ANTLRWorks anyway), not
>> trying to change the output of ANTLR. The only time I have had to use
>> imports is for a full TSQL grammar, which is huge because SQL is so
>> terrible. COBOL is pretty big, but nothing like SQL.
>>
>>
>> Jim
>>
>> > -----Original Message-----
>> > From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
>> > bounces at antlr.org] On Behalf Of Zhaohui Yang
>> > Sent: Wednesday, August 15, 2012 8:18 AM
>> > To: antlr-interest at antlr.org
>> > Subject: [antlr-interest] big lexer problem
>> >
>> > Hi,
>> >
>> > I'm having a big problem with a big generated Lexer.java. Any help
>> > appreciated.
>> >
>> > The language is COBOL. And I found multiple reasons that the lexer
>> > gets too big:
>> >
>> > 1. I'm adding semantic predicates to the lexer, to simulate "lexer
>> > state" as in YACC and JavaCC. It's like
>> >
>> >        // matching things like AXX(9).99 after a 'PIC' keyword
>> >        PICTURE_STRING: {lexerState==PICTURE_STATE}?=> blah blah
>> >
>> >    The lexer without semantic predicates is 18K lines.
>> >    When I add predicates to one or two of the lexer rules, it grows to
>> > more than 20K.
>> >    When I add a single one more, it explodes to more than 60K and ANTLR
>> > gives up generating the lexer with the error: code is too long.
>> >
>> > 2. COBOL has a LOT of keywords, which may explain the original 18K
>> > lines.
>> >
>> > 3. I have tokens referencing other tokens.
>> >    I've inlined most of them now, as suggested by others. But the size
>> > has not reduced much.
>> >
>> > So the question could be:
>> > 1. How can I generate a smaller lexer without removing the semantic
>> > predicates?
>> > 2. If that's not possible, how can I simulate "lexer state" without
>> > semantic predicates?
>> > 3. Any other solution?
>> >
>> > Thanks.
>> >
>> > --
>> > Regards,
>> >
>> > Yang, Zhaohui
>> >
>> > List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> > Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>>
>>
>
>
>
> --
> Regards,
>
> Yang, Zhaohui
>
>


-- 
Regards,

Yang, Zhaohui

