[antlr-interest] big lexer problem

Thu Aug 16 09:10:04 PDT 2012

Look for the “island grammar” example in the downloadable tar of example
grammars. That search term should give you a few examples too. Island
grammars are useful when the language change is detectable purely in the
lexer.

Creating a single PIC token is just fine, but you can also leave PIC on its
own, then in the parser:

… PIC { call a function to consume tokens until a whitespace (might be off
channel) } …

I think that your single token is probably the ‘correct’ way in this case,
but sometimes the parser solution works better (when the lexer cannot
handle such a token on its own).
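
To make the parser-side option concrete, here is a minimal sketch in plain Java of the "consume tokens until a whitespace" function. It deliberately avoids the ANTLR runtime: Tok and PicCollector are hypothetical names invented for this illustration, and the whitespace flag stands in for "this token was sent off channel".

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a lexer token; not an ANTLR class. */
class Tok {
    final String text;
    final boolean whitespace; // true for tokens the lexer would put off channel

    Tok(String text, boolean whitespace) {
        this.text = text;
        this.whitespace = whitespace;
    }
}

class PicCollector {
    /**
     * Concatenates token text starting at index 'from' and stops at the
     * first whitespace token (or at the end of the list).
     */
    static String collect(List<Tok> tokens, int from) {
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < tokens.size(); i++) {
            Tok t = tokens.get(i);
            if (t.whitespace) {
                break; // whitespace ends the picture string
            }
            sb.append(t.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Tok> tokens = Arrays.asList(
            new Tok("PIC", false), new Tok(" ", true),
            new Tok("$", false), new Tok("AX", false), new Tok("(", false),
            new Tok("9", false), new Tok(")", false), new Tok(".99", false),
            new Tok(" ", true), new Tok("VALUE", false));
        // Index 2 is the first token after "PIC" and its trailing space.
        System.out.println(collect(tokens, 2));
    }
}
```

In a real ANTLR v3 parser the same loop would have to look at the raw token stream rather than the on-channel view, since the whitespace may be hidden from the parser.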

Jim

*From:* Zhaohui Yang [mailto:yezonghui at gmail.com]
*Sent:* Thursday, August 16, 2012 8:13 AM
*To:* Jim Idle
*Cc:* antlr-interest at antlr.org
*Subject:* Re: [antlr-interest] big lexer problem

Ah, I guess I got the idea of not doing semantic analysis in the lexer. We're
now defining the sequence "PIC xxxx-any-thing-without-white-space-xxxx" as
a single token. That totally removed the need for PICTURE_STATE.

Would you please point me to some guide or reference on embedding
lexers/parsers in ANTLR v3? I guess we still need that for embedded SQL
and embedded CICS, etc.

2012/8/16 Jim Idle <jimi at temporal-wave.com>

You can use embedded lexers/parsers if you like. I have done that a bunch
of times for similar issues.

However, I think you are overcomplicating the PIC thing. Just read all the
tokens and concatenate the contents until you hit whitespace. Then verify the
PIC afterwards. Your error messages will be loved by your users.
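
As an illustration of verifying the string after the fact, the sketch below checks an already-collected picture string and reports every problem with a column number. The symbol set and the rules are a deliberately simplified assumption, not the real COBOL PICTURE rules, and PicChecker is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

class PicChecker {
    // Assumed, incomplete symbol set, for illustration only.
    private static final String SYMBOLS = "9AXSVPZB0,./+-*$";

    /** Returns one message per problem found; an empty list means "looks fine". */
    static List<String> check(String pic) {
        List<String> errors = new ArrayList<String>();
        int i = 0;
        while (i < pic.length()) {
            char c = Character.toUpperCase(pic.charAt(i));
            if (c == '(') {
                int close = pic.indexOf(')', i);
                if (close < 0) {
                    errors.add("column " + (i + 1) + ": '(' is never closed");
                    break;
                }
                if (i == 0) {
                    errors.add("column 1: repeat count has no symbol before it");
                }
                String count = pic.substring(i + 1, close);
                if (!count.matches("[0-9]+")) {
                    errors.add("column " + (i + 2) + ": repeat count '" + count
                            + "' is not a number");
                }
                i = close + 1; // continue after the closing parenthesis
            } else {
                if (SYMBOLS.indexOf(c) < 0) {
                    errors.add("column " + (i + 1) + ": unexpected character '"
                            + pic.charAt(i) + "'");
                }
                i++;
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        for (String msg : check("X(Q)#")) {
            System.out.println(msg);
        }
    }
}
```

Because the check runs on the whole string, it can report all problems at once and in the user's terms, which a lexer rule that simply fails to match cannot do.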

Jim

On Aug 15, 2012, at 8:40 PM, Zhaohui Yang <yezonghui at gmail.com> wrote:

I admit that my grammar was not well designed in the first place. And I'm
working on it.

However, lexer state is not that evil a thing anyway. At least it simplifies
things conceptually. As for this example of PICTURE string, if I use a
parser rule pic_string to capture that, I'll have to imagine all kinds of
tokens/parser rules that may combine into a pic_string. For example,
"$AX(9).99" would be a "$", an array(index) expression, and a decimal
number starting with dot. This could be frustrating enough.

Well, I'm still trying to modify the lexer so that the pic_string could be
a combination of simple tokens. One question is: how do I ensure these tokens
do not have spaces between them?
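
One possible way to check for gaps, sketched under assumptions: if each token records the character offsets where it starts and stops in the input (ANTLR 3's CommonToken carries such indexes, as far as I recall), then two tokens are adjacent exactly when the second starts one position after the first stops. PosTok and Adjacency below are hypothetical plain-Java stand-ins, not ANTLR classes.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical token carrying its character offsets in the input. */
class PosTok {
    final String text;
    final int start; // offset of the first character
    final int stop;  // offset of the last character (inclusive)

    PosTok(String text, int start, int stop) {
        this.text = text;
        this.start = start;
        this.stop = stop;
    }
}

class Adjacency {
    /** True when no characters were skipped between consecutive tokens. */
    static boolean contiguous(List<PosTok> toks) {
        for (int i = 1; i < toks.size(); i++) {
            if (toks.get(i).start != toks.get(i - 1).stop + 1) {
                return false; // something (e.g. a space) sits between them
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "$AX(9)" starting at offset 4, with no gaps between tokens.
        List<PosTok> toks = Arrays.asList(
            new PosTok("$", 4, 4), new PosTok("AX", 5, 6),
            new PosTok("(", 7, 7), new PosTok("9", 8, 8),
            new PosTok(")", 9, 9));
        System.out.println(contiguous(toks));
    }
}
```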

Back to the lexer state thing. I found that ANTLR 2.7 has a TokenStreamSelector
for exactly this purpose. And it can result in smaller lexer classes since
each lexer cares for its own DFA, not polluting each other.

I would really like to see this TokenStreamSelector in ANTLR 3. Really! :(

2012/8/16 Jim Idle <jimi at temporal-wave.com>

This really means that your lexer is too complicated and I suspect that
you are just trying to type in a grammar from a normative spec without
thinking ahead a little (not trying to insult you here). The specs are
usually designed to explain the language/syntax, not necessarily to be
copied straight into a parser grammar.

You should really post your grammar files to get better help, but
generally you are trying to introduce context/state into the lexer, which
is not necessary in all but a few cases. For instance, why do you care
about the token type in the lexer if the same pattern is used for two
token types? Take a token that matches a PIC pattern generally, then
verify that the pattern is a good PIC spec when you are walking the tree,
not in the lexer.

On top of this, if you are trying to drive the lexer state from the
parser, then it is very unlikely it will work anyway.

Try to take a step back, and reduce the number of tokens to a minimum,
remove any state that you can, move all the error checking and validation
as far away from the lexer as you can (at the lexer level you have a
minimum context, at the tree walk level you have much more information and
can issue much better errors/warnings).

Next, you don't need a 'fix' for ANTLR. You will find that as you simplify
the grammar and spend time on left factoring the rules, that all/a lot of
your problems will go away. If you still have issues with generated code
size at that point, then you need to start importing grammars and
debugging remotely (do not use the interpreter in ANTLRWorks anyway), not
trying to change the output of ANTLR. The only time I have had to use
imports is for a full TSQL grammar, which is huge because SQL is so
terrible. COBOL is pretty big, but nothing like SQL.

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Zhaohui Yang
> Sent: Wednesday, August 15, 2012 8:18 AM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] big lexer problem
>

> Hi,
>
> I'm having big problem with big generated Lexer.java. Any help
> appreciated.
>
> The language is COBOL. And I found multiple reasons that the lexer
> gets too big:
>
> 1. I'm adding semantic predicates to the lexer, to simulate "lexer state"
>    as in YACC and JavaCC. It's like
>
>        PICTURE_STRING: {lexerState==PICTURE_STATE}?=> blah blah
>            // matching things like AXX(9).99 after a 'PIC' keyword
>
>    The lexer without semantic predicates is 18K lines.
>    When I add predicates to one or two of the lexer rules, it grows to
> more than 20K.
>    When I add one more, it explodes to more than 60K and ANTLR
> gives up generating the lexer with the error: code is too long.
>
> 2. COBOL has a LOT of keywords, that may explain the original 18K
> lines.
>
> 3. I have tokens referencing other tokens.
>    I've inlined most of them now, as suggested by others. But the size
> has not reduced much.
>
> So the question could be:
> 1. How to generate a smaller lexer without removing the semantic predicates?
> 2. If that's not possible, how to simulate "lexer state" without
> semantic predicates?
> 3. Any other solution?
>
> Thanks.
>
> --
> Regards,
>
> Yang, Zhaohui
>

> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-
> email-address

-- 
Regards,

Yang, Zhaohui