[antlr-interest] big lexer problem

Jim Idle jimi at temporal-wave.com
Wed Aug 15 11:32:08 PDT 2012


This really means that your lexer is too complicated, and I suspect that
you are just typing in a grammar from a normative spec without thinking
ahead a little (not trying to insult you here). Specs are usually designed
to explain the language/syntax, not necessarily to be copied straight into
a parser grammar.

You should really post your grammar files to get better help, but
generally you are trying to introduce context/state into the lexer, which
is unnecessary in all but a few cases. For instance, why do you care about
the token type in the lexer if the same pattern is used for two token
types? Take a token that matches a PIC-like pattern generally, then verify
that the pattern is a good PIC spec when you are walking the tree, not in
the lexer.
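
A rough sketch of that first step (the rule names and character set below
are mine, not from your grammar, and real COBOL needs more care, e.g.
around the sentence-terminating period):

    // Hypothetical lexer grammar: accept anything that roughly looks like
    // a picture string (or an ordinary word) as one token; whether it is a
    // legal PICTURE clause is decided later, from context, not here.
    lexer grammar PicSketch;

    PIC_CANDIDATE
        :   ( 'A'..'Z' | '0'..'9'
            | '(' | ')' | '.' | ',' | '+' | '-' | '*' | '/' | '$'
            )+
        ;

    WS  :   ( ' ' | '\t' | '\r' | '\n' )+ { $channel = HIDDEN; } ;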

On top of this, if you are trying to drive the lexer state from the
parser, then it is very unlikely to work anyway.

Try to take a step back: reduce the number of tokens to a minimum, remove
any state that you can, and move all the error checking and validation as
far away from the lexer as you can (at the lexer level you have minimal
context; at the tree-walk level you have much more information and can
issue much better errors/warnings).
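
For instance, once the parser has built an AST, the PICTURE check can sit
in a tree-grammar action, where you already know you are inside a PIC
clause. Everything below (the grammar names, the PIC_CLAUSE/PIC_CANDIDATE
tokens, the regex) is purely illustrative of where the check belongs, not
a real validator:

    // Hypothetical tree grammar; CobolParser is assumed to be the parser
    // grammar that built the AST, with an imaginary PIC_CLAUSE node.
    tree grammar CobolChecker;

    options { tokenVocab = CobolParser; ASTLabelType = CommonTree; }

    picClause
        :   ^(PIC_CLAUSE pic=PIC_CANDIDATE)
            {
                // Crude approximation: picture symbols plus parenthesised
                // repeat counts such as 9(3) or X(20).
                if (!$pic.text.matches("([ABENPSVXZ09/,.+*$-]|\\(\\d+\\))+")) {
                    // emitErrorMessage() is inherited from BaseRecognizer.
                    emitErrorMessage("invalid PICTURE string: " + $pic.text);
                }
            }
        ;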

Next, you don't need a 'fix' for ANTLR. You will find that as you simplify
the grammar and spend time left-factoring the rules, most, if not all, of
your problems will go away. If you still have issues with generated code
size at that point, then you need to start importing grammars and
debugging remotely (do not use the interpreter in ANTLRWorks anyway),
rather than trying to change the output of ANTLR. The only time I have had
to use imports is for a full TSQL grammar, which is huge because SQL is so
terrible. COBOL is pretty big, but nothing like SQL.
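
To give a feel for what importing looks like (every name below is invented
for the example), an ANTLR 3 composite lexer can push the keyword rules
into a delegate grammar; ANTLR then generates a separate class for the
delegate, which spreads the code over more than one .java file:

    // --- CobolKeywords.g : delegate lexer grammar holding only keywords ---
    lexer grammar CobolKeywords;

    ACCEPT  : 'ACCEPT' ;
    ADD     : 'ADD' ;
    DISPLAY : 'DISPLAY' ;
    // ... the remaining keyword rules ...

    // --- CobolLexer.g : the main lexer, importing the keyword delegate ---
    lexer grammar CobolLexer;
    import CobolKeywords;

    INTEGER : ('0'..'9')+ ;
    WS      : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;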


Jim






> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Zhaohui Yang
> Sent: Wednesday, August 15, 2012 8:18 AM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] big lexer problem
>
> Hi,
>
> I'm having a big problem with the big generated Lexer.java. Any help is
> appreciated.
>
> The language is COBOL, and I have found multiple reasons that the lexer
> gets too big:
>
> 1. I'm adding semantic predicates into the lexer to simulate "lexer state"
>    as in YACC and JavaCC. It's like
>
>        PICTURE_STRING: {lexerState==PICTURE_STATE}?=> blah blah
>        // matching things like AXX(9).99 after a 'PIC' keyword
>
>    The lexer without semantic predicates is 18K lines.
>    When I add predicates to one or two of the lexer rules, it grows to
>    more than 20K.
>    When I add just one more, it explodes to more than 60K lines and ANTLR
>    gives up generating the lexer with an error that the code is too long.
>
> 2. COBOL has a LOT of keywords, which may explain the original 18K
>    lines.
>
> 3. I have tokens referencing other tokens.
>    I've inlined most of them now, as suggested by others, but the size
>    has not come down much.
>
> So the questions are:
> 1. How do I generate a smaller lexer without removing the semantic
>    predicates?
> 2. If that's not possible, how do I simulate "lexer state" without
>    semantic predicates?
> 3. Any other solution?
>
> Thanks.
>
> --
> Regards,
>
> Yang, Zhaohui
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-
> email-address

