[antlr-interest] Reducing tokens DFA size by splitting lexer
Fady Moussallam
fady at legsem.com
Fri Oct 30 09:13:57 PDT 2009
Hello listers,
I found a solution to work around an issue with my lexer where the tokens
DFA was becoming too large.
I would appreciate the list's advice on this technique before I go too far in
the wrong direction.
After reading several posts on the list, I realized 80% of the DFA volume
was generated by mixing, in the same lexer grammar, rules such as:
EXTERNAL_KEYWORD : 'EXTERNAL';
GLOBAL_KEYWORD : 'GLOBAL';
...
DATA_NAME : LETTER (LETTER|'0'..'9'|'-')*;
fragment LETTER : 'A'..'Z'| 'a'..'z';
I have about 60 such xxx_KEYWORD rules.
Of course, this is the situation described in several places where DATA_NAME
is ambiguous (any of the xxx_KEYWORD tokens is also a valid DATA_NAME) and
ANTLR needs to build a complex DFA.
I am not in the situation where keywords are valid identifiers though. I am
dealing with COBOL where keywords are reserved words that can't be used as
DATA_NAME.
My lexer works fine, but it has become quite difficult to debug, and I am fast
approaching some Java limits (e.g. the "code too large" compiler error on the
generated DFA methods).
So what I did is this:
1. Split the lexer into a primary lexer and a secondary lexer.
2. Move all keyword rules to the secondary lexer.
3. Add an option like tokenVocab=SecondaryLexer; to the primary lexer.
4. From this point on, the primary lexer only recognizes DATA_NAME, so add
an action in DATA_NAME like so:
DATA_NAME
    :   LETTER (LETTER | '0'..'9' | '-')*
        {
            $type = matchKeywords(getText(), $type);
        }
    ;
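To make the split concrete, the two grammars could look roughly like this (a sketch; the grammar names PrimaryLexer and SecondaryLexer are my own, not from the original post):

```antlr
// SecondaryLexer.g -- keywords only (illustrative names)
lexer grammar SecondaryLexer;

EXTERNAL_KEYWORD : 'EXTERNAL';
GLOBAL_KEYWORD   : 'GLOBAL';
// ... the other ~60 keyword rules ...

// PrimaryLexer.g -- imports the keyword token types
lexer grammar PrimaryLexer;

options {
    tokenVocab = SecondaryLexer;
}

DATA_NAME
    :   LETTER (LETTER | '0'..'9' | '-')*
        { $type = matchKeywords(getText(), $type); }
    ;

fragment LETTER : 'A'..'Z' | 'a'..'z';
```

With tokenVocab, the primary lexer reuses the token type numbers generated for the secondary lexer, so the $type reassignment in the action yields types the parser already knows about.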
And this is the matchKeywords member method:
/**
 * Asks the secondary lexer to check if the text is a keyword. If there is
 * a match, makes sure the entire text was matched (as opposed to a
 * substring). If the text matches a keyword that needs to be skipped, skip.
 * @param text the text to match against keywords
 * @param originalType the initial token type
 * @return the keyword type if a match is found, otherwise the original type
 * @throws RecognitionException if the call to the secondary lexer fails
 */
public int matchKeywords(
        final String text,
        final int originalType) throws RecognitionException {
    try {
        int type = originalType;
        keywordLexer.setCharStream(new ANTLRNoCaseReaderStream(
                new StringReader(text)));
        CommonTokenStream kTokens = new CommonTokenStream(keywordLexer);
        List<?> kTokenl = kTokens.getTokens();
        if (kTokenl.size() > 0) {
            CommonToken kToken = (CommonToken) kTokenl.get(0);
            // Only accept the match if the keyword covers the entire text,
            // not just a leading substring of it.
            if (kToken.getText().length() == text.length()) {
                if (kToken.getType() == Token.SKIP_TOKEN.getType()) {
                    skip();
                } else {
                    type = kToken.getType();
                }
            }
        }
        return type;
    } catch (IOException e) {
        throw new RecognitionException(input);
    }
}
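For comparison, the classic alternative when keywords are reserved words (as they are in COBOL) is a plain keyword table consulted from the DATA_NAME action instead of a second lexer. A minimal sketch of that idea, independent of any ANTLR types (the class, method names, and token type constants below are mine, purely illustrative):

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Sketch of a keyword table mapping reserved words to token types. */
public class KeywordTable {
    // Illustrative token type constants; real values would come from
    // the generated lexer's token vocabulary.
    static final int DATA_NAME = 1;
    static final int EXTERNAL_KEYWORD = 2;
    static final int GLOBAL_KEYWORD = 3;

    private final Map<String, Integer> keywords = new HashMap<>();

    public KeywordTable() {
        keywords.put("EXTERNAL", EXTERNAL_KEYWORD);
        keywords.put("GLOBAL", GLOBAL_KEYWORD);
        // ... the other ~60 keywords ...
    }

    /**
     * Returns the keyword token type for text, or originalType if
     * text is not a keyword.
     */
    public int matchKeywords(String text, int originalType) {
        // COBOL keywords are case-insensitive, so normalize first.
        Integer type = keywords.get(text.toUpperCase(Locale.ROOT));
        return (type != null) ? type : originalType;
    }
}
```

The DATA_NAME action would then call matchKeywords(getText(), $type) just as in the two-lexer version, but with an O(1) hash lookup and no second DFA at all; the trade-off is maintaining the keyword list by hand rather than as lexer rules.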
And that seems to work. The combined size of the primary and secondary lexer
DFAs is about 20% of what it was when I had everything in a single lexer grammar.
For the moment the only serious drawback is that I can't get ANTLRWorks to
work with a combination of a parser grammar and two lexers.
It seems to me that this idea of lexers delegating to other, more specialized
lexers is similar to AST enrichment, where a chain of tree walkers can refine
an AST.
Please let me know your thoughts.
Thanks
Fady