[antlr-interest] Reducing tokens DFA size by splitting lexer
Fady Moussallam
fady at legsem.com
Fri Oct 30 09:13:57 PDT 2009
Hello listers,
I found a solution to work around an issue with my lexer where the tokens
DFA was becoming too large.
I would appreciate the list's advice on this technique before I go too far in
the wrong direction.
After reading several posts on the list, I realized 80% of the DFA volume
was generated by mixing, in the same lexer grammar, rules such as:
EXTERNAL_KEYWORD : 'EXTERNAL';
GLOBAL_KEYWORD : 'GLOBAL';
...
DATA_NAME : LETTER (LETTER|'0'..'9'|'-')*;
fragment LETTER : 'A'..'Z'| 'a'..'z';
I have about 60 such xxx_KEYWORD rules.
Of course, this is the situation described in several places where DATA_NAME
is ambiguous (any of the xxx_KEYWORD tokens is also a valid DATA_NAME) and
ANTLR needs to build a complex DFA.
I am not in the situation where keywords are valid identifiers though. I am
dealing with COBOL where keywords are reserved words that can't be used as
DATA_NAME.
My lexer works fine, but it has become quite difficult to debug, and I am fast
approaching some Java limits (e.g. the "code too large" compiler error on the
generated DFA methods).
So what I did is this:
1. Split the lexer into a primary lexer and a secondary lexer.
2. Move all keyword rules to the secondary lexer.
3. Add an option like tokenVocab=SecondaryLexer; to the primary lexer.
4. From this point on, the primary lexer only recognizes DATA_NAME, so add
an action in DATA_NAME like so:
DATA_NAME
    :   LETTER (LETTER | '0'..'9' | '-')*
        {
            $type = matchKeywords(getText(), $type);
        }
    ;
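To make the split concrete, the two grammars could look roughly like this (a sketch; the grammar names PrimaryLexer and SecondaryLexer are my own, not from the original post):

```antlr
// SecondaryLexer.g -- keywords only (illustrative names)
lexer grammar SecondaryLexer;

EXTERNAL_KEYWORD : 'EXTERNAL';
GLOBAL_KEYWORD   : 'GLOBAL';
// ... the other ~60 keyword rules ...

// PrimaryLexer.g -- imports the keyword token types
lexer grammar PrimaryLexer;

options {
    tokenVocab = SecondaryLexer;
}

DATA_NAME
    :   LETTER (LETTER | '0'..'9' | '-')*
        { $type = matchKeywords(getText(), $type); }
    ;

fragment LETTER : 'A'..'Z' | 'a'..'z';
```

With tokenVocab, the primary lexer reuses the token type numbers generated for the secondary lexer, so the $type reassignment in the action yields types the parser already knows about.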
And this is the matchKeywords member method:
/**
 * Asks the secondary lexer to check if the text is a keyword. If there is
 * a match, makes sure the entire text was matched (as opposed to a
 * substring). If the text matches a keyword that needs to be skipped, skip.
 * @param text the text to match against keywords
 * @param originalType the initial token type
 * @return the keyword type if a match is found, otherwise the original type
 * @throws RecognitionException if the call to the secondary lexer fails
 */
public int matchKeywords(
        final String text,
        final int originalType) throws RecognitionException {
    try {
        int type = originalType;
        keywordLexer.setCharStream(new ANTLRNoCaseReaderStream(
                new StringReader(text)));
        CommonTokenStream kTokens = new CommonTokenStream(keywordLexer);
        List<?> kTokenl = kTokens.getTokens();
        if (kTokenl.size() > 0) {
            CommonToken kToken = (CommonToken) kTokenl.get(0);
            // Only accept the match if the keyword covers the entire text,
            // not just a leading substring of it.
            if (kToken.getText().length() == text.length()) {
                if (kToken.getType() == Token.SKIP_TOKEN.getType()) {
                    skip();
                } else {
                    type = kToken.getType();
                }
            }
        }
        return type;
    } catch (IOException e) {
        throw new RecognitionException(input);
    }
}
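For comparison, the classic alternative when keywords are reserved words (as they are in COBOL) is a plain keyword table consulted from the DATA_NAME action instead of a second lexer. A minimal sketch of that idea, independent of any ANTLR types (the class, method names, and token type constants below are mine, purely illustrative):

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Sketch of a keyword table mapping reserved words to token types. */
public class KeywordTable {
    // Illustrative token type constants; real values would come from
    // the generated lexer's token vocabulary.
    static final int DATA_NAME = 1;
    static final int EXTERNAL_KEYWORD = 2;
    static final int GLOBAL_KEYWORD = 3;

    private final Map<String, Integer> keywords = new HashMap<>();

    public KeywordTable() {
        keywords.put("EXTERNAL", EXTERNAL_KEYWORD);
        keywords.put("GLOBAL", GLOBAL_KEYWORD);
        // ... the other ~60 keywords ...
    }

    /**
     * Returns the keyword token type for text, or originalType if
     * text is not a keyword.
     */
    public int matchKeywords(String text, int originalType) {
        // COBOL keywords are case-insensitive, so normalize first.
        Integer type = keywords.get(text.toUpperCase(Locale.ROOT));
        return (type != null) ? type : originalType;
    }
}
```

The DATA_NAME action would then call matchKeywords(getText(), $type) just as in the two-lexer version, but with an O(1) hash lookup and no second DFA at all; the trade-off is maintaining the keyword list by hand rather than as lexer rules.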
And that seems to work. The combined size of the primary and secondary lexer
DFAs is about 20% of what it was when I had everything in a single lexer grammar.
For the moment the only serious drawback is that I can't get ANTLRWorks to
work with a combination of a parser grammar and two lexers.
It seems to me that this idea of lexers delegating to other, more specialized
lexers is similar to AST enrichment, where a chain of tree walkers can refine
an AST.
Please let me know your thoughts.
Thanks
Fady