[antlr-interest] Multi-word keyword matching question
Ryan Hollom
ryan.hollom at us.lawson.com
Wed Jan 3 07:00:17 PST 2007
Greetings all-
I have posted the following message a couple of times in the past, and
have not yet received an answer that describes how the antlr keyword
recognition really works. Perhaps the new year will bring forth a new
perspective on the topic.
I am writing a parser/lexer for a language with the following constructs,
and am having difficulty getting the lexer to work properly.
identifier is a Field
identifier is a LocalField
identifier is Numeric
identifier is an AddressField
So, I have the keywords 'is a Field', 'is a', 'is', and 'is an'
('LocalField' and 'AddressField' can be any name, and 'Numeric' can be one
of about 30 different things).
I've had a difficult time defining the rules for these, as I get a lexer
no viable alt error. Here is an example of my rule:
fieldDefinition :
identifier 'is a Field'
(identifier
(('is a' | 'is an') identifier) | 'is'
primitiveType)
);
primitiveType : 'Alpha' | 'Numeric';
Here is the example lexer:
NEWLINE : (('\r')? '\n' )+ ;
ID : ( 'A' .. 'Z' | '0' .. '9') ( 'A' .. 'Z' | 'a' .. 'z' | '0' ..
'9')*;
WS : (' '|'\t')+ {$channel=HIDDEN;}; <-- is this rule the
problem???
The lexer then chokes on input like:
MyField is a Field
MyNumericField is Numeric
with a no viable alt line 2:20; char='N'
The solution in the past has been to break apart the multi-word keywords
into multiple single-word keywords. This, conceptually, makes sense
(although there are added complications with keyword vs identifier
matching). However, my grammar is quite large (the generated parser is
~3Mb), and breaking apart all of the multi-word keywords increases the
size of the parser by over 200k. More disconcerting, however, is that
when I do so the parser no longer compiles as "the static initializer
exceeds 65k bytes". It also increases the max fixed k by 6, and I wonder
what it will do to performance (consider that some of the multi word
keywords are 6+ words).
It would seem to me that the lexer should match the longest keyword
possible. That is, it should first try to match 'is a Field', followed by
'is a', and finally 'is'.
Can anyone shed some light as to why antlr does not behave this way?
Thanks in advance,
Ryan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.antlr.org/pipermail/antlr-interest/attachments/20070103/4413f994/attachment.html
More information about the antlr-interest
mailing list