[antlr-interest] Multi-word keyword matching question

Wed Jan 3 07:00:17 PST 2007

Greetings all-

I have posted the following message a couple of times in the past, and 
have not yet received an answer that explains how ANTLR's keyword 
recognition actually works.  Perhaps the new year will bring forth a new 
perspective on the topic.

I am writing a parser/lexer for a language with the following constructs, 
and am having difficulty getting the lexer to work properly. 

        identifier is a Field 
        identifier is a LocalField 
        identifier is Numeric 
        identifier is an AddressField 

So, I have the keywords 'is a Field', 'is a', 'is', and 'is an' 
('LocalField' and 'AddressField' can be any name, and 'Numeric' can be one 
of about 30 different things). 

I've had a difficult time defining the rules for these, because the lexer 
reports a "no viable alt" error.  Here is an example of my parser rule:
fieldDefinition : 
        identifier 'is a Field' 
                (identifier  
                        (('is a' | 'is an') identifier | 'is' primitiveType)
                ); 

primitiveType : 'Alpha' | 'Numeric'; 

Here is the example lexer:
NEWLINE :   (('\r')? '\n' )+ ; 
ID        : ( 'A' .. 'Z' | '0' .. '9') ( 'A' .. 'Z' | 'a' .. 'z' | '0' .. '9')*; 
WS        :        (' '|'\t')+ {$channel=HIDDEN;}; // <-- is this rule the problem?

The lexer then chokes on input like: 
MyField is a Field 
        MyNumericField is Numeric 

with a "no viable alt" error at line 2:20; char='N' 

The solution in the past has been to break apart the multi-word keywords 
into multiple single-word keywords.  This, conceptually, makes sense 
(although there are added complications with keyword vs identifier 
matching).  However, my grammar is quite large (the generated parser is 
~3 MB), and breaking apart all of the multi-word keywords increases the 
size of the parser by over 200 KB.  More disconcerting, however, is that 
when I do so the generated parser no longer compiles, failing with "the 
static initializer exceeds 65k bytes".  It also increases the max fixed k 
by 6, and I wonder what it will do to performance (consider that some of 
the multi-word keywords are 6+ words).
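
To make the single-word workaround concrete, here is a rough sketch in 
Python rather than ANTLR.  It is only an illustration: the keyword set and 
the names classify and parse_definition are made up for this example and 
are not part of my grammar.  Each word becomes its own token and the 
parser has to look at up to three of them to tell the constructs apart, 
which is also why words such as 'a' can no longer be used as identifiers.

```python
# Hypothetical single-word keyword set obtained by splitting
# 'is a Field', 'is a', 'is an' and 'is' into individual words.
SINGLE_WORD_KEYWORDS = {"is", "a", "an", "Field"}


def classify(word):
    """Return 'KEYWORD' for a reserved word, 'ID' for anything else."""
    return "KEYWORD" if word in SINGLE_WORD_KEYWORDS else "ID"


def parse_definition(line):
    """Parse one 'identifier is ...' line into (name, kind, target)."""
    words = line.split()
    kinds = [classify(w) for w in words]
    # The first word must be an identifier, so reserved words are rejected.
    if not words or kinds[0] != "ID":
        raise ValueError("expected an identifier at the start of: %r" % line)
    name, rest = words[0], words[1:]
    # Three tokens of lookahead are needed to recognise 'is a Field'.
    if rest == ["is", "a", "Field"]:
        return (name, "field", None)
    if (len(rest) == 3 and rest[0] == "is"
            and rest[1] in ("a", "an") and kinds[3] == "ID"):
        return (name, "reference", rest[2])
    if len(rest) == 2 and rest[0] == "is" and kinds[2] == "ID":
        return (name, "primitive", rest[1])
    raise ValueError("unrecognised definition: %r" % line)
```

For example, parse_definition("MyField is a Field") yields a "field" 
entry, while parse_definition("a is Numeric") raises ValueError because 
'a' is now reserved.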

It would seem to me that the lexer should match the longest keyword 
possible.  That is, it should first try to match 'is a Field', followed by 
'is a', and finally 'is'.

Can anyone shed some light on why ANTLR does not behave this way?

Thanks in advance, 
Ryan 