[antlr-interest] lexing expression ('a'..'z')+ not matching single character input

Thu Dec 14 06:17:23 PST 2006

Hi again,

> You have most of the single lowercase letters specified to be KEYWORDS
> of your language.
>
> Take "n", the first error you find, for example. It is a keyword as
> specified by the two rules: monosac_type_identifier and
> linkage_type_identifier.  and so "n" is NOT an IDENTIFIER.

Bingo. After changing the 'n' to a letter that is not used elsewhere
as a literal, the input does indeed parse as expected. The heavy use
of literals is the problem, then.

With the benefit of hindsight, the error message 'Expected an  
IDENTIFIER but got "b"' does indeed point to 'b' having been  
identified by ANTLR as something other than an IDENTIFIER, but to my  
inexperienced eyes the cause was not immediately obvious.

Perhaps the error message given by ANTLR for this situation could be  
modified slightly?

Instead of:

"Expected an XXX but got a 'YYY'"

... where YYY was extracted from elsewhere in the grammar and  
silently made into a literal by ANTLR, an error message like:

"Expected an XXX but got literal 'YYY'"

would make it clearer that 'YYY' exists as a literal elsewhere in
the grammar and that, for that reason, 'YYY' will never be taken to be
an XXX (unless testLiterals is false).
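
To illustrate what I mean, here is a toy Python model of the behaviour
as I now understand it. It is not ANTLR's actual code: the LITERALS set
and the token_type function are names I made up for the sketch, and the
set only contains the single letters mentioned in this thread.

```python
import re

# Hypothetical literals table: strings that appear quoted in parser rules.
# (Assumption for illustration; the real table is built by ANTLR.)
LITERALS = {"n", "b", "s", "i", "a", "o", "x"}


def token_type(text, test_literals=True):
    """Classify a lexeme matched by ('a'..'z')+ in this toy model."""
    if not re.fullmatch(r"[a-z]+", text):
        raise ValueError("not matched by ('a'..'z')+: %r" % text)
    # With the literals check on, a lexeme found in the table is
    # reported as that literal instead of as an IDENTIFIER.
    if test_literals and text in LITERALS:
        return "LITERAL"
    return "IDENTIFIER"


for lexeme in ("b", "bbb"):
    print(lexeme, token_type(lexeme), token_type(lexeme, test_literals=False))
```

In this model a lone 'b' can only come back as an IDENTIFIER when the
literals check is switched off, which matches what I was seeing.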

> There should be lots of messages in this mailing list's archives on
> how to handle keywords which also may be identifiers...

Indeed, it seems to be a very commonly encountered problem. But what  
if your "keywords" are usually single letters and highly
context-dependent? I am somewhat reluctant to declare all my keywords as
IDENTIFIERS and to use semantic predicates everywhere -- I'd like the  
grammar to be easily targeted to other languages besides Java and it  
just feels wrong using code everywhere instead of real grammar. There  
has to be a better way?

For example, given the input text "RES 1b:b-bbb-hex-1:5|6:a;", the  
first 'b' is effectively a keyword, the second 'b' is another keyword  
from a different set of keywords, and the third to fifth 'b' together
constitute an identifier. To clarify, in the position of the first 'b',
any of the
characters { 'b', 's', 'i' } would be allowed, in the position of the  
second 'b' { 'a', 'b', 'o', 'x' } would be allowed, whereas an  
identifier can be any ('a'..'z')+.
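
To make the positional rule concrete, here is a rough sketch in Python
rather than ANTLR. It only covers the start of the example (the part up
to "bbb"), and the shape of the pattern (a number, the first letter, a
colon, the second letter, a dash, then the identifier) is just my
reading of that one input line, not the real grammar. The names PATTERN
and classify_prefix are invented for the sketch.

```python
import re

# Assumed layout of the start of a line, read off the single example:
#   <digits><first keyword>:<second keyword>-<identifier>
PATTERN = re.compile(
    r"(?P<index>\d+)"
    r"(?P<first>[bsi])"        # first position: one of b, s, i
    r":"
    r"(?P<second>[abox])"      # second position: one of a, b, o, x
    r"-"
    r"(?P<identifier>[a-z]+)"  # identifier: ('a'..'z')+
)


def classify_prefix(text):
    """Return the role of each piece at the start of text, by position."""
    match = PATTERN.match(text)
    if match is None:
        raise ValueError("input does not start as expected: %r" % text)
    return match.groupdict()


print(classify_prefix("1b:b-bbb-hex-1:5|6:a;"))
```

The same letter 'b' ends up in three different roles purely because of
where it sits, which is exactly what a context-free literals table
cannot express.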

Could anyone please recommend a strategy that efficiently deals with  
this predicament?

> Hope this helps...

Indeed it has :-) Now that we've identified the problem, I'm fearful  
of what the correct solution to my problem will be ;-)

cheers,
Matt