[antlr-interest] lexing expression ('a'..'z')+ not matching single character input
Matt Harrison
matt at ebi.ac.uk
Thu Dec 14 06:17:23 PST 2006
Hi again,
> You have most of the single lowercase letters specified to be KEYWORDS
> of your language.
>
> Take "n", the first error you find, for example. It is a keyword as
> specified by the two rules: monosac_type_identifier and
> linkage_type_identifier. and so "n" is NOT an IDENTIFIER.
Bingo. After changing the 'n' to any letter that is not used elsewhere
as a literal, the input does indeed parse as expected. The heavy use of
literals is the problem then.
With the benefit of hindsight, the error message 'Expected an
IDENTIFIER but got "b"' does indeed point to 'b' having been
identified by ANTLR as something other than an IDENTIFIER, but to my
inexperienced eyes the cause was not immediately obvious.
Perhaps the error message given by ANTLR for this situation could be
modified slightly?
Instead of:
"Expected an XXX but got a 'YYY'"
... where YYY was extracted from elsewhere in the grammar and
silently made into a literal by ANTLR, an error message like:
"Expected an XXX but got literal 'YYY'"
would make it more clear that 'YYY' exists as a literal elsewhere in
the grammar and for that reason, 'YYY' will never be taken to be an
XXX (unless testLiterals is false).
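For illustration, the effect can be modelled roughly as below. This is a
Python sketch of the behaviour described in this thread, not ANTLR's actual
code: the names LITERALS and classify are made up for the example, and the
contents of the table are assumed.

```python
# Hypothetical model of a lexer that checks a literals table after matching
# ('a'..'z')+. This is NOT ANTLR's implementation, only a sketch of the effect.

# Assumed: strings that appear quoted somewhere in the parser rules.
LITERALS = {"n", "b", "a"}


def classify(text, test_literals=True):
    """Return the token type that a ('a'..'z')+ match would end up with."""
    if not (text.isascii() and text.isalpha() and text.islower()):
        raise ValueError(f"not an ('a'..'z')+ match: {text!r}")
    # With the literals test on, a match found in the table is retyped,
    # so it can never reach the parser as an IDENTIFIER.
    if test_literals and text in LITERALS:
        return "LITERAL"
    return "IDENTIFIER"
```

With the table above, classify("n") gives "LITERAL" while classify("bbb")
gives "IDENTIFIER", which is why a message that says "literal" would point
straight at the cause.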
> There should be lots of messages in this mailing list's archives on
> how to handle keywords which also may be identifiers...
Indeed, it seems to be a very commonly encountered problem. But what
if your "keywords" are usually single letters and highly
context-dependent? I am somewhat reluctant to declare all my keywords as
IDENTIFIERS and to use semantic predicates everywhere -- I'd like the
grammar to be easily targeted to other languages besides Java and it
just feels wrong using code everywhere instead of real grammar. There
has to be a better way?
For example, given the input text "RES 1b:b-bbb-hex-1:5|6:a;", the
first 'b' is effectively a keyword, the second 'b' is another keyword
from a different set of keywords, and the 3rd-5th 'b' constitutes an
identifier. To clarify, in the position of the first 'b', any of the
characters { 'b', 's', 'i' } would be allowed, in the position of the
second 'b' { 'a', 'b', 'o', 'x' } would be allowed, whereas an
identifier can be any ('a'..'z')+.
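To make the requirement concrete, here is a rough Python sketch of the
position-dependent reading I am after, covering only the "b:b-bbb" part of
the example. The two letter sets come from the description above; the ':'
and '-' separators are read off the sample input, and the names FIRST_SET,
SECOND_SET and read_fragment are invented for the illustration. It is not a
proposed ANTLR solution, just a statement of the desired behaviour.

```python
import re

# Letter sets taken from the example in this post.
FIRST_SET = {"b", "s", "i"}
SECOND_SET = {"a", "b", "o", "x"}

# Assumed shape of the fragment: letter ':' letter '-' identifier.
FRAGMENT = re.compile(r"([a-z]):([a-z])-([a-z]+)")


def read_fragment(text):
    """Interpret each letter by its position rather than by its spelling."""
    match = FRAGMENT.fullmatch(text)
    if match is None:
        raise ValueError(f"unexpected fragment: {text!r}")
    first, second, identifier = match.groups()
    if first not in FIRST_SET:
        raise ValueError(f"{first!r} not allowed in first position")
    if second not in SECOND_SET:
        raise ValueError(f"{second!r} not allowed in second position")
    # The third position accepts any ('a'..'z')+, including letters that
    # are "keywords" in the other two positions.
    return {"first": first, "second": second, "identifier": identifier}
```

Here read_fragment("b:b-bbb") accepts all three uses of 'b', and
read_fragment("s:x-n") accepts 'n' as an identifier even though 'n' is a
literal elsewhere in the grammar.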
Could anyone please recommend a strategy that efficiently deals with
this predicament?
> Hope this helps...
Indeed it has :-) Now that we've identified the problem, I'm fearful
of what the correct solution to my problem will be ;-)
cheers,
Matt