[antlr-interest] Possible general solution to problem of keywords as identifiers
Ron Hunter-Duvar
ron.hunter-duvar at oracle.com
Thu Sep 4 15:21:30 PDT 2008
Hi,
I've been pondering the problem of languages (such as SQL) where
keywords are not reserved, and so can be used as identifiers. It seems
to keep coming up, and the current solutions (such as the two described
at http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741) are
ugly, and create maintenance issues (yeah, I know there are ways to work
around the maintenance problem, but it's still ugly). I think I've come
up with a general solution, or at least one key piece of one. But maybe
I'm missing something that would make it unworkable, so I thought I'd
run it by the experts.
My idea is to override the match(IntStream input, int ttype, BitSet
follow) method in BaseRecognizer so that it checks for identifiers that
should be treated keywords, or vice versa (depending on whether the
lexer would be returning keywords as identifiers or identifiers as
keywords). It would have to know (through some back channel like a
constant definition) which token type represented identifiers.
If the lexer were written to always return identifiers, not keywords
(which makes for a simpler and possibly faster lexer), the keywords
would have to be separately defined as token types. Then match could be
modified so that when the input token is an identifier, and the token
type it is to match against is a keyword, then it would check for string
equality and treat it as a match if equal. That is, the condition:
if ( input.LA(1)==ttype ) {
would become something like:
if ( input.LA(1)==ttype || (input.LA(1)==IDENTIFIER_TTYPE &&
input.LT(1).equals(tokenText(ttype)))) {
where "tokenText(ttype)" is some method for looking up the token text
corresponding to the integer token type (not sure if there's already a
way to do this).
Conversely, if the lexer always gave preference to returning keywords,
then match would be modified to check for keywords that should be
treated as identifiers, something like this:
if ( input.LA(1)==ttype || (ttype==IDENTIFIER_TTYPE &&
isKeywordType(input.LA(1)))) {
where "isKeywordType(input.LA(1)" somehow checks (perhaps with a hash
table) that the input token type is a keyword that could be interpreted
as an identifier. This approach might suffer from the same maintenance
issues that the current approaches do.
I'm more confident that the first approach would work correctly. In
particular, that it could correctly handle situations where an
identifier or one or more keywords are legal at the same point.
Basically, it should recognize any legal keywords preferentially (as
this is generally how such languages must resolve such ambiguities), and
only treat it as an identifier otherwise. I guess this would require
that all keywords be checked as alternatives before checking for an
identifier, which perhaps adds a certain amount of fragility to the
grammars.
It's too bad the parser doesn't have the first sets available at run
time, as it does the follow sets - these could be used to check for
valid keywords at that point, before treating it as an identifier. This
would eliminate the checking-order constraint, making it even more general.
If this approach is workable, I could even see adding support for it as
an option to Antlr.
Let me know if you see some fatal flaw in this approach that I haven't seen.
Thanks,
Ron
--
Ron Hunter-Duvar | Software Developer V | 403-272-6580
Oracle Service Engineering
Gulf Canada Square 401 - 9th Avenue S.W., Calgary, AB, Canada T2P 3C5
All opinions expressed here are mine, and do not necessarily represent
those of my employer.
More information about the antlr-interest
mailing list