[antlr-interest] Possible general solution to problem of keywords as identifiers

Thu Sep 4 15:21:30 PDT 2008

Hi,

I've been pondering the problem of languages (such as SQL) where 
keywords are not reserved, and so can be used as identifiers. It seems 
to keep coming up, and the current solutions (such as the two described 
at http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741) are 
ugly, and create maintenance issues (yeah, I know there are ways to work 
around the maintenance problem, but it's still ugly). I think I've come 
up with a general solution, or at least one key piece of one. But maybe 
I'm missing something that would make it unworkable, so I thought I'd 
run it by the experts.

My idea is to override the match(IntStream input, int ttype, BitSet 
follow) method in BaseRecognizer so that it checks for identifiers that 
should be treated keywords, or vice versa (depending on whether the 
lexer would be returning keywords as identifiers or identifiers as 
keywords). It would have to know (through some back channel like a 
constant definition) which token type represented identifiers.

If the lexer were written to always return identifiers, not keywords 
(which makes for a simpler and possibly faster lexer), the keywords 
would have to be separately defined as token types. Then match could be 
modified so that when the input token is an identifier, and the token 
type it is to match against is a keyword, then it would check for string 
equality and treat it as a match if equal. That is, the condition:

    if ( input.LA(1)==ttype ) {

would become something like:

    if ( input.LA(1)==ttype || (input.LA(1)==IDENTIFIER_TTYPE && 
input.LT(1).equals(tokenText(ttype)))) {

where "tokenText(ttype)" is some method for looking up the token text 
corresponding to the integer token type (not sure if there's already a 
way to do this).

Conversely, if the lexer always gave preference to returning keywords, 
then match would be modified to check for keywords that should be 
treated as identifiers, something like this:

    if ( input.LA(1)==ttype || (ttype==IDENTIFIER_TTYPE && 
isKeywordType(input.LA(1)))) {

where "isKeywordType(input.LA(1)" somehow checks (perhaps with a hash 
table) that the input token type is a keyword that could be interpreted 
as an identifier. This approach might suffer from the same maintenance 
issues that the current approaches do.

I'm more confident that the first approach would work correctly. In 
particular, that it could correctly handle situations where an 
identifier or one or more keywords are legal at the same point. 
Basically, it should recognize any legal keywords preferentially (as 
this is generally how such languages must resolve such ambiguities), and 
only treat it as an identifier otherwise. I guess this would require 
that all keywords be checked as alternatives before checking for an 
identifier, which perhaps adds a certain amount of fragility to the 
grammars.

It's too bad the parser doesn't have the first sets available at run 
time, as it does the follow sets - these could be used to check for 
valid keywords at that point, before treating it as an identifier. This 
would eliminate the checking-order constraint, making it even more general.

If this approach is workable, I could even see adding support for it as 
an option to Antlr.

Let me know if you see some fatal flaw in this approach that I haven't seen.

Thanks,
Ron

-- 
Ron Hunter-Duvar | Software Developer V | 403-272-6580
Oracle Service Engineering
Gulf Canada Square 401 - 9th Avenue S.W., Calgary, AB, Canada T2P 3C5

All opinions expressed here are mine, and do not necessarily represent
those of my employer.