[antlr-interest] Ambiguous tokens

Tue Apr 3 02:31:32 PDT 2007

I've just been pondering an ambiguity problem I've been having in 
the lexer I'm working on, and an idea occurred to me.  How 
feasible would it be to add support for specifically ambiguous 
lexer tokens to ANTLR?

What I mean by that is: when several top-level lexer rules could 
match a given bit of text equally well, rather than choosing one 
of them and going with that, it would be helpful to generate a 
token that carries *all* the matching types.

For example, the snippet "34" could be either a Decimal or a 
HexByte, and "D2" could be a HexByte or an Identifier.  The lexer 
can't really tell; the parser needs to decide based on surrounding 
context.

Of course this could get tricky if the tokens could be different 
lengths -- I doubt that'd really work.  In this example, HexByte 
is really the only ambiguous one -- sometimes it looks like a 
Decimal, sometimes like an Identifier, but it is always exactly 
two characters long (meaning shorter or longer blocks can't be 
HexBytes), and Decimals and Identifiers themselves don't overlap.
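
To make the overlap concrete, here is a minimal sketch in Python (not 
ANTLR). The three patterns are assumptions made for illustration, since 
the actual lexer rules are not shown here: Decimal is one or more 
digits, HexByte is exactly two hex digits, and Identifier is a letter 
or underscore followed by letters, digits or underscores. The helper 
name candidate_types is hypothetical.

```python
import re

# Assumed token definitions; the real lexer rules may differ.
PATTERNS = {
    "Decimal": re.compile(r"[0-9]+"),
    "HexByte": re.compile(r"[0-9A-Fa-f]{2}"),
    "Identifier": re.compile(r"[A-Za-z_][A-Za-z0-9_]*"),
}


def candidate_types(lexeme):
    """Return the set of token types whose pattern matches the whole lexeme."""
    return {name for name, pattern in PATTERNS.items() if pattern.fullmatch(lexeme)}


if __name__ == "__main__":
    for text in ("34", "D2", "3F", "345", "D2x"):
        print(text, sorted(candidate_types(text)))
```

Under these assumed rules only two-character lexemes ever get more than 
one candidate type, which matches the observation that HexByte is the 
only ambiguous token.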

The current workaround (I think -- I haven't actually tried it 
yet, so improvements would be appreciated) would be to generate 
yet another token for each of the intersection cases (e.g. 
HexByteOrDecimal, HexByteOrIdentifier) and get the grammar to 
accept either one wherever it normally accepts the basic token.  
But it would be nice if there were some more automatic way of 
expressing this sort of relationship.
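
The workaround could look roughly like the following Python sketch 
(again not ANTLR, and untested against a real grammar). The lexer 
emits exactly one type per lexeme, trying the intersection cases 
first, and each parser-level rule lists every token type it is willing 
to accept. The patterns, the rule names (decimal, hexByte, identifier) 
and the helpers token_type and accepts are all assumptions made for 
illustration.

```python
import re

# Assumed lexer rules, tried in order; intersection cases come first
# so that ambiguous two-character lexemes get their own token type.
LEXER_RULES = [
    ("HexByteOrDecimal", re.compile(r"[0-9]{2}")),
    ("HexByteOrIdentifier", re.compile(r"[A-Fa-f][0-9A-Fa-f]")),
    ("HexByte", re.compile(r"[0-9][A-Fa-f]")),
    ("Decimal", re.compile(r"[0-9]+")),
    ("Identifier", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
]

# Hypothetical parser rules: each accepts its basic token plus the
# intersection tokens that might turn out to be that basic token.
ACCEPTS = {
    "decimal": {"Decimal", "HexByteOrDecimal"},
    "hexByte": {"HexByte", "HexByteOrDecimal", "HexByteOrIdentifier"},
    "identifier": {"Identifier", "HexByteOrIdentifier"},
}


def token_type(lexeme):
    """Return the single token type the lexer assigns, or None if nothing matches."""
    for name, pattern in LEXER_RULES:
        if pattern.fullmatch(lexeme):
            return name
    return None


def accepts(rule, lexeme):
    """Return True if the named parser rule would accept this lexeme."""
    return token_type(lexeme) in ACCEPTS[rule]
```

With this arrangement the parser decides from context: "34" is lexed 
as HexByteOrDecimal and is accepted both where a decimal and where a 
hexByte is expected, while "345" is only ever a Decimal. The cost is 
that every rule using a basic token has to list the intersection 
tokens by hand.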