[antlr-interest] Ambiguous tokens
Gavin Lambert
antlr at mirality.co.nz
Tue Apr 3 02:31:32 PDT 2007
I've just been pondering an ambiguity problem I've been having in
the lexer I'm working on, and an idea occurred to me. How
feasible would it be to add support for specifically ambiguous
lexer tokens to ANTLR?
What I mean is that when several top-level lexer rules could
match a given bit of text equally well, rather than choosing one
of them and going with that, it would be helpful to generate a
token that carries *all* the candidate types.
For example, the snippet "34" could be either a Decimal or a
HexByte, and "D2" could be a HexByte or an Identifier. The lexer
can't really tell; the parser needs to decide based on surrounding
context.
Of course this could get tricky if the tokens could be different
lengths -- I doubt that'd really work. In this example, HexByte
is really the only ambiguous one -- sometimes it looks like a
Decimal, sometimes like an Identifier, but it is always exactly
two characters long (meaning shorter or longer blocks can't be
HexBytes), and Decimals and Identifiers themselves don't overlap.
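
To make the overlap concrete, here is a rough sketch in Python (not ANTLR) of what "a token for all the types" might look like. The three patterns are my own guesses at the rules, since the exact definitions aren't given above: Decimal as a run of digits, HexByte as exactly two hex digits, Identifier as a letter or underscore followed by letters, digits, or underscores. The name candidate_types is made up for illustration.

```python
import re

# Hypothetical token definitions; assumed for illustration only.
TOKEN_PATTERNS = {
    "Decimal": re.compile(r"[0-9]+"),
    "HexByte": re.compile(r"[0-9A-Fa-f]{2}"),  # always exactly two characters
    "Identifier": re.compile(r"[A-Za-z_][A-Za-z0-9_]*"),
}


def candidate_types(text):
    """Return the set of every token type whose pattern matches all of text."""
    return {
        name
        for name, pattern in TOKEN_PATTERNS.items()
        if pattern.fullmatch(text)
    }


if __name__ == "__main__":
    for lexeme in ["34", "D2", "3D", "123", "foo"]:
        print(lexeme, sorted(candidate_types(lexeme)))
```

Under these assumed patterns, no lexeme ever comes back as both Decimal and Identifier, and only two-character lexemes can come back with more than one type, which matches the observation that HexByte is the only ambiguous one.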
The current workaround (I think -- I haven't actually tried it
yet, so improvements would be appreciated) would be to generate
yet another token type for each of the intersection cases (e.g.
HexByteOrDecimal, HexByteOrIdentifier) and have the grammar
accept them alongside the basic token wherever it normally
accepts that token. But it would be nice if there were some more
automatic way of expressing this sort of relationship.
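
A minimal sketch of that workaround, again in Python rather than as an ANTLR grammar, and untested against a real grammar. The patterns are the same assumed definitions as before (guesses, not the actual rules), and lex_type, ACCEPTS, and parser_accepts are hypothetical names. The lexer hands out exactly one type per lexeme, using a combined type for each overlap, and the "grammar" side lists which lexer types satisfy each basic token.

```python
import re

# Assumed token shapes; the real lexer rules are not given in the post.
HEX_BYTE = re.compile(r"[0-9A-Fa-f]{2}")
DECIMAL = re.compile(r"[0-9]+")
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def lex_type(text):
    """Assign exactly one token type, using combined types for the overlaps."""
    is_hex = HEX_BYTE.fullmatch(text) is not None
    is_dec = DECIMAL.fullmatch(text) is not None
    is_id = IDENTIFIER.fullmatch(text) is not None
    if is_hex and is_dec:
        return "HexByteOrDecimal"
    if is_hex and is_id:
        return "HexByteOrIdentifier"
    if is_hex:
        return "HexByte"
    if is_dec:
        return "Decimal"
    if is_id:
        return "Identifier"
    raise ValueError("no token type matches %r" % text)


# For each basic token the grammar asks for, the lexer types it should accept.
ACCEPTS = {
    "Decimal": {"Decimal", "HexByteOrDecimal"},
    "HexByte": {"HexByte", "HexByteOrDecimal", "HexByteOrIdentifier"},
    "Identifier": {"Identifier", "HexByteOrIdentifier"},
}


def parser_accepts(wanted, text):
    """True if the lexer's type for text is acceptable where wanted is expected."""
    return lex_type(text) in ACCEPTS[wanted]
```

The ACCEPTS table is the part that has to be written out by hand for every place a basic token is used, which is the bookkeeping a more automatic mechanism would take over.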