[antlr-interest] More on "early commitment" problems with Lexer

Austin Hastings Austin_Hastings at Yahoo.com
Wed Dec 5 12:20:40 PST 2007


Howdy,


I'm trying to lex quoted strings that allow backslash-escapes, but that 
don't pre-define what the backslash escapes are going to be.

This is in contrast with the existing grammar examples, all of which 
have code like:

fragment
EscapeSequence: Slash ('\'' | 't' | 'n' | 'r' | 'b' | '"' | ...)
              | UnicodeEscape | OctalEscape | HexEscape ;

STRING_LITERAL: '"' (options {greedy=false;}: EscapeSequence | .)* '"' ;

In these examples, the list of all possible escape sequences is defined 
in advance, and it seems to work.

My version looked like this:

Qlit_Single:   Q (options {greedy=false;}: Slash Q | . )* Q ;

Where Q = single quote = '\''.

The problem was the "early commit" decision that Antlr3's lexer makes -- 
the same one that rears its ugly head in the float vs. range issue. In 
short, since Slash ('\\') is singled out in the first alternative, the 
generated lexer commits to that alternative as soon as it sees a 
backslash, so the only sequence it accepts is Slash Q. The '.' (any 
character) alternative is never considered for the backslash, so \t 
becomes a lexer error.

I'm sure this is a design feature, and Antlr is working exactly as 
designed, and this really is the way it is supposed to be, blah blah blah.

But if anyone else needs to deal with it, I thought I'd point out the 
workaround:

Match this:

Qlit_Single: Q (options {greedy=false;}: Slash . | .)* Q;

The same early commitment now works in favor of the token: matching 
Slash . means that whatever character follows a slash is gobbled up 
along with it and cannot interfere with the matching. In my case, that 
means that Slash Q gets both the slash and the single quote into the 
string, and Slash anything-else goes in as well. Postprocessing is 
fairly easy.
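
For anyone wondering what that postprocessing might look like, here is a 
rough sketch in Java (the default Antlr3 target). It assumes the token 
text still carries its surrounding single quotes. The class name 
QlitPostprocess, the method names, and the small escape table are all 
made up for illustration; since the whole point is that the escapes are 
not pre-defined, the table is whatever your application decides.

```java
public final class QlitPostprocess {
    private QlitPostprocess() {}

    /**
     * Strips the surrounding single quotes from a Qlit_Single token's text
     * and resolves each backslash escape (hypothetical helper).
     */
    public static String unquote(String tokenText) {
        int end = tokenText.length() - 1;
        if (end < 1 || tokenText.charAt(0) != '\'' || tokenText.charAt(end) != '\'') {
            throw new IllegalArgumentException("not a single-quoted literal: " + tokenText);
        }
        StringBuilder out = new StringBuilder();
        for (int i = 1; i < end; i++) {
            char c = tokenText.charAt(i);
            if (c == '\\' && i + 1 < end) {
                // The lexer matched "Slash ." here, so the next character
                // belongs to the escape no matter what it is.
                out.append(translateEscape(tokenText.charAt(++i)));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Assumed escape table; unknown escapes yield the character itself. */
    static char translateEscape(char c) {
        switch (c) {
            case 'n': return '\n';
            case 't': return '\t';
            default:  return c;   // covers \' and \\ as well
        }
    }
}
```

Calling QlitPostprocess.unquote(token.getText()) from a rule action or a 
later pass would be one way to use it. Unknown escapes fall through to 
the character itself here; rejecting them instead is an equally 
reasonable choice.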

=Austin





