[antlr-interest] Why don't parsers support character ranges?

Wed Apr 23 01:16:49 PDT 2008

At 14:16 23/04/2008, Hannes Schmidt wrote:
 >Why can't I use ranges in parsers?

You can, they just don't mean what you think they mean.

 >Why doesn't ANTLR emit a warning when it ignores ranges in grammar
 >rules?

Because it's not ignoring them.  When you say "'a'..'z'" in a 
parser rule, the parser first automatically creates tokens for the 
quoted terms (because it's a combined grammar; if it were a 
standalone grammar you'd just get an error).  Now the rule says 
something like "T16..T17", which is a range of tokens.  If you're 
lucky, this will be the first time it's seen those tokens and 
they'll have contiguous values, so your range basically just means 
those two tokens and nothing else.  If you're not lucky, there may 
be other tokens in between those two, so you'll be referring to 
those as well.  Either way, it's probably not what you thought you 
were saying.
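
To make that concrete, here is a toy Python model of the behaviour described above. It is not ANTLR's actual code: the starting type number and the helper names are made up for illustration. It only assumes that implicit tokens are numbered in order of first appearance and that a range matches every token type between its two endpoints.

```python
# Toy model only: this is NOT ANTLR's implementation. Token types are
# numbered in order of first appearance, starting at an arbitrary value.
FIRST_TYPE = 4  # hypothetical starting number, chosen for illustration


def assign_token_types(literals):
    """Give each distinct quoted literal the next free token type."""
    types = {}
    for lit in literals:
        if lit not in types:
            types[lit] = FIRST_TYPE + len(types)
    return types


def literals_in_range(types, lo, hi):
    """Return every literal whose token type lies between lo's and hi's."""
    lo_t, hi_t = types[lo], types[hi]
    return sorted(lit for lit, t in types.items() if lo_t <= t <= hi_t)


# Lucky case: 'a' and 'z' are seen first, back to back.
lucky = assign_token_types(["a", "z", "+"])
# Unlucky case: '+' is first seen between them.
unlucky = assign_token_types(["a", "+", "z"])

print(literals_in_range(lucky, "a", "z"))
print(literals_in_range(unlucky, "a", "z"))
```

In the lucky case the range covers only the 'a' and 'z' tokens; in the unlucky case it also lets '+' through. In neither case does it match the letter 'm', which is presumably what the range was meant to do.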

Given that there currently isn't any official way of controlling 
the token ids generated by rules, ranges in the parser probably 
*should* generate warnings, since (while valid) they're not 
especially useful.

(I also dislike the way that quoted constants are permitted in 
parser rules in the first place [since I think it leads to just 
this sort of confusion], but that's a different issue.)

 >How can I emulate the missing range feature without obfuscating
 >my grammar too much? Semantic predicates?

Probably.  Or move whatever construct you're trying to match that 
includes ranges into the lexer as a single token.
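
As a rough sketch of the second option, the following hand-rolled Python lexer (not ANTLR output; the token names LOWER, INT and WS are invented for this example) keeps the character range entirely inside a lexer rule, so that later stages only ever see a single LOWER token.

```python
import re

# Hypothetical token names (LOWER, INT, WS); the original post names none.
TOKEN_SPEC = [
    ("LOWER", r"[a-z]+"),  # the character range lives here, in the lexer
    ("INT", r"[0-9]+"),
    ("WS", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return (type, text) pairs, skipping whitespace; reject unknown chars."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize("abc 42 x"))
```

A parser built on top of this would refer to LOWER by name and never mention 'a' or 'z' at all, which sidesteps the token-range confusion.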

 >Now let me put my tinfoil hat on and theorize a little bit: I
 >think that the root cause of my confusion is ANTLR's distinction
 >between lexer and parser. I think this distinction is purely
 >historical and ANTLR might be better off without it. When writing
 >grammars, I often find myself in situations where I know that
 >certain lexer rules make sense in a certain parser context only
 >but that context is not available to the lexer because the state
 >that defines it is maintained in the parser.

At times I agree with you, but it's usually not all that hard to 
get a decent set of lexer rules.  The tactic I usually follow is 
to write the lexer rules *first*, and unit test them by themselves 
to ensure the token stream is being generated as I expect.  *Then* 
I start writing parser rules to either transform the token stream 
into an AST or to directly do something more interesting.  If you 
think of it in layers then it's not hard to keep it all straight.
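
The layered workflow might look something like this in miniature. This is a plain Python sketch, not ANTLR-generated code, and the token names and the toy rule "sum : INT (PLUS INT)* ;" are assumptions chosen only to keep the example short.

```python
import re

# Hypothetical token names; BAD catches any character no other rule accepts.
LEX_RE = re.compile(r"(?P<INT>[0-9]+)|(?P<PLUS>\+)|(?P<WS>\s+)|(?P<BAD>.)")


def lex(text):
    """Layer 1: characters -> list of (type, text) tokens."""
    tokens = []
    for m in LEX_RE.finditer(text):
        if m.lastgroup == "BAD":
            raise ValueError(f"unexpected character {m.group()!r}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
    return tokens


def parse_sum(tokens):
    """Layer 2: tokens -> value, for the toy rule  sum : INT (PLUS INT)* ;"""
    if not tokens or tokens[0][0] != "INT":
        raise SyntaxError("expected INT")
    total = int(tokens[0][1])
    i = 1
    while i < len(tokens):
        if (tokens[i][0] != "PLUS" or i + 1 >= len(tokens)
                or tokens[i + 1][0] != "INT"):
            raise SyntaxError("expected PLUS INT")
        total += int(tokens[i + 1][1])
        i += 2
    return total


# Step 1: check the token stream on its own.
assert lex("1 + 20") == [("INT", "1"), ("PLUS", "+"), ("INT", "20")]
# Step 2: only then exercise the parser on top of it.
assert parse_sum(lex("1 + 20")) == 21
```

Because the first assertion pins down the token stream independently, a failure in the second one points at the parser layer rather than at the lexer.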