[antlr-interest] Examining characters in lexer

Jim Idle jimi at temporal-wave.com
Thu Mar 12 14:01:13 PDT 2009


Dennis Brothers wrote:
> Is there a special symbol or method that returns the character about  
> to be scanned? 
input.LA(1)
input.LA(2)

etc.
>  In order to handle a variety of (natural) languages,  
> I'd like to use Unicode categories to detect various character types  
> (particularly letters).
>
> I want to do something like
>
> fragment LETTER : { Char.IsLetter( $char ) } ?=> . ;
>
> where $char is the next character to be scanned and Char.IsLetter() is  
> a .NET method that examines a character's Unicode category and returns  
> true if it's one of the letter categories.
>
> While I'm at it, is it legal to use a gated predicate like the above  
> in a lexer?
>   
Yes, but you might find you need to finesse things so you don't create 
issues such as rules that never match and so on.

It is also fine to code the character ranges directly in the ANTLR
grammar, but you can end up with some big lexers.

However, overall, you don't want the lexer to fail, so it is better to
accept things that are not actually valid, and then check their validity
in a routine that can say "Character xx is not a valid identifier
character". Otherwise you just get

Illegal character: xxx

and that does not have enough context for a user.
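
As a rough sketch of that "accept loosely, validate afterwards" idea,
something like the following could be called on the text of a token once
the lexer has matched it. It is written in Java here; Character.isLetter
is the Java counterpart of the .NET Char.IsLetter mentioned above. The
class name, the method names, and the message format are all made up for
illustration and are not part of ANTLR.

```java
// Hypothetical helper, not part of the ANTLR runtime: checks the text of an
// already-matched token and builds an error message with some context.
final class IdentifierCheck {
    private IdentifierCheck() {}

    /** Returns the index of the first non-letter character, or -1 if all are letters. */
    static int firstInvalidIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);       // handles surrogate pairs
            if (!Character.isLetter(cp)) {      // Unicode letter categories
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    /**
     * Returns null when the text is valid, otherwise a message naming the
     * offending character and where it was found. The line and column are
     * assumed to be those of the token's first character.
     */
    static String describe(String text, int line, int column) {
        int bad = firstInvalidIndex(text);
        if (bad < 0) {
            return null;
        }
        int cp = text.codePointAt(bad);
        return String.format(
            "line %d:%d: character U+%04X is not a valid identifier character in \"%s\"",
            line, column + bad, cp, text);
    }
}
```

The lexer rule itself would then accept a deliberately generous set of
characters, and an action or a later pass would call something like
IdentifierCheck.describe(text, line, column) and report the result
through whatever error mechanism you already use.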

Jim



