[antlr-interest] Examining characters in lexer
Jim Idle
jimi at temporal-wave.com
Thu Mar 12 14:01:13 PDT 2009
Dennis Brothers wrote:
> Is there a special symbol or method that returns the character about
> to be scanned?
input.LA(1)
input.LA(2)
etc.
> In order to handle a variety of (natural) languages,
> I'd like to use Unicode categories to detect various character types
> (particularly letters).
>
> I want to do something like
>
> fragment LETTER : { Char.IsLetter( $char ) } ?=> . ;
>
> where $char is the next character to be scanned and Char.IsLetter() is
> a .NET method that examines a character's Unicode category and returns
> true if it's one of the letter categories.
>
> While I'm at it, is it legal to use a gated predicate like the above
> in a lexer?
>
Yes, but you may need to finesse things so that you don't create
problems such as rules that can never match.
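
As a rough, untested sketch of what the predicate version might look like
with the C# target: it assumes the lexer's character stream is reachable
as "input" from the action, as in the LA() calls above, and that LA()
returns an int (-1 at end of input), which is why there is a cast.

```
// Sketch only: gate on the Unicode category of the next character.
// Casting an end-of-input value of -1 to char gives U+FFFF, which
// Char.IsLetter() reports as not a letter, so the gate closes there.
fragment LETTER
    : { System.Char.IsLetter((char)input.LA(1)) }?=> .
    ;
```

Char.IsLetter(char) only sees one UTF-16 code unit at a time, so letters
outside the Basic Multilingual Plane would need separate handling.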
It is also fine to code the character ranges directly in the ANTLR
grammar, but you can end up with some big lexers.
However, overall, you don't want the lexer to fail, so it is better to
accept things that are not actually valid, and then check the validity in
a routine that can say "Character xx is not a valid identifier
character", as otherwise you just get
Illegal character: xxx
and that does not have enough context for a user.
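
One way that could look, again as an untested sketch for a combined
grammar with the C# target. CheckIdentifier and ID_CHAR are made-up names
for illustration, and a real grammar would also have to keep operators
and punctuation out of ID_CHAR.

```
@lexer::members {
    // Hypothetical helper: report the first character that is not
    // acceptable in an identifier, with the surrounding text as context.
    void CheckIdentifier(string text)
    {
        foreach (char c in text)
        {
            if (!System.Char.IsLetterOrDigit(c) && c != '_')
            {
                System.Console.Error.WriteLine(
                    "Character '" + c + "' is not a valid identifier character in \""
                    + text + "\"");
                return;
            }
        }
    }
}

// Deliberately permissive: match any run of non-whitespace characters,
// then validate in target code rather than letting the lexer fail.
ID
    : ID_CHAR+ { CheckIdentifier($text); }
    ;

fragment ID_CHAR
    : ~( ' ' | '\t' | '\r' | '\n' )
    ;
```

The point of the design is that the lexer always produces a token, and
the message comes from code that knows it was looking at an identifier.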
Jim