[antlr-interest] Re: Lexer question

Xue Yong Zhi zhixueyong at hotmail.com
Tue Dec 13 12:58:53 PST 2005


How about this:
Make vocabulary set large enough to accept any character.
Then create a lexer rule like this:
INVALID_CHARACTER
:     '\u0100'..'\uFFFE'   // everything above the '\0'..'\377' range
;
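Putting the two pieces together, the grammar fragment could look like this (a sketch assuming ANTLR 2.7+ Unicode escape syntax; the class name and the exact range boundaries are illustrative):

```
// Sketch: enlarge the vocabulary, then soak up everything
// outside the "real" character set with one rule.
class MyLexer extends Lexer;
options {
    charVocabulary = '\u0000'..'\uFFFE';  // large enough to accept any character
}

INVALID_CHARACTER
    :   '\u0100'..'\uFFFE'   // characters above the original '\0'..'\377' range
    ;
```

Because INVALID_CHARACTER is a normal (non-protected) rule, the generated lexer returns it as an ordinary token, so it stays in the token stream for the parser to handle.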

btw, I am not sure if you understand antlr well. For an antlr-generated
lexer, nextToken() is your "giant" rule.
Even though there are lots of public methods (each corresponds to a rule,
which is good for unit testing), the parser will only call nextToken().
So if you do not mind using an exception to catch invalid characters, you
do not have to do anything special.
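To illustrate the point (this is NOT ANTLR-generated code, just a hand-rolled sketch of the same shape): nextToken() is a single dispatch loop that peeks at the next character, delegates to the matching sub-rule, and can just as easily emit an INVALID_CHARACTER token for anything it does not recognize. The token types and sub-rules here are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class TinyLexer {
    static final int ID = 1, INT = 2, INVALID_CHARACTER = 3;

    // Returns tokens as {type, startIndex, endIndex} triples.
    public static List<int[]> lex(String s) {
        List<int[]> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int start = i;
            if (Character.isWhitespace(c)) { i++; continue; }   // skip, like a WS rule
            if (Character.isLetter(c)) {                        // "sub-rule" mID()
                while (i < s.length() && Character.isLetter(s.charAt(i))) i++;
                tokens.add(new int[]{ID, start, i});
            } else if (Character.isDigit(c)) {                  // "sub-rule" mINT()
                while (i < s.length() && Character.isDigit(s.charAt(i))) i++;
                tokens.add(new int[]{INT, start, i});
            } else {                                            // keep bad chars as tokens
                i++;
                tokens.add(new int[]{INVALID_CHARACTER, start, i});
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (int[] t : lex("abc 42 \u00A7"))
            System.out.println(t[0] + " " + t[1] + " " + t[2]);
    }
}
```

The real generated lexer works the same way at the nextToken() level; the only question is whether the "otherwise" branch throws an exception or builds a token.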

-- 
Xue Yong Zhi
http://seclib.blogspot.com


"Ari Steinberg" <Ari.Steinberg at EMBARCADERO.COM> 
wrote in message 
news:715057EB65FC7E47923FE408F290ADFD0104F931 at ettoma01.embarcadero.com...
Hi Guys,

Hopefully someone can help me out.  I would like my lexer to create a
special INVALID_CHARACTER token for any invalid characters it finds and
send them along to the parser so that they can be handled there.

I have my char vocabulary set to '\0'..'\377' and have the filter option
set to INVALID_CHARACTER.  This way all invalid characters (such as
Unicode characters) are matched by the filter rule.

Doing this I can report the character as an error, but I really do need
that character to still be part of the token stream (rather than
ignored).  So far the only way I've thought of to accomplish this is to
make all my rules protected and have one giant rule that matches all my
subrules, which would be a major pain.  It's either that or hack into
the lexer generator and make the filter rule create and return a token.

Anyone have any better ideas?







More information about the antlr-interest mailing list