[antlr-interest] Re: Lexer question

Ari Steinberg Ari.Steinberg at EMBARCADERO.COM
Wed Dec 14 07:42:35 PST 2005


Funny how sometimes the simplest solutions can be overlooked.  Initially
I was trying to avoid specifying a large vocabulary set, but what you
suggested works perfectly.

I know that the lexer creates the nextToken() method, but what I was
looking for was a way to specify a rule that would always be the last
else in it.  The filter rule fits this purpose, but it doesn't allow you
to return any tokens (which I need to do).

Anyway, thanks for the suggestion.  It's helped.
Ari

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Xue Yong Zhi
Sent: Tuesday, December 13, 2005 3:59 PM
To: antlr-interest at antlr.org
Subject: [antlr-interest] Re: Lexer question

How about this:
Make the vocabulary set large enough to accept any character, then
create a lexer rule like this:

INVALID_CHARACTER
:     '\u0100'..'\uFFFE'
;
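Putting the suggestion in context, a minimal ANTLR 2 lexer grammar might
look like the sketch below.  The class name and the IDENT rule are
illustrative only; the key points are widening charVocabulary and the
catch-all rule (since '\377' octal is 0xFF, the invalid range starts at
'\u0100'):

class MyLexer extends Lexer;
options {
    // Widen the vocabulary so every character is acceptable input.
    charVocabulary = '\u0000'..'\uFFFE';
}

// Ordinary rules for the real vocabulary (illustrative)...
IDENT : ('a'..'z' | 'A'..'Z' | '_')+ ;

// Catch-all: anything above the old '\377' ceiling becomes a real
// token that reaches the parser instead of being silently dropped.
INVALID_CHARACTER : '\u0100'..'\uFFFE' ;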

btw, I am not sure if you understand ANTLR well.  For an ANTLR-generated
lexer, nextToken() is your "giant" rule.  Even though there are lots of
public methods (each corresponds to a rule, which is good for unit
testing), the parser will only call nextToken().  So if you do not mind
using an exception to catch invalid characters, you do not have to do
anything special.
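To make the "nextToken() is your giant rule" point concrete, here is a
hand-rolled sketch in plain Python (this is not ANTLR-generated code,
and the rule names and patterns are assumptions): the dispatch loop
tries each rule in turn, and the catch-all branch is the "last else"
that emits an INVALID_CHARACTER token instead of discarding the
character.

```python
import re

# Hypothetical token rules; names and patterns are illustrative only.
RULES = [
    ("IDENT",  re.compile(r"[A-Za-z_]\w*")),
    ("NUMBER", re.compile(r"\d+")),
    ("WS",     re.compile(r"\s+")),
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                if name != "WS":  # skip whitespace like a skip rule
                    tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            # The "last else": emit an INVALID_CHARACTER token so the
            # parser still sees the offending character in the stream.
            tokens.append(("INVALID_CHARACTER", text[pos]))
            pos += 1
    return tokens
```

For example, tokenize("ab 12 $") yields IDENT and NUMBER tokens plus an
INVALID_CHARACTER token for the '$', which is exactly the behavior the
original poster wanted from the filter rule.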

-- 
Xue Yong Zhi
http://seclib.blogspot.com


"Ari Steinberg" <Ari.Steinberg at EMBARCADERO.COM> 
wrote in message 
news:715057EB65FC7E47923FE408F290ADFD0104F931 at ettoma01.embarcadero.com..
.
Hi Guys,

Hopefully someone can help me out.  I would like my lexer to create a
special INVALID_CHARACTER token for any invalid characters it finds and
send them along to the parser so that they can be handled there.

I have my char vocabulary set to '\0'..'\377' and have the filter option
set to INVALID_CHARACTER.  This way all invalid characters (such as
Unicode characters) are matched by the filter rule.

Doing this I can report the character as an error, but I really need
that character to still be part of the token stream (rather than
ignored).  So far the only way I've thought of to accomplish this is to
make all my rules protected and have one giant rule that matches all my
subrules, which would be a major pain.  It's either that or hack the
lexer generator to make the filter rule create and return a token.

Anyone have any better ideas?






More information about the antlr-interest mailing list