[antlr-interest] Newbie questions about lexer
Matthew Keene
dfg778 at yahoo.com.au
Tue Jul 19 17:32:26 PDT 2005
I have just started playing with ANTLR and I'm trying
to write a grammar to parse some simple mathematical
expressions for a project that I'm working on. There
are a number of these around, including ones in the
tutorials, so I used one of these as my starting point,
as shown below:
class MattLexer extends Lexer;
options {
charVocabulary = '\0'..'\377';
//tokenVocabulary=XL3;  // call the vocabulary "Java"
//testLiterals=false;   // don't automatically test for literals
k=3;                    // three characters of lookahead
}
INT : ('0'..'9')+ ;
WS
: ( ' '
| '\t'
| '\f'
)
{ $setType(Token.SKIP); }
;
PLUS : '+' ;
MINUS : '-' ;
TIMES : '*' ;
DIV : '/' ;
OPENBR : '(' ;
CLOSEBR : ')' ;
IDENT
options {testLiterals=true;}
: ('a'..'z'|'A'..'Z')
('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'='|'.'|' ')*
;
This worked fine, except that a number of the already
existing expressions use some of the operator
characters (e.g. '-' and '+') in their names, such as
this:
LP_Optimax_FOB_Clyde_T1 + (
SH_Singapore_Sydney_BMAX_Clean_WSesc_Sing-Jap_CS=VS *
Constant_0.55 ) + (
SH_Singapore_Geelong_BMAX_Clean_WSesc_Sing-Jap_CS=VS *
Constant_0.45 ) + Wharfage_Clyde Refinery
The lexer thought (quite rightly) that the identifier
SH_Singapore_Sydney_BMAX_Clean_WSesc_Sing-Jap_CS=VS
was actually three tokens: an identifier called
SH_Singapore_Sydney_BMAX_Clean_WSesc_Sing, an operator
(-) and an identifier called Jap_CS=VS.
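The split can be reproduced outside ANTLR with a rough
sketch in plain Java. The regular expression below is
only an approximation of the INT, IDENT and operator
rules above, and the class and method names are made up
for illustration; it is not the code ANTLR generates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the grammar or of ANTLR.
final class GreedySplitSketch {
    // Approximate regex equivalents of INT, IDENT and the
    // single-character operator rules from the grammar.
    private static final Pattern TOKEN = Pattern.compile(
        "[0-9]+|[a-zA-Z][a-zA-Z0-9_=. ]*|[-+*/()]");

    // Returns the matched token texts in input order.
    // Characters that match no alternative are skipped.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

Because '-' is not in the IDENT character set, calling
GreedySplitSketch.tokenize("WSesc_Sing-Jap_CS=VS") ends
the first identifier at "WSesc_Sing" and yields the '-'
and "Jap_CS=VS" as separate tokens.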
The obvious answer to this is to disallow the use of
operator characters in the identifier names, but
before I made this a rule and forced the users to
change their existing expressions, I thought that we
should be able to disambiguate this situation by
making a rule that the actual operators have to be
surrounded by whitespace. This should be easy, I
thought. Unfortunately this has proven not to be the
case. My initial try was to redefine the operators
like so:
PLUS : " + " ;
MINUS : " - " ;
TIMES : " * " ;
DIV : " / " ;
OPENBR : " ( " ;
CLOSEBR : " ) " ;
This resulted in the message "line 1:25: unexpected
char: '+'", and by looking at the generated Java code,
I understand why: the character '+' no longer has a
match in the case statement which parses the character
stream.
Now I have tried about a million different ways to get
this to work without success, and without boring you
with every combination I've tried I have come to the
conclusion that there's obviously something
fundamental I'm missing here. I need to allow the
characters '+', '-' etc. as valid characters in the
identifier names, but only to recognise them as
operators when they're surrounded by whitespace. Can
somebody please give me some pointers as to how I
should approach this, or where I can look for some
easy to follow documentation about why I'm having such
trouble, or, if this is not possible, explain why?
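To make the behaviour I'm after concrete, here is a
minimal sketch in plain Java, written by hand outside
ANTLR. It is only an illustration of the rule, not a
proposed lexer: the class and method names are made up,
it does not distinguish INT from IDENT, and it assumes
that an operator at the very start or end of the input
also counts as "surrounded by whitespace".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the grammar or of ANTLR.
final class WhitespaceOperatorSketch {
    // Characters that count as operators only when they
    // stand alone between whitespace.
    private static final String OPERATORS = "+-*/()";

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder ident = new StringBuilder();
        // Split on runs of whitespace. A chunk that is exactly
        // one operator character is an operator; any other chunk
        // extends the current identifier, so names may contain
        // operator characters and single embedded spaces.
        for (String chunk : input.trim().split("\\s+")) {
            boolean isOperator = chunk.length() == 1
                && OPERATORS.indexOf(chunk.charAt(0)) >= 0;
            if (isOperator) {
                if (ident.length() > 0) {
                    tokens.add(ident.toString());
                    ident.setLength(0);
                }
                tokens.add(chunk);
            } else {
                if (ident.length() > 0) {
                    ident.append(' ');
                }
                ident.append(chunk);
            }
        }
        if (ident.length() > 0) {
            tokens.add(ident.toString());
        }
        return tokens;
    }
}
```

With this rule, "WSesc_Sing-Jap_CS=VS * Constant_0.55"
comes out as one identifier, the operator '*' and a
second identifier, and a name with an embedded space
such as "Wharfage_Clyde Refinery" stays in one piece.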
Thanks in advance for your help
Matthew