[antlr-interest] Newbie questions about lexer

Tue Jul 19 17:32:26 PDT 2005

I have just started playing with ANTLR and I'm trying
to write a grammar to parse some simple mathematical
expressions for a project that I'm working on.  There
are a number of these around, including ones in the
tutorials so I used one of these as my starting point,
as shown below

lass MattLexer extends Lexer ;

options {
  charVocabulary = '\0'..'\377';
  //tokenVocabulary=XL3;  // call the vocabulary
"Java"
  //testLiterals=false;    // don't automatically test
for literals
  k=3;                   // two characters of
lookahead
}

INT : ('0'..'9')+ ;

WS
  : ( ' '
    | '\t'
    | '\f'
    )
    { $setType(Token.SKIP); } 
  ;

PLUS    : '+' ;
MINUS   : '-' ;
TIMES   : '*' ;
DIV     : '/' ;
OPENBR  : '(' ;
CLOSEBR : ')' ;

IDENT
  options {testLiterals=true;}
  : ('a'..'z'|'A'..'Z')
('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'='|'.'|' ')* 
  ;

This worked fine, except I found a problem in that a
number of the already existing expressions use some of
the operator characters (ie '-', '+' etc) in their
names, such as this:

LP_Optimax_FOB_Clyde_T1 + (
SH_Singapore_Sydney_BMAX_Clean_WSesc_Sing-Jap_CS=VS *
Constant_0.55 ) + (
SH_Singapore_Geelong_BMAX_Clean_WSesc_Sing-Jap_CS=VS *
Constant_0.45 ) + Wharfage_Clyde Refinery

The lexer thought (quite rightly) that the identifier
SH_Singapore_Sydney_BMAX_Clean_WSesc_Sing-Jap_CS=VS
was actually an identifier called
SH_Singapore_Sydney_BMAX_Clean_WSesc_Sing, an operator
(-) and an identifier called Jap_CS=VS

The obvious answer to this is to disallow the use of
operator characters in the identifier names, but
before I made this as a rule and forced the users to
change their existing expressions, I thought that we
should be able to disambiguate this situation by
making a rule that the actual operators have to be
surrounded by whitespace.  This should be easy I
thought.  Unfortunately this has proven not to be the
case.  My initial try was to redefine the operators
like so:

PLUS    : " + " ;
MINUS   : " - " ;
TIMES   : " * " ;
DIV     : " / " ;
OPENBR  : " ( " ;
CLOSEBR : " ) " ;

This resulted in the message "line 1:25: unexpected
char: '+'", and by looking at the generated Java code,
I understand why, as the token + no longer has a match
in the case statement which parses the character
stream.

Now I have tried about a million different ways to get
this to work without success, and without boring you
with every combination I've tried I have come to the
conclusion that there's obviously something
fundamental I'm missing here.  I need to allow the
characters +,- etc as valid characters in the
expression, but only to recognise them as operators
when they're surrounded by whitespace.  Can somebody
please give me some pointers as to how I should
approach this, or where I can look for some easy to
follow documentation about why I'm having such
trouble, or if this is not possible explain why ?

Thanks in advance for your help

Matthew

Send instant messages to your online friends http://au.messenger.yahoo.com