[antlr-interest] Re: second lexical pass - yeah

Mon Apr 12 13:52:22 PDT 2004

Here is something that worked for me.
Now that I look at it, I cannot remember why the ESC rule is the way it is.
matthew
P.S. in my case  DATE and TIME were separated by white space

// Whitespace -- ignored
TAB_FORMFEED
  : ( '\t'
  | '\f'
  )
  {  _ttype = Token.SKIP; }
 ;

// Whitespace -- ignored
SPACE
  : ' '
  {  _ttype = Token.SKIP; }
 ;

NEWLINE
  : ( /* '\r' '\n' can be matched in one alternative or by matching
    '\r' in one iteration and '\n' in another.  I am trying to
    handle any flavor of newline that comes in, but the language
    that allows both "\r\n" and "\r" and "\n" to all be valid
    newline is ambiguous.  Consequently, the resulting grammar
    must be ambiguous.  I'm shutting this warning off.
    */
    options {
    generateAmbigWarnings=false;
   }
   : '\r' '\n'   // DOS
   | '\r'  // Macintosh
   | '\n' // Unix
   )  {newline();}
  { if (skipNL) {   // skip NL if skipNEWLINE() was called
    _ttype = Token.SKIP;
   }
  }
;
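
The comment in NEWLINE describes three newline flavors that overlap ("\r\n" versus a lone "\r" followed by a lone "\n"). As a rough illustration of the resolution the rule relies on, here is a minimal stand-alone Java sketch that counts line breaks by always preferring the two-character form. The class and method names are hypothetical and are not part of the grammar or of ANTLR.

```java
// Hypothetical stand-alone sketch; not part of the grammar or of ANTLR.
class NewlineSketch {
    // Count line breaks, treating "\r\n" (DOS), "\r" (Macintosh) and "\n" (Unix)
    // each as a single newline. "\r\n" is checked first so it is not counted twice.
    static int countNewlines(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // Swallow a directly following '\n' as part of the same newline.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                count++;
            } else if (c == '\n') {
                count++;
            }
        }
        return count;
    }
}
```

Trying the longest alternative first is what makes the ambiguity harmless here, which is presumably why switching the warning off is safe.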

// white space is skipped by the parser
protected
WS_SET : ( ' '
  | '\t' | '\f'
  | NEWLINE
  | ML_COMMENT
  | SL_COMMENT
  )+
  {$setType(Token.SKIP);}  // way to set token type
 ;

// Single-line comments
SL_COMMENT
 : "//"
  (~('\n'|'\r'))* NEWLINE
  {$setType(Token.SKIP);}
 ;

// multiple-line comments these are also skipped
ML_COMMENT
 : "/*"
  ( // suppress warnings about * /
   options {
    greedy = true;
   }
  : { LA(2)!='/' }? '*'
  | ~('*'|'\n'|'\r')
  | NEWLINE
  )*
  "*/"
  {$setType(Token.SKIP);}
 ;

// character literals SINGLE QUOTES around string
//CHAR_LITERAL
// : '\''! (ESC|~('\''|'\\'))* '\''!
// ;

// note must have WS between strings because "" is used for " inside string
STRING_LITERAL
  : SL_STRING_LITERAL //(WS_SET! SL_STRING_LITERAL)*
      // string concat does not work because it tries to concat anything
      // after a string + WS_SET.
  ;

 // string literals DOUBLE QUOTES around string
 // "" => " inside double quotes.
 // can also use \"
protected
SL_STRING_LITERAL
{int i = 0;}
 : '"'! (ESC|~('"'|'\\'|'\n'))* ('"''"'! ((ESC|~('"'|'\\'|'\n')))*)*
  ('"'!
  |'\n'
   {
     if (i==0) {
      throw new TokenStreamRecognitionException(
        new RecognitionException("found newline inside string:'"+$getText+"'",
          getFilename(), getLine()));
    }
   }
  )
 ;

// escape sequence -- note that this is protected; it can only be called
//   from another lexer rule -- it will not ever directly return a token to
//   the parser
// There are various ambiguities hushed in this rule.  The optional
// '0'...'9' digit matches should be matched here rather than letting
// them go back to STRING_LITERAL to be matched.  ANTLR does the
// right thing by matching immediately; hence, it's ok to shut off
// the FOLLOW ambig warnings.
protected
ESC
 : '\\'
  ( 'n' { $setText("\n");}
  | 'r' { $setText("\r");}
  | 't' { $setText("\t");}
  | 'b' { $setText("\b");}
  | 'f' { $setText("\f");}
  | '"' { $setText("\"");}
  | '\'' { $setText("\'");}
  | '\\' { $setText("\\");}
  | ('u')+ HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
  | ('0'..'3')
   (
    options {
     warnWhenFollowAmbig = false;
    }
   : ('0'..'9')
    (
     options {
      warnWhenFollowAmbig = false;
     }
    : '0'..'9'
    )?
   )?
  | ('4'..'7')
   (
    options {
     warnWhenFollowAmbig = false;
    }
   : ('0'..'9')
   )?
  )
 ;
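
As far as the ESC rule can be read, the $setText actions replace each single-character escape with the character it denotes, while the unicode and numeric alternatives carry no action and so keep their matched text. A minimal stand-alone Java sketch of that translation follows, under that reading. EscapeSketch and unescape are hypothetical names, not something from the grammar or from ANTLR.

```java
// Hypothetical stand-alone sketch; not part of the grammar or of ANTLR.
class EscapeSketch {
    // Replace the single-character escapes that have a $setText action in ESC
    // with the characters they denote.
    static String unescape(String raw) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c != '\\' || i + 1 >= raw.length()) {
                out.append(c);          // ordinary character, or a trailing backslash
                continue;
            }
            char next = raw.charAt(++i);
            switch (next) {
                case 'n':  out.append('\n'); break;
                case 'r':  out.append('\r'); break;
                case 't':  out.append('\t'); break;
                case 'b':  out.append('\b'); break;
                case 'f':  out.append('\f'); break;
                case '"':  out.append('"');  break;
                case '\'': out.append('\''); break;
                case '\\': out.append('\\'); break;
                default:
                    // Assumption: unicode and numeric escapes have no action in
                    // the rule, so they are kept verbatim here.
                    out.append('\\').append(next);
            }
        }
        return out.toString();
    }
}
```

For example, the six characters `\u0041` pass through unchanged, whereas `\t` becomes a single tab character.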

// hexadecimal literal (again, note it's protected!)
protected
HEX
 : '0' 'x' (HEX_DIGIT)+
 ;

protected
HEX_DIGIT
 : ('0'..'9'|'a'..'f')
 ;

// an identifier.  Note that testLiterals is set to true!  This means
// that after we match the rule, we look in the literals table to see
// if it's a literal or really an identifier
IDENT
 options {testLiterals=true;
     paraphrase = "an identifier";}
 : ('a'..'z'|'_'|'$') ('a'..'z'|'_'|'0'..'9'|'$')*
 ;

protected
DIGIT
 : ('0'..'9')
 ;

// a numeric literal
protected
INT
 : (DIGIT)+
 ;

// signed int
//protected
//SIGNED_INT
// : ('+'|'-')?INT
// ;

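// a numeric literal: date, time, float, hex or int
// HEX must be tried before the unpredicated INT alternative; otherwise INT
// matches the leading '0' of "0x..." and the HEX alternative is never reached.
DATE_TIME_INT_FLOAT
 : (DIGIT DIGIT DIGIT DIGIT '/') => DATE {_ttype = DATE;}
 | (DIGIT (DIGIT)? ':') => TIME {_ttype = TIME;}
 | (INT '.') => FLOAT {_ttype = FLOAT;}
 | ('0' 'x')=> HEX {_ttype = HEX;}
 | INT {_ttype = INT;}
 ;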

protected
FLOAT
 : INT '.' INT (EXPONENT)?
 ;

// signed float
//protected
//SIGNED_FLOAT
// : SIGNED_INT '.' INT (EXPONENT)?
// ;

protected
DATE
 : DIGIT DIGIT DIGIT DIGIT '/'
   DIGIT (DIGIT)? '/' DIGIT (DIGIT)?
 ;

protected
TIME
 : DIGIT (DIGIT)? ':'
   DIGIT (DIGIT)? ':'
   DIGIT (DIGIT)? ('.' INT)?
 ;

// need to add floating point
// a couple protected methods to assist in matching floating point numbers
protected
EXPONENT
 : ('e') ('+'|'-')? ('0'..'9')+
 ;

// should be in parser also duration
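
The DATE_TIME_INT_FLOAT rule decides which token it is looking at from a short prefix (four digits and a slash, one or two digits and a colon, and so on) and only then matches the full shape. A minimal stand-alone Java sketch of the same idea, using regular expressions in place of syntactic predicates, might look like this. NumericTokenSketch and classify are hypothetical names, and the sketch checks the "0x" prefix before falling back to INT because a plain digit test would otherwise claim the leading 0.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not part of the grammar or of ANTLR.
class NumericTokenSketch {
    // Cheap prefix checks, playing the role of the syntactic predicates.
    private static final Pattern DATE_PREFIX  = Pattern.compile("\\d{4}/");
    private static final Pattern TIME_PREFIX  = Pattern.compile("\\d{1,2}:");
    private static final Pattern FLOAT_PREFIX = Pattern.compile("\\d+\\.");
    private static final Pattern HEX_PREFIX   = Pattern.compile("0x");

    // Full token shapes, mirroring the protected DATE, TIME, FLOAT, HEX and INT rules.
    private static final Pattern DATE  = Pattern.compile("\\d{4}/\\d{1,2}/\\d{1,2}");
    private static final Pattern TIME  = Pattern.compile("\\d{1,2}:\\d{1,2}:\\d{1,2}(\\.\\d+)?");
    private static final Pattern FLOAT = Pattern.compile("\\d+\\.\\d+(e[+-]?\\d+)?");
    private static final Pattern HEX   = Pattern.compile("0x[0-9a-f]+");
    private static final Pattern INT   = Pattern.compile("\\d+");

    // Once a prefix matches, the choice is committed: a bad remainder is an
    // error, not a reason to try the next alternative.
    static String classify(String text) {
        if (DATE_PREFIX.matcher(text).lookingAt()) {
            return DATE.matcher(text).matches() ? "DATE" : "ERROR";
        }
        if (TIME_PREFIX.matcher(text).lookingAt()) {
            return TIME.matcher(text).matches() ? "TIME" : "ERROR";
        }
        if (FLOAT_PREFIX.matcher(text).lookingAt()) {
            return FLOAT.matcher(text).matches() ? "FLOAT" : "ERROR";
        }
        if (HEX_PREFIX.matcher(text).lookingAt()) {
            return HEX.matcher(text).matches() ? "HEX" : "ERROR";
        }
        return INT.matcher(text).matches() ? "INT" : "ERROR";
    }
}
```

Keeping the prefix checks short is the point: the lookahead is bounded by a handful of characters rather than by the length of the whole token.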

----- Original Message ----- 
From: "Terence Parr" <parrt at cs.usfca.edu>
To: <antlr-interest at yahoogroups.com>
Sent: Tuesday, April 13, 2004 5:11 AM
Subject: Re: [antlr-interest] Re: second lexical pass - yeah

>
> On Apr 12, 2004, at 11:50 AM, FranklinChen at cmu.edu wrote:
> > Is there a reason not to just do this:
> >
> > protected DATE: ...
> > protected STRING_LITERAL: ...
> >
> > DATE_OR_STRING_LITERAL
> >     : (DATE) => DATE { $setType(DATE); }
> >     | STRING_LITERAL { $setType(STRING_LITERAL); }
> >     ;
>
> This will be much slower due to the predicate for large input, but
> might work.
>
> > By the way, has anyone been seeing my messages to this list?  I've
> > posted three or four separate messages on different questions I've had
> > (one related to lexer issues).
>
> I'm behind in email as usual...sorry.
> Ter
> --
> Professor Comp. Sci., University of San Francisco
> Creator, ANTLR Parser Generator, http://www.antlr.org
> Cofounder, http://www.jguru.com
> Cofounder, http://www.knowspam.net enjoy email again!
> Cofounder, http://www.peerscope.com pure link sharing
