[antlr-interest] Number tokenizer vs. number grammar

Sun Nov 16 09:38:15 PST 2008

Thanks for that advice. I think I figured it out. I created a
tokenizer that grabs all the possible number sequences, then I have a
parser rule that, if it receives a RATIONAL or COMPLEX number, splits
it appropriately and sends each part back through the number rule to
correctly parse each part.

Here are the lexical rules, if anyone is interested.

fragment DIGIT : '0'..'9';
POS_INT : DIGIT+;
NEG_INT : '-' POS_INT;
POS_DECIMAL : DIGIT+ '.' DIGIT* | '.' DIGIT+;
NEG_DECIMAL : '-' POS_DECIMAL;
POS_RATIONAL : POS_INT '/' (POS_INT | NEG_INT);
NEG_RATIONAL : '-' POS_RATIONAL;
fragment POS_REAL : POS_DECIMAL | POS_INT | POS_RATIONAL;
fragment NEG_REAL : NEG_DECIMAL | NEG_INT | NEG_RATIONAL;

COMPLEX : (POS_REAL | NEG_REAL) '+' POS_REAL? 'i'
  | (POS_REAL | NEG_REAL) (NEG_REAL | '-') 'i'
  ;

WS : (' ' | '\t' | '\n' | '\r') { $channel = HIDDEN; }
  ;

And here's how the number rule handles a couple of examples. (I
translated to Java for the sake of being most useful to most people,
so there may be some typos.)

number returns [MyNumber value]
  : (POS_INT | NEG_INT) { $value = new MyInt($text); }
  | COMPLEX {
      // Split at the last sign. This assumes the imaginary part contains
      // no '-' of its own (e.g. no negative denominator such as 2/-3).
      int sep = Math.max($text.lastIndexOf('+'), $text.lastIndexOf('-'));
      String real = $text.substring(0, sep);
      // Drop the trailing 'i' so the imaginary part lexes as a plain number.
      String imag = $text.substring(sep + 1, $text.length() - 1);
      if (imag.equals("")) {
          imag = "1";
      }
      if ($text.charAt(sep) == '-') {
          imag = "-" + imag;
      }
      NumberParser realParser = new NumberParser(
          new CommonTokenStream(new NumberLexer(new ANTLRStringStream(real))));
      NumberParser imagParser = new NumberParser(
          new CommonTokenStream(new NumberLexer(new ANTLRStringStream(imag))));
      $value = new Complex(realParser.number().getValue(),
                           imagParser.number().getValue());
    }
  ;
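
For anyone who wants to try the splitting step outside of ANTLR, here is
a minimal plain-Java sketch of it. The class name ComplexSplit and the
method split are made up for illustration and are not part of the
grammar above. The sketch assumes the input already has the shape of a
COMPLEX token and that the imaginary part contains no '-' of its own.

```java
class ComplexSplit {
    /**
     * Splits a complex literal into its real and imaginary parts.
     * Returns a two-element array: { real, imag }, with the trailing
     * 'i' removed and the sign folded into the imaginary part.
     */
    static String[] split(String text) {
        // The last '+' or '-' is taken as the separator between the parts.
        int sep = Math.max(text.lastIndexOf('+'), text.lastIndexOf('-'));
        String real = text.substring(0, sep);
        // Everything after the separator, minus the trailing 'i'.
        String imag = text.substring(sep + 1, text.length() - 1);
        if (imag.isEmpty()) {
            imag = "1"; // "3+i" means an imaginary part of 1
        }
        if (text.charAt(sep) == '-') {
            imag = "-" + imag;
        }
        return new String[] { real, imag };
    }

    public static void main(String[] args) {
        for (String s : new String[] { "3+2i", "2/3-4i", "-1.5+i", "7-i" }) {
            String[] parts = split(s);
            System.out.println(s + " -> real=" + parts[0] + ", imag=" + parts[1]);
        }
    }
}
```

Each of the two strings can then be fed back through the number rule,
which is what the COMPLEX alternative above does with a fresh lexer and
parser per part.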

I'm still having problems trying to include this grammar in another
grammar, but that's the subject of a forthcoming email.

Thanks again!
Todd

On Sun, Nov 16, 2008 at 1:30 AM, Gavin Lambert <antlr at mirality.co.nz> wrote:
> At 09:50 16/11/2008, Todd O'Bryan wrote:
>>Assume that both 2 * 3+2i and 2*3+2i should lex as NUMBER OP
>>NUMBER. What does that determine about my possible approaches? :-)
>
> It implies that you're going to experience pain with "2+3+2i" (or "2/3+2i",
> for that matter, given that you've already said that this ought to be a
> single NUMBER).  :)
>
> If you can require that whitespace is significant (ie. "2 / 3+2i" is two
> NUMBERs and a division, but "2/3+2i" is a single NUMBER, and "2 /3+2i" is
> simply illegal), then probably the simplest way to deal with this (and avoid
> duplication) is to define NUMBER as any sequence with a leading digit and
> any combination of digits and operators afterwards, with no whitespace:
>
> fragment DIGIT : '0'..'9';
> NUMBER : '-'? '.'? DIGIT (DIGIT | '+' | '-' | '/' | '.' | 'i')* ;
>
> This will of course be able to match invalid constructs as well, but you can
> deal with that at the parser / tree parser / driver code level (which
> permits better error messages anyway).
>
>
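
The "validate it later" idea in the quoted suggestion could be sketched
in driver code roughly as follows. This is only an illustration: the
class NumberCheck, the method classify, and the regular expressions are
assumptions, written to mirror the lexer rules at the top of this
message rather than taken from any ANTLR API. The permissive NUMBER
token is accepted by the lexer as-is, and its text is checked afterwards
so that a specific error can be reported.

```java
import java.util.regex.Pattern;

class NumberCheck {
    // Unsigned decimal, rational, or integer, mirroring POS_DECIMAL,
    // POS_RATIONAL and POS_INT from the lexer rules.
    private static final String UNSIGNED =
        "(?:\\d+\\.\\d*|\\.\\d+|\\d+/-?\\d+|\\d+)";
    private static final Pattern REAL =
        Pattern.compile("-?" + UNSIGNED);
    // A real part, a sign, an optional unsigned imaginary part, then 'i'.
    private static final Pattern COMPLEX =
        Pattern.compile("-?" + UNSIGNED + "[+-]" + UNSIGNED + "?i");

    /** Returns "REAL", "COMPLEX", or "INVALID" for the text of a NUMBER token. */
    static String classify(String token) {
        if (REAL.matcher(token).matches()) {
            return "REAL";
        }
        if (COMPLEX.matcher(token).matches()) {
            return "COMPLEX";
        }
        return "INVALID";
    }

    public static void main(String[] args) {
        for (String s : new String[] { "2/3+2i", "2+3+2i", "1.5", "1..2" }) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

With these patterns "2/3+2i" is classified as COMPLEX, while "2+3+2i"
is rejected as INVALID, which is the point at which a parser or driver
could emit a more helpful message than the lexer would.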