[antlr-interest] adding Unicode identifiers confuses grammar

David J. Biesack David.Biesack at sas.com
Mon Sep 14 08:36:59 PDT 2009


I'm working on a grammar for an AMPL-like language (see an extracted simplified
version below). It works fine (ANTLR 3.1.3) when I use the following token
definition for identifiers:

ID
  :
  ('a'..'z'|'A'..'Z'|'_'|'$') ('a'..'z'|'A'..'Z'|'_'|'0'..'9'|'$')*
  ;

but when I copy the token fragments for Unicode identifiers from 
http://openjdk.java.net/projects/compiler-grammar/antlrworks/Java.g
and change my ID rule to use them:

ID 
  :
  IdentifierStart IdentifierPart*
  ;

I get many warnings about disabled tokens, plus one error. Here are a few (the full output is listed below):

    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'o'": OR, ORDERED, ID

    As a result, token(s) ORDERED,ID were disabled for that input
    ...
    warning(209): AMPL.g:75:1: Multiple token rules can match input such as "':'": ASSIGN, COLON

    As a result, token(s) COLON were disabled for that input
    ...
    error(208): AMPL.g:852:1: The following token definitions can never be matched because prior tokens match the same input: BINARY,CIRCULAR,CROSS,DEFAULT,DIFF,DIMENSION,ELSE,INT,INTEGER,INTER,INTERVAL,LIST,LONG,MAXIMIZE,MIN,MINIMIZE,ORDERED,PROD,SUBJECT,SYMBOLIC,SYMDIFF,SUM,TO,WHERE,WITHIN,LE,GE,COLON,MultiLineComment

For example, the colon ':' is '\u003a', which is not in IdentifierStart, so I don't see how
the new ID rule can cause an ambiguity between the ASSIGN and COLON tokens (':=' and ':').
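
The claim about ':' can be sanity-checked outside ANTLR. The sketch below assumes that the IdentifierStart and IdentifierPart fragments in Java.g follow the JDK's Character.isJavaIdentifierStart and Character.isJavaIdentifierPart; that correspondence is an assumption, and the class name ColonCheck is made up for illustration.

```java
public class ColonCheck {
    // Assumption: Java.g's IdentifierStart mirrors this JDK predicate.
    static boolean isStart(int codePoint) {
        return Character.isJavaIdentifierStart(codePoint);
    }

    // Assumption: Java.g's IdentifierPart mirrors this JDK predicate.
    static boolean isPart(int codePoint) {
        return Character.isJavaIdentifierPart(codePoint);
    }

    public static void main(String[] args) {
        // Operator characters from the tokens section, plus a few identifier characters.
        int[] samples = { ':', '=', '<', '>', 'o', '_', '$', '9' };
        for (int cp : samples) {
            System.out.printf("U+%04X start=%b part=%b%n", cp, isStart(cp), isPart(cp));
        }
    }
}
```

Under that assumption, ':' (U+003A) reports false for both predicates, so an ID built from these fragments should never begin with, or contain, a colon.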

(not all the tokens in the grammar below are used in this excerpt; they are used
in the longer grammar, though)

I cannot find how to resolve this. Any help?

Here is the initial grammar that works:

grammar AMPL;

tokens {

   AND                       = 'and';
   BY                        = 'by';
   BINARY                    = 'binary';
   CHAR                      = 'char';
   CIRCULAR                  = 'circular';
   CROSS                     = 'cross';
   DATA                      = 'data';
   DEFAULT                   = 'default';
   DIFF                      = 'diff';
   DIMENSION                 = 'dimension';
   END                       = 'end';
   ELSE                      = 'else';
   FLOAT                     = 'float';
   IN                        = 'in';
   INT                       = 'int';
   INTEGER                   = 'integer';
   INTER                     = 'inter';
   INTERVAL                  = 'interval';
   LET                       = 'let';
   LIST                      = 'list';
   LONG                      = 'long';
   MAX                       = 'max';
   MAXIMIZE                  = 'maximize';
   MIN                       = 'min';
   MINIMIZE                  = 'minimize';
   NOT                       = 'not';
   OR                        = 'or';
   ORDERED                   = 'ordered';
   PARAM                     = 'param';
   PROD                      = 'prod';
   REVERSED                  = 'reversed';
   SET                       = 'set';
   SUBJECT                   = 'subject';
   SYMBOLIC                  = 'symbolic';
   SYMDIFF                   = 'symdiff';
   SUM                       = 'sum';
   THEN                      = 'then';
   TO                        = 'to';
   UNION                     = 'union';
   WHEN                      = 'when';
   WHERE                     = 'where';
   WITHIN                    = 'within';
   VAR                       = 'var';
   XOR                       = 'xor';

   LBRACE                    = '{';
   RBRACE                    = '}';
   LPAREN                    = '(';
   RPAREN                    = ')';
   LBRACKET                  = '[';
   RBRACKET                  = ']';
   DQUOTE                    = '\"';
   SQUOTE                    = '\'';
   COMMA                     = ',';
   SEMI                      = ';';
   TIMES                     = '*';
   MDOT                      = '·';
   DIVIDE                    = '/';
   RANGE                     = '..';
   ASSIGN                    = ':=';
   EQ                        = '=';
   NE                        = '!=';
   LT                        = '<';
   GT                        = '>';
   LE                        = '<=';
   GE                        = '>=';
   CONCAT                    = '||';
   PLUS                      = '+';
   MINUS                     = '-';
   COLON                     = ':';

}


@parser::header {
package com.sas.test.antlr;
}

@lexer::header {
package com.sas.test.antlr;
}

formulation
  :
  declaration *
  ;

declaration
  :
  ( set_binding
  | var_binding
  | param_binding
  )
  ;

set_binding
  : SET identifier SEMI
  ;

var_binding
  : VAR identifier var_attributes? SEMI
  ;

var_attributes
  : var_attribute (COMMA var_attributes )?
  ;

var_attribute
  :
  ( INTEGER | BINARY )
  ;

param_binding
  : PARAM identifier param_attributes? SEMI
  ;

param_attributes
  : param_attribute (COMMA param_attributes)?
  ;

param_attribute
  :
  ( INTEGER | BINARY | SYMBOLIC )
  ;

identifier
  :
  ID
  ;

// =============== Lexical Rules ===================

/**
 * A simple identifier such as A or x or stock
 */
ID         // options
  :
  // IdentifierStart IdentifierPart*
  ('a'..'z'|'A'..'Z'|'_'|'$') ('a'..'z'|'A'..'Z'|'_'|'0'..'9'|'$')*
  ;

// fragment
// IdentifierStart
//     :   .... // see http://openjdk.java.net/projects/compiler-grammar/antlrworks/Java.g
//     ;
//
// fragment
// IdentifierPart
//     :   ... // see http://openjdk.java.net/projects/compiler-grammar/antlrworks/Java.g
//     ;

/**
 * Whitespace
 */
WS
  :
  ( ' '
  | '\t'
  | '\f'
  | ('\n'|'\r'('\n')?)
  )+   { $channel=HIDDEN; }

  ;

SingleLineComment
  : ('//' | '#') (~('\n'|'\r'))* ('\n'|'\r'('\n')?)?   { $channel=HIDDEN; }
  ;

MultiLineComment
 : '/*' ( options {greedy=false;} : . )* '*/'   { $channel=HIDDEN; }
 ;


and here are all the ANTLR warnings and errors I get from 3.1.3:

    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'b'": BY, BINARY, ID

    As a result, token(s) BINARY,ID were disabled for that input
    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'x'": XOR, ID

    As a result, token(s) ID were disabled for that input
    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'r'": REVERSED, ID

    As a result, token(s) ID were disabled for that input
    warning(209): AMPL.g:70:1: Multiple token rules can match input such as "'<'": LT, LE

    As a result, token(s) LE were disabled for that input
    warning(209): AMPL.g:852:1: Multiple token rules can match input such as "'/'": DIVIDE, SingleLineComment, MultiLineComment

    As a result, token(s) SingleLineComment,MultiLineComment were disabled for that input
    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'u'": UNION, ID

    As a result, token(s) ID were disabled for that input
    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'l'": LET, LIST, LONG, ID

    As a result, token(s) LIST,LONG,ID were disabled for that input
    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'i'": IN, INT, INTEGER, INTER, INTERVAL, ID

    As a result, token(s) INT,INTEGER,INTER,INTERVAL,ID were disabled for that input
    warning(209): AMPL.g:71:1: Multiple token rules can match input such as "'>'": GT, GE

    As a result, token(s) GE were disabled for that input
    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'v'": VAR, ID

    As a result, token(s) ID were disabled for that input
    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'d'": DATA, DEFAULT, DIFF, DIMENSION, ID

    As a result, token(s) DEFAULT,DIFF,DIMENSION,ID were disabled for that input
    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'m'": MAX, MAXIMIZE, MIN, MINIMIZE, ID

    As a result, token(s) MAXIMIZE,MIN,MINIMIZE,ID were disabled for that input
    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'e'": END, ELSE, ID

    As a result, token(s) ELSE,ID were disabled for that input
    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'o'": OR, ORDERED, ID

    As a result, token(s) ORDERED,ID were disabled for that input
    warning(209): AMPL.g:75:1: Multiple token rules can match input such as "':'": ASSIGN, COLON

    As a result, token(s) COLON were disabled for that input
    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'c'": CHAR, CIRCULAR, CROSS, ID

    As a result, token(s) CIRCULAR,CROSS,ID were disabled for that input
    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'p'": PARAM, PROD, ID

    As a result, token(s) PROD,ID were disabled for that input
    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'w'": WHEN, WHERE, WITHIN, ID

    As a result, token(s) WHERE,WITHIN,ID were disabled for that input
    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'a'": AND, ID

    As a result, token(s) ID were disabled for that input
    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'n'": NOT, ID

    As a result, token(s) ID were disabled for that input
    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'f'": FLOAT, ID

    As a result, token(s) ID were disabled for that input
    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'t'": THEN, TO, ID

    As a result, token(s) TO,ID were disabled for that input
    warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'s'": SET, SUBJECT, SYMBOLIC, SYMDIFF, SUM, ID

    As a result, token(s) SUBJECT,SYMBOLIC,SYMDIFF,SUM,ID were disabled for that input
    error(208): AMPL.g:852:1: The following token definitions can never be matched because prior tokens match the same input: BINARY,CIRCULAR,CROSS,DEFAULT,DIFF,DIMENSION,ELSE,INT,INTEGER,INTER,INTERVAL,LIST,LONG,MAXIMIZE,MIN,MINIMIZE,ORDERED,PROD,SUBJECT,SYMBOLIC,SYMDIFF,SUM,TO,WHERE,WITHIN,LE,GE,COLON,MultiLineComment
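
Read together, the warnings above follow one pattern: for each first character, the earliest-declared token is kept and every later token starting with that character is disabled. The sketch below only reproduces that grouping for a few literals from the tokens section; it is an illustration of the pattern in the log, not a description of how ANTLR analyzes the lexer, and the class name FirstCharCollisions is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FirstCharCollisions {
    // Groups literals by their first character, preserving declaration order.
    static Map<Character, List<String>> groupByFirstChar(List<String> literals) {
        Map<Character, List<String>> groups = new LinkedHashMap<>();
        for (String lit : literals) {
            groups.computeIfAbsent(lit.charAt(0), k -> new ArrayList<>()).add(lit);
        }
        return groups;
    }

    public static void main(String[] args) {
        // A subset of the token literals, in the order they are declared in the grammar.
        // ID is left out because it is a lexer rule, not a literal.
        List<String> literals = Arrays.asList(
                "by", "binary", "or", "ordered", ":=", "<", "<=", ":");
        for (Map.Entry<Character, List<String>> e : groupByFirstChar(literals).entrySet()) {
            List<String> group = e.getValue();
            if (group.size() > 1) {
                // The first literal in declaration order is the one the warnings keep.
                System.out.println("'" + e.getKey() + "': kept " + group.get(0)
                        + ", disabled " + group.subList(1, group.size()));
            }
        }
    }
}
```

For these literals the grouping matches the log: 'b' keeps BY and disables BINARY, 'o' keeps OR and disables ORDERED, ':' keeps ASSIGN and disables COLON, and '<' keeps LT and disables LE.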

-- 
David J. Biesack, SAS
SAS Campus Dr. Cary, NC 27513
www.sas.com    (919) 531-7771

