[antlr-interest] adding Unicode identifiers confuses grammar
David J. Biesack
David.Biesack at sas.com
Mon Sep 14 08:36:59 PDT 2009
I'm working on a grammar for an AMPL-like language (see an extracted simplified
version below). It works fine (ANTLR 3.1.3) when I use the following token
definition for identifiers:
ID
:
('a'..'z'|'A'..'Z'|'_'|'$') ('a'..'z'|'A'..'Z'|'_'|'0'..'9'|'$')*
;
but when I copy the token fragments for Unicode identifiers from
http://openjdk.java.net/projects/compiler-grammar/antlrworks/Java.g
and change my ID rule to use them:
ID
:
IdentifierStart IdentifierPart*
;
I get many warnings and disabled tokens, and an error. Here are some (full errors listed below):
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'o'": OR, ORDERED, ID
As a result, token(s) ORDERED,ID were disabled for that input
...
warning(209): AMPL.g:75:1: Multiple token rules can match input such as "':'": ASSIGN, COLON
As a result, token(s) COLON were disabled for that input
...
error(208): AMPL.g:852:1: The following token definitions can never be matched because prior tokens match the same input: BINARY,CIRCULAR,CROSS,DEFAULT,DIFF,DIMENSION,ELSE,INT,INTEGER,INTER,INTERVAL,LIST,LONG,MAXIMIZE,MIN,MINIMIZE,ORDERED,PROD,SUBJECT,SYMBOLIC,SYMDIFF,SUM,TO,WHERE,WITHIN,LE,GE,COLON,MultiLineComment
For example, the colon ':' is '\u003a', which is not in IdentifierStart, so I don't see how
switching ID to the Unicode fragments causes an ambiguity between the ASSIGN and COLON tokens (':=' and ':').
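As a sanity check on that claim, here is a small Java sketch. It assumes the IdentifierStart and IdentifierPart fragments in Java.g are meant to cover the same characters as Character.isJavaIdentifierStart and Character.isJavaIdentifierPart; I have not verified that range by range. The class and method names are made up for illustration.

```java
public class IdentifierCheck {
    // Assumption: the IdentifierStart fragment mirrors this JDK predicate.
    public static boolean isStart(char c) {
        return Character.isJavaIdentifierStart(c);
    }

    // Assumption: the IdentifierPart fragment mirrors this JDK predicate.
    public static boolean isPart(char c) {
        return Character.isJavaIdentifierPart(c);
    }

    public static void main(String[] args) {
        // Punctuation from the warnings, plus a few identifier characters.
        char[] samples = { ':', '<', '/', 'o', '_', '$', '9', '\u00e9' };
        for (char c : samples) {
            System.out.printf("U+%04X start=%b part=%b%n",
                    (int) c, isStart(c), isPart(c));
        }
    }
}
```

If the assumption holds, ':', '<' and '/' are neither identifier-start nor identifier-part characters, so the ID rule by itself should not overlap with COLON, LE or the comment rules.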
(Not all of the tokens defined in the grammar below are used in this excerpt; they are used
in the longer grammar, though.)
I cannot figure out how to resolve this. Any help?
Here is the initial grammar that works:
grammar AMPL;
tokens {
AND = 'and';
BY = 'by';
BINARY = 'binary';
CHAR = 'char';
CIRCULAR = 'circular';
CROSS = 'cross';
DATA = 'data';
DEFAULT = 'default';
DIFF = 'diff';
DIMENSION = 'dimension';
END = 'end';
ELSE = 'else';
FLOAT = 'float';
IN = 'in';
INT = 'int';
INTEGER = 'integer';
INTER = 'inter';
INTERVAL = 'interval';
LET = 'let';
LIST = 'list';
LONG = 'long';
MAX = 'max';
MAXIMIZE = 'maximize';
MIN = 'min';
MINIMIZE = 'minimize';
NOT = 'not';
OR = 'or';
ORDERED = 'ordered';
PARAM = 'param';
PROD = 'prod';
REVERSED = 'reversed';
SET = 'set';
SUBJECT = 'subject';
SYMBOLIC = 'symbolic';
SYMDIFF = 'symdiff';
SUM = 'sum';
THEN = 'then';
TO = 'to';
UNION = 'union';
WHEN = 'when';
WHERE = 'where';
WITHIN = 'within';
VAR = 'var';
XOR = 'xor';
LBRACE = '{';
RBRACE = '}';
LPAREN = '(';
RPAREN = ')';
LBRACKET = '[';
RBRACKET = ']';
DQUOTE = '\"';
SQUOTE = '\'';
COMMA = ',';
SEMI = ';';
TIMES = '*';
MDOT = '·';
DIVIDE = '/';
RANGE = '..';
ASSIGN = ':=';
EQ = '=';
NE = '!=';
LT = '<';
GT = '>';
LE = '<=';
GE = '>=';
CONCAT = '||';
PLUS = '+';
MINUS = '-';
COLON = ':';
}
@parser::header {
package com.sas.test.antlr;
}
@lexer::header {
package com.sas.test.antlr;
}
formulation
:
declaration *
;
declaration
:
( set_binding
| var_binding
| param_binding
)
;
set_binding
: SET identifier SEMI
;
var_binding
: VAR identifier var_attributes? SEMI
;
var_attributes
: var_attribute (COMMA var_attributes )?
;
var_attribute
:
( INTEGER | BINARY )
;
param_binding
: PARAM identifier param_attributes? SEMI
;
param_attributes
: param_attribute (COMMA param_attributes)?
;
param_attribute
:
( INTEGER | BINARY | SYMBOLIC )
;
identifier
:
ID
;
// =============== Lexical Rules ===================
/**
* A simple identifier such as A or x or stock
*/
ID // options
:
// IdentifierStart IdentifierPart*
('a'..'z'|'A'..'Z'|'_'|'$') ('a'..'z'|'A'..'Z'|'_'|'0'..'9'|'$')*
;
// fragment
// IdentifierStart
// : .... // see http://openjdk.java.net/projects/compiler-grammar/antlrworks/Java.g
// ;
//
// fragment
// IdentifierPart
// : ... // see http://openjdk.java.net/projects/compiler-grammar/antlrworks/Java.g
// ;
/**
* Whitespace
*/
WS
:
( ' '
| '\t'
| '\f'
| ('\n'|'\r'('\n')?)
)+ { $channel=HIDDEN; }
;
SingleLineComment
: ('//' | '#') (~('\n'|'\r'))* ('\n'|'\r'('\n')?)? { $channel=HIDDEN; }
;
MultiLineComment
: '/*' ( options {greedy=false;} : . )* '*/' { $channel=HIDDEN; }
;
And here are all the ANTLR warnings and errors I get from 3.1.3 when ID uses the Unicode fragments:
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'b'": BY, BINARY, ID
As a result, token(s) BINARY,ID were disabled for that input
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'x'": XOR, ID
As a result, token(s) ID were disabled for that input
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'r'": REVERSED, ID
As a result, token(s) ID were disabled for that input
warning(209): AMPL.g:70:1: Multiple token rules can match input such as "'<'": LT, LE
As a result, token(s) LE were disabled for that input
warning(209): AMPL.g:852:1: Multiple token rules can match input such as "'/'": DIVIDE, SingleLineComment, MultiLineComment
As a result, token(s) SingleLineComment,MultiLineComment were disabled for that input
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'u'": UNION, ID
As a result, token(s) ID were disabled for that input
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'l'": LET, LIST, LONG, ID
As a result, token(s) LIST,LONG,ID were disabled for that input
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'i'": IN, INT, INTEGER, INTER, INTERVAL, ID
As a result, token(s) INT,INTEGER,INTER,INTERVAL,ID were disabled for that input
warning(209): AMPL.g:71:1: Multiple token rules can match input such as "'>'": GT, GE
As a result, token(s) GE were disabled for that input
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'v'": VAR, ID
As a result, token(s) ID were disabled for that input
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'d'": DATA, DEFAULT, DIFF, DIMENSION, ID
As a result, token(s) DEFAULT,DIFF,DIMENSION,ID were disabled for that input
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'m'": MAX, MAXIMIZE, MIN, MINIMIZE, ID
As a result, token(s) MAXIMIZE,MIN,MINIMIZE,ID were disabled for that input
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'e'": END, ELSE, ID
As a result, token(s) ELSE,ID were disabled for that input
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'o'": OR, ORDERED, ID
As a result, token(s) ORDERED,ID were disabled for that input
warning(209): AMPL.g:75:1: Multiple token rules can match input such as "':'": ASSIGN, COLON
As a result, token(s) COLON were disabled for that input
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'c'": CHAR, CIRCULAR, CROSS, ID
As a result, token(s) CIRCULAR,CROSS,ID were disabled for that input
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'p'": PARAM, PROD, ID
As a result, token(s) PROD,ID were disabled for that input
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'w'": WHEN, WHERE, WITHIN, ID
As a result, token(s) WHERE,WITHIN,ID were disabled for that input
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'a'": AND, ID
As a result, token(s) ID were disabled for that input
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'n'": NOT, ID
As a result, token(s) ID were disabled for that input
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'f'": FLOAT, ID
As a result, token(s) ID were disabled for that input
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'t'": THEN, TO, ID
As a result, token(s) TO,ID were disabled for that input
warning(209): AMPL.g:140:1: Multiple token rules can match input such as "'s'": SET, SUBJECT, SYMBOLIC, SYMDIFF, SUM, ID
As a result, token(s) SUBJECT,SYMBOLIC,SYMDIFF,SUM,ID were disabled for that input
error(208): AMPL.g:852:1: The following token definitions can never be matched because prior tokens match the same input: BINARY,CIRCULAR,CROSS,DEFAULT,DIFF,DIMENSION,ELSE,INT,INTEGER,INTER,INTERVAL,LIST,LONG,MAXIMIZE,MIN,MINIMIZE,ORDERED,PROD,SUBJECT,SYMBOLIC,SYMDIFF,SUM,TO,WHERE,WITHIN,LE,GE,COLON,MultiLineComment
--
David J. Biesack, SAS
SAS Campus Dr. Cary, NC 27513
www.sas.com (919) 531-7771