[antlr-interest] Does ANTLR exactly allow Unicode?

新买 inshua at gmail.com
Sun Oct 22 02:59:59 PDT 2006


I created a simple grammar to study ANTLR, using Chinese characters as
letters, and ANTLR reports no warnings or errors.
However, when I feed it a short demo input like the one below
(开始 = "begin", 输出 = "output", 结束 = "end"):

开始
输出 "开始开始";
结束

it reports an awful error:
line 1:1: unexpected char: 0xBF
 at LearnLexer.nextToken(LearnLexer.java:102)
 at antlr.TokenBuffer.fill(TokenBuffer.java:69)
 at antlr.TokenBuffer.LT(TokenBuffer.java:86)
 at antlr.LLkParser.LT(LLkParser.java:56)
 at LearnParser.multiWriteStatement(LearnParser.java:89)
 at Test.main(Test.java:18)

Tracing the lexer, I found something interesting: the character "开" is
"\u5f00", but the lexer reports it as 0xBF.
Could somebody tell me how to use Unicode with ANTLR correctly? Thanks a lot.

header{
 import java.util.*;
}
class LearnLexer extends Lexer;
options{
  charVocabulary = '\u0003' .. '\uFFFE';
  caseSensitive = false;
  k = 2;
}
String :
 '\"' (~'\"')* '\"'
;

YINHAO :
 '\"';

WS : (' '
 | '\t'
 | '\n'
 | '\r')
  { _ttype = Token.SKIP; }
 ;

WRITE:
 "\u8f93\u51fa"
;

Fenhao : ';'
;

BEGIN :  "\u5f00\u59cb"
;

END   :     "\u7ed3\u675f"  // 结束
;

class LearnParser extends Parser;
options{
 buildAST = true;
}
writeStatement :
   WRITE^ String Fenhao!;

multiWriteStatement :
  BEGIN^ (writeStatement)* END!
;


class LearnTreeWalker extends TreeParser;
multiWriteStatement{
 int i;
}
 : #(a:BEGIN .) {
   for(AST t = a.getFirstChild(); t != null; t = t.getNextSibling()){
    writeStatement(t);
   }
  }
;
writeStatement{
 String s;
}
: #(WRITE s=string) {System.out.print(s);}
;

string returns[String r]{
 r = null;
}
:  s : String {r = s.getText();}
;