[antlr-interest] C# lexer and unicode

Fri Jan 30 20:17:06 PST 2004

I would like to know whether ANTLR's C# parser generator supports Unicode.
My input contains some Chinese/Japanese identifiers, and they are not 
being lexed properly: they are simply skipped from the stream and never 
show up in the lexer's nextToken() method.

I wonder whether this is because there is something wrong in my lexer or 
just because Unicode is not yet fully supported.

I have:

  charVocabulary = '\u0000'..'\ufffe';

Here's my whitespace rule:

// Whitespace -- ignored
WS      : ( options { generateAmbigWarnings = false; }
	  : ' '	// blank
	  | '\t'	// tab
	  | "\r\n"    	{newline();} // Windows
	  | ('\r'|'\n') {newline();} // Unix or Mac
	  | '\f'      // form feed
	  | ('\0'..'\10'|'\16'..'\37')  // control characters
	  ) {$setType(Token.SKIP);}
	;

Here's my rule for identifiers:

IDENT
options {testLiterals=true;
         paraphrase="an identifier";}
	: ('\u0080'..'\ufffe'|'a'..'z'|'_')
	  ('\u0080'..'\ufffe'|'a'..'z'|'_'|'$'|'0'..'9')*
	;

And here's the string I'm trying to parse:

基金代码 VARCHAR(6) NOT NULL
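
For reference, here is a minimal Python sketch (not ANTLR code) that checks whether the characters of the identifier fall inside the ranges listed in the IDENT rule. It assumes the identifier is the eight GB2312 bytes BB F9 BD F0 B4 FA C2 EB, and it decodes them two ways: as GB2312 text, and one byte per character as Latin-1. The helper ident_char_ok is a hypothetical name that only mirrors the first character set of IDENT; it does not reproduce how ANTLR matches input.

```python
# Hypothetical helper, not part of ANTLR: mirrors the first character
# set of the IDENT rule ('\u0080'..'\ufffe' | 'a'..'z' | '_').
def ident_char_ok(ch):
    return '\u0080' <= ch <= '\ufffe' or 'a' <= ch <= 'z' or ch == '_'

# Assumption: the identifier in the sample is these eight bytes of GB2312 text.
raw = bytes([0xBB, 0xF9, 0xBD, 0xF0, 0xB4, 0xFA, 0xC2, 0xEB])

as_gb2312 = raw.decode("gb2312")    # two bytes per character
as_latin1 = raw.decode("latin-1")   # one byte per character

for label, text in (("gb2312", as_gb2312), ("latin-1", as_latin1)):
    points = ["U+%04X" % ord(ch) for ch in text]
    covered = all(ident_char_ok(ch) for ch in text)
    print(label, points, "covered by IDENT ranges:", covered)
```

Under either decoding every character sits at or above U+0080, so the ranges in IDENT nominally include them. The check says nothing about how the lexer's input stream is actually decoded.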
