[antlr-interest] 8 bit ASCII and cpp source code
Jim O'Connor
Jim.O'Connor at microfocus.com
Fri Jan 26 13:25:45 PST 2007
Hi all,
Quick sanity check. Background: ANTLR 2.7.5 C++ source,
compiled into a library. I am running into Swedish characters in my
input file, for example octal 330 (hex 0xd8, decimal 216).
I limited the charVocabulary in the test case to
charVocabulary = '\3' .. '\177'|'\330';
The lexer checks for IDENTIFIERs:
IDENTIFIER :
(SIMPLE_LETTER) (SIMPLE_LETTER | '_' | '0'..'9')*
;
SIMPLE_LETTER :
'a'..'z' | '\330'
;
The switch statement in the generated lexer is:
void SqlCobolLexer::mSIMPLE_LETTER(bool _createToken) {
int _ttype; ANTLR_USE_NAMESPACE(antlr)RefToken _token;
ANTLR_USE_NAMESPACE(std)string::size_type _begin = text.length();
_ttype = SIMPLE_LETTER;
ANTLR_USE_NAMESPACE(std)string::size_type _saveIndex;
switch ( LA(1)) {
case 0x61 /* 'a' */ :
case 0x62 /* 'b' */ :
case 0x63 /* 'c' */ :
case 0x64 /* 'd' */ :
case 0x65 /* 'e' */ :
case 0x66 /* 'f' */ :
case 0x67 /* 'g' */ :
case 0x68 /* 'h' */ :
case 0x69 /* 'i' */ :
case 0x6a /* 'j' */ :
case 0x6b /* 'k' */ :
case 0x6c /* 'l' */ :
case 0x6d /* 'm' */ :
case 0x6e /* 'n' */ :
case 0x6f /* 'o' */ :
case 0x70 /* 'p' */ :
case 0x71 /* 'q' */ :
case 0x72 /* 'r' */ :
case 0x73 /* 's' */ :
case 0x74 /* 't' */ :
case 0x75 /* 'u' */ :
case 0x76 /* 'v' */ :
case 0x77 /* 'w' */ :
case 0x78 /* 'x' */ :
case 0x79 /* 'y' */ :
case 0x7a /* 'z' */ :
{
matchRange('a','z');
break;
}
case 0xd8:
{
match(static_cast<unsigned char>('\330') /* charlit */ );
break;
}
The CharScanner and InputBuffer classes have a return type of int for
LA(). LA(1) returns 0xffd8 for my "problem" character, which does not
match case 0xd8.
Solution: change LA() to return unsigned? Ric hinted at such in a
2004 archive note.
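A value like 0xffd8 is what one would expect if the byte is sign-extended somewhere on its way to int. The following is only a minimal sketch of that effect, not the ANTLR code: la_sign_extended, la_zero_extended and is_simple_letter are hypothetical names made up for illustration, and whether the 2.7.5 runtime widens the character this way is an assumption.

```cpp
#include <cstddef>

// Hypothetical stand-ins for a lexer's lookahead; NOT the ANTLR API.

// Widening through a signed 8-bit type sign-extends: byte 0xd8 becomes -40.
int la_sign_extended(const char* buf, std::size_t i) {
    return static_cast<signed char>(buf[i]);
}

// Widening through unsigned char keeps the byte in the range 0..255.
int la_zero_extended(const char* buf, std::size_t i) {
    return static_cast<unsigned char>(buf[i]);
}

// Mirrors the shape of the generated switch: true only for 'a'..'z' or 0xd8.
bool is_simple_letter(int c) {
    if (c >= 'a' && c <= 'z') return true;
    return c == 0xd8;
}
```

For a buffer whose first byte is '\330', la_sign_extended gives -40 (0xffffffd8 as a 32-bit int), so the 0xd8 case is missed, while la_zero_extended gives 216 and the case matches. Casting to unsigned char at the point where the byte is read may therefore be enough, without changing the return type of LA().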
Thanks for reading to here
Jim
Hope all is going well