[antlr-interest] UTF-8, charVocabulary in options in 3.3

Sat Jun 30 02:54:14 PDT 2012

Hi,

I edited the input file via PuTTY in a Linux console with the session
encoding set to UTF-8. I have now also created the file with
Notepad++, again with the encoding set to UTF-8, and I see the same
behaviour. Is there an easy way to print out the contents of an
ANTLRFileStream? I suspect that I am looking for the wrong character
code in the grammar file ...

TIA,
Matej

2012/6/29 Bart Kiers <bkiers at gmail.com>:
> On Fri, Jun 29, 2012 at 12:26 PM, Matej Mailing <mailing at tam.si> wrote:
>>
>> Hi,
>>
>> I am new to ANTLR but already have an issue. I have an input file that
>> contains some non-ASCII characters (like U+0161 -
>> http://www.fileformat.info/info/unicode/char/161/index.htm) and I am
>> using ANTLRFileStream(inputfile, "UTF-8") to read the input, which is
>> encoded in UTF-8 as it should be. However, when I do
>> "RES      : '\u0161' ;"
>>
>> it never matches. I get the message "input1 line 1:0 no viable
>> alternative at character 'š'".
>>
>> When I add the following segment to the grammar file:
>>
>> "options
>> {
>>           charVocabulary='\u0000'..'\uFFFE';
>> }"
>>
>> I get an error:
>> "internal error:  : java.lang.Error: Error parsing grammar.g: '\uFFFE'
>> not expected ';'"
>> ...
>> error(100): grammar.g:5:24: syntax error: antlr: grammar.g:5:24:
>> expecting SEMI, found '..'
>> error(133): grammar.g:3:1: illegal option charVocabulary"
>>
>> I have been googling around for quite some time and none of the
>> solutions seems to be working. What am I doing wrong?
>>
>
> charVocabulary is an (old) ANTLR v2 option; ANTLR v3 doesn't need it: v3
> accepts the range 0x0000..0xFFFF by default. So remove the charVocabulary
> option.
>
> My guess is that you didn't save the input file containing 0x0161 properly
> (I'm guessing it's saved as plain ASCII). Make sure you save it as
> Unicode/UTF-xx.
>
> Regards,
>
> Bart.
>
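
One way to check which character codes actually arrive from the input file
is to decode it as UTF-8 and print every code point. The following is only a
minimal sketch under stated assumptions: it uses the standard Java library
rather than ANTLRFileStream itself, and the class name CodePointDump and the
helper describe() are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, not part of ANTLR.
class CodePointDump {

    // Formats every code point of the text as U+XXXX, separated by spaces.
    static String describe(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the input file to inspect.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        // Decode the raw bytes as UTF-8, the same encoding name that is
        // passed to ANTLRFileStream in the question.
        String text = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(describe(text));
    }
}
```

If the file really starts with a UTF-8 encoded 'š', the output should begin
with U+0161. A leading U+FEFF would mean the editor wrote a byte order mark,
and U+FFFD would mean the bytes are not valid UTF-8, for example because the
file was saved in a single-byte code page.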