[antlr-interest] How to use arabic letters in my tokens ?

shmuel siegel antlr at shmuelhome.mine.nu
Wed Mar 26 14:09:36 PDT 2008


Gavin Lambert wrote:
> At 08:25 27/03/2008, Ahmed Hamouda wrote:
>> I want to define a token that covers all the letters a user can type.
>> These include Arabic letters.
>> I tried to add them by hand as follows: ‘Ç’ | ‘È’ | ‘Ì’… and 
>> so on, but I received an error during generation.
>
> Firstly, those don't appear Arabic to me; just regular wider latin 
> characters. Secondly, you can't write Unicode characters directly in 
> either ANTLRv2 or ANTLRv3 since ANTLRv2 doesn't support Unicode at all 
> and ANTLRv3 still uses ANTLRv2 to parse the grammars themselves. 
> (ANTLRv3 grammars can recognise Unicode characters though.)
>
Interesting. The original appeared as Arabic on my computer (which is 
not an Arabic-localized machine), so there is an encoding problem somewhere.
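
To illustrate the kind of mismatch I suspect, here is a minimal Java sketch. 
It assumes the original bytes were 0xC7, 0xC8 and 0xCC (the Latin-1 values of 
the three characters quoted above) and that the sender's codepage was 
windows-1256; both are guesses on my part, and the class and method names are 
made up for the example. It also assumes your JRE ships the windows-1256 
charset, which a minimal runtime might not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatch {
    // Decode the same raw bytes under the given charset.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        // Assumed raw bytes behind the three characters in the original mail.
        byte[] raw = {(byte) 0xC7, (byte) 0xC8, (byte) 0xCC};

        String latin = decode(raw, StandardCharsets.ISO_8859_1);
        // Throws UnsupportedCharsetException if the JRE lacks this charset.
        String arabic = decode(raw, Charset.forName("windows-1256"));

        // Print the code point each byte maps to under each interpretation.
        for (int i = 0; i < raw.length; i++) {
            System.out.printf("0x%02X  Latin-1: U+%04X  windows-1256: U+%04X%n",
                    raw[i] & 0xFF, (int) latin.charAt(i), (int) arabic.charAt(i));
        }
    }
}
```

If the guess is right, the second column shows accented Latin code points 
and the third shows code points in the 0600 range, which is what the grammar 
should be using.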

>> I also tried to use these alternatives
>>
>> | '\u00c2' | '\u00c3' | '\u00c4' | '\u00c5' | '\u00c6' | '\u00c7' | 
>> '\u00c8' | '\u00c9'
>> | '\u00c0' | '\u00ca' | '\u00cb' | '\u00cc' | '\u00cd' |
> [...]
>
> First, when there's a contiguous range you can specify it like so:
> '\u00c0'..'\u00c7'
>
> And again, those don't appear to be Arabic characters. Run "charmap" 
> and make sure you switch it to Unicode mode. You're probably putting 
> in the ANSI encodings from your Arabic codepage instead.
>

If you are using ANTLR 3 and Java, you will need to use the Unicode Arabic 
code block, which spans '\u0600' to '\u06FF'.
The full table can be found here: http://unicode.org/charts/PDF/U0600.pdf
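
Before writing the range into the grammar, it may help to check on the Java 
side which characters actually fall inside that block. The following is only 
a sketch: the class and method names are invented for the example, and it 
assumes you only care about the basic Arabic block (0600 to 06FF), not the 
supplement or presentation-form blocks. Note that the block also contains 
digits and punctuation, so a letter token probably wants the narrower check.

```java
public class ArabicLetters {
    // True if c lies in the basic Unicode Arabic block.
    static boolean inArabicBlock(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.ARABIC;
    }

    // True if c is in the Arabic block and is classified as a letter,
    // which excludes Arabic-Indic digits and Arabic punctuation.
    static boolean isArabicLetter(char c) {
        return inArabicBlock(c) && Character.isLetter(c);
    }

    public static void main(String[] args) {
        // ALEF, BEH, Arabic-Indic digit zero, Latin C with cedilla.
        char[] samples = {'\u0627', '\u0628', '\u0660', '\u00C7'};
        for (char c : samples) {
            System.out.printf("U+%04X  inBlock=%b  letter=%b%n",
                    (int) c, inArabicBlock(c), isArabicLetter(c));
        }
    }
}
```

In the grammar itself, the whole block would be written with the range 
syntax Gavin showed, i.e. '\u0600'..'\u06FF'.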


