[antlr-interest] Unicode Support

Peggy Fieland madcapmaggie at yahoo.com
Wed Jul 5 10:13:04 PDT 2006


I can't say anything about version 3, but in version
2.7.5 (or 2.7.6), it's possible to add Unicode support by
modelling it on the unicode example. The example uses
UTF-8 encoding.

I have some files (slightly modified from the example)
that I am using. I added
   MismatchedUnicodeCharException.cpp
to lib/cpp/src, and
   MismatchedUnicodeCharException.hpp
   UnicodeCharBuffer.hpp
   UnicodeCharScanner.hpp
to lib/cpp/antlr.

It's then possible to define identifiers/strings that
contain Unicode characters and to parse UTF-8-encoded
input that contains identifiers and strings with
Unicode characters.
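
To give a rough idea of what the character-buffer side has to do,
here is a minimal standalone sketch of UTF-8 decoding plus an
identifier check. It is not the code from the example or from my
files: decodeNext, decodeUtf8, isIdentStart, isIdentPart and
isIdentifier are hypothetical names, and the identifier rule (ASCII
letters, '_', and any code point at or above U+0080, with digits
allowed after the first character) is only an assumption for
illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of ANTLR): decode one UTF-8 sequence that
// starts at in[pos], advance pos past it, and return the code point.
// Malformed input yields U+FFFD and consumes exactly one byte.
char32_t decodeNext(const std::string& in, std::size_t& pos) {
    assert(pos < in.size());  // precondition: there is at least one byte left
    const unsigned char b0 = static_cast<unsigned char>(in[pos]);
    std::size_t len = 0;
    char32_t cp = 0;
    if (b0 < 0x80)                { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else { ++pos; return 0xFFFD; }            // stray continuation / bad lead byte

    if (pos + len > in.size()) { ++pos; return 0xFFFD; }  // truncated sequence

    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char b = static_cast<unsigned char>(in[pos + i]);
        if ((b & 0xC0) != 0x80) { ++pos; return 0xFFFD; } // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }

    // Reject overlong encodings, UTF-16 surrogates and out-of-range values.
    static const char32_t minForLen[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < minForLen[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        ++pos;
        return 0xFFFD;
    }
    pos += len;
    return cp;
}

// Decode a whole UTF-8 string into code points.
std::vector<char32_t> decodeUtf8(const std::string& in) {
    std::vector<char32_t> out;
    for (std::size_t pos = 0; pos < in.size(); ) {
        out.push_back(decodeNext(in, pos));
    }
    return out;
}

// Assumed identifier rule, for illustration only.
bool isIdentStart(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') || c == U'_' ||
           (c >= 0x80 && c != 0xFFFD);        // U+FFFD marks malformed input
}

bool isIdentPart(char32_t c) {
    return isIdentStart(c) || (c >= U'0' && c <= U'9');
}

// True if the UTF-8 text is a non-empty identifier under the rule above.
bool isIdentifier(const std::string& utf8) {
    const std::vector<char32_t> cps = decodeUtf8(utf8);
    if (cps.empty() || !isIdentStart(cps[0])) return false;
    for (std::size_t i = 1; i < cps.size(); ++i) {
        if (!isIdentPart(cps[i])) return false;
    }
    return true;
}
```

With something like this, isIdentifier("\xE4\xB8\xAD\xE6\x96\x87_1")
(two Chinese characters followed by "_1") is accepted, while "1abc"
or a truncated byte sequence is rejected. A real lexer would do the
decoding in the character buffer and match on code points in the
lexer rules rather than checking whole strings after the fact.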

If you are interested in gory details (like the files
themselves or more details on the lexer itself) feel
free to contact me.

Peggy

--- Rowan Woodhouse <rowan at querix.com> wrote:

>  Hi,
> 
> I've been looking through the archives/web site etc.
> to try to figure this out but I haven't been able to
> come up with a definite answer, so here goes.
> 
> I am looking at writing a lexer in C/C++ that can
> handle ASCII or Unicode encoded input files and
> allow the use of Unicode characters for things such
> as string literals and identifiers. For example the
> source code could be:
> 
> function main
>   define AAA integer
>   define somename string
>   AAA = 2
>   somename = "BBBB"
>   CCCC(AAA, somename)
> end main
> 
> function CCCC(AAA, somename)
>   if somename == "BABA"
>     return AAA
>   else
>     return 0
> end CCCC
> 
> where AAA, BBBB, CCCC and BABA are all Chinese
> character strings.
> 
> Would it be possible to get Antlr to generate a C
> lexer for this? If not, which of the above would be
> possible, i.e. just the string literals?
> 
> Many thanks,
> Rowan
> 


