[antlr-interest] Re: proposal for 2.7.4: charVocabulary defaults to ascii 1..127
Mike Lischke
lists at lischke-online.de
Sat May 1 14:40:17 PDT 2004
> Ok, so maybe I should have said
>
> charVocabulary = "UTF-8";
>
> and UTF-16.
Don't use the transformation format identifiers as vocabulary names. That would be like saying "base64" instead
of ASCII. These formats do not generally describe a character range (although sometimes they do, as with UTF-16 and
UTF-32). Things get worse once people start asking whether UTF-16LE or UTF-16BE is meant (these describe the byte order
in the character stream). That is really not ANTLR's responsibility.
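To illustrate the point, here is a minimal sketch in Python (not ANTLR; the sample string and helper name are made up for the example). It serializes the same three characters with three transformation formats: the bytes differ each time, but the code points a lexer would have to match stay the same.

```python
# Sample text (an assumption for illustration): 'A', U+00E4, U+D55C.
text = "A\u00e4\ud55c"

def code_points(s):
    """Return the Unicode code points of a string."""
    return [ord(ch) for ch in s]

encodings = ["utf-8", "utf-16-le", "utf-16-be"]

# Same characters, three different byte serializations.
encoded = {name: text.encode(name) for name in encodings}

# Decoding each byte stream gives back identical code points.
decoded = {name: code_points(data.decode(name)) for name, data in encoded.items()}

for name in encodings:
    print(name, encoded[name].hex(), [hex(cp) for cp in decoded[name]])
```

The byte order only matters while decoding the stream; by the time characters reach the lexer, the vocabulary is a set of code points either way.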
> The point is more that named character sets have
> an advantage in that error messages can be issued. Ter's
> example of "Korean" is one that would pretty clearly not be
> recognized. Many of the vocabulary problems are failure to
> specify a range, but "Does ANTLR support unicode" is a close second.
A descriptive name like Korean, Bopomofo etc. would help the grammar writer. Fortunately, the Unicode
standard defines separate, well-delimited blocks for many scripts and languages. For others, however, this does not hold.
E.g. German, Swiss, Polish etc. all use the same ranges (LATIN-1, LATIN-extended), so you cannot always specify a
particular language as the vocabulary (well, you can in one direction, because the mapping from language name to
character values is 1:n, but the reverse direction is often impossible).
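A rough sketch of that 1:n mapping, again in Python rather than ANTLR. The block ranges are the ones from the Unicode block list; the language table and the function names are hypothetical and deliberately incomplete, only there to show the shape of the problem.

```python
# Unicode block ranges (inclusive), as listed in the Unicode block list.
BLOCKS = {
    "Basic Latin": (0x0000, 0x007F),
    "Latin-1 Supplement": (0x0080, 0x00FF),
    "Latin Extended-A": (0x0100, 0x017F),
    "Bopomofo": (0x3100, 0x312F),
    "Hangul Syllables": (0xAC00, 0xD7AF),
}

# Hypothetical, incomplete language table (an assumption for illustration).
LANGUAGES = {
    "German": ["Basic Latin", "Latin-1 Supplement"],
    "Polish": ["Basic Latin", "Latin-1 Supplement", "Latin Extended-A"],
    "Korean": ["Hangul Syllables"],
}

def vocabulary(name):
    """Resolve a block or language name to a list of (first, last) ranges."""
    if name in BLOCKS:
        return [BLOCKS[name]]
    if name in LANGUAGES:
        return [BLOCKS[block] for block in LANGUAGES[name]]
    # An unknown name can be reported, unlike a silently wrong range.
    raise ValueError(f"unknown vocabulary name: {name!r}")

def languages_using(code_point):
    """Reverse lookup: every language in the table whose ranges contain the code point."""
    return sorted(
        lang
        for lang, blocks in LANGUAGES.items()
        if any(BLOCKS[b][0] <= code_point <= BLOCKS[b][1] for b in blocks)
    )

print(vocabulary("Polish"))      # name -> ranges: one name, several ranges
print(languages_using(0x00E4))   # U+00E4 -> ['German', 'Polish'], so not unique
print(languages_using(0xD55C))   # U+D55C -> ['Korean'] in this small table
```

The forward lookup always has an answer (or a clear error for an unknown name), while the reverse lookup for a Latin character returns several candidates.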
Mike
--
www.soft-gems.net