[antlr-interest] Re: proposal for 2.7.4: charVocabulary defaults to ascii 1..127

Mike Lischke lists at lischke-online.de
Sat May 1 14:40:17 PDT 2004


> Ok, so maybe I should have said
> 
> charVocabulary = "UTF-8";
> 
> and UTF-16.  

Don't use the transformation format identifiers as vocabulary names. That would be like saying "base64" instead of
ASCII. These formats describe how characters are encoded and do not, in general, describe a character range (although
sometimes they do, as with UTF-16 and UTF-32). Things get worse once people start asking whether UTF-16LE or UTF-16BE
is meant (these describe the byte order in the character stream). That is really not ANTLR's responsibility.
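
To illustrate the difference between an encoding and a vocabulary, here is a minimal Java sketch. The class and
method names are made up for this example and have nothing to do with ANTLR itself. It encodes the same character
with different transformation formats: the bytes differ, but the character that comes back out is the same in each
case.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of ANTLR.
public class EncodingVsVocabulary {
    // Encode the text into bytes using the given transformation format.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Encode and decode again; the character sequence should survive unchanged.
    static String roundTrip(String text, Charset charset) {
        return new String(encode(text, charset), charset);
    }

    public static void main(String[] args) {
        String text = "\u00E4"; // LATIN SMALL LETTER A WITH DIAERESIS

        Charset[] formats = {
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16LE,
            StandardCharsets.UTF_16BE
        };

        for (Charset cs : formats) {
            byte[] bytes = encode(text, cs);
            // The byte sequence depends on the format, the code point does not.
            System.out.println(cs.name() + ": bytes=" + Arrays.toString(bytes)
                    + " codePoint=U+" + Integer.toHexString(roundTrip(text, cs).codePointAt(0)));
        }
    }
}
```

The lexer only ever needs to care about the code point; which byte layout it arrived in is a matter for whatever
reads the input stream.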

>The point is more that named character sets have 
> an advantage in that error messages can be issued.  Ter's 
> example of "Korean" is one that would pretty clearly not be 
> recognized.  Many of the vocabulary problems are failure to 
> specify a range, but "Does ANTLR support unicode" is a close second.

Using a descriptive name like Korean, Bopomofo etc. would help the grammar writer. Fortunately, the Unicode
standard defines clearly delimited ranges for many languages. For others, however, this does not hold true. E.g.
German, Swiss, Polish etc. all use the same ranges (Latin-1, Latin Extended), so you cannot always specify a particular
language as the vocabulary (well, you can in one direction, because the mapping from language name to character values
is 1:n, but the reverse direction is often impossible).
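
The 1:n mapping can be seen with a small Java sketch that uses the standard Character.UnicodeBlock lookup. The class
name, the helper method and the sample words are assumptions chosen for illustration only. It collects the Unicode
blocks that a piece of sample text touches.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical demo class, not part of ANTLR.
public class VocabularyBlocks {
    // Collect the Unicode blocks touched by the code points of a sample text.
    static Set<Character.UnicodeBlock> blocksOf(String sample) {
        Set<Character.UnicodeBlock> blocks = new LinkedHashSet<>();
        sample.codePoints().forEach(cp -> {
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            if (block != null) {          // null means the code point is in no defined block
                blocks.add(block);
            }
        });
        return blocks;
    }

    public static void main(String[] args) {
        // German sample: "Gr" + u-umlaut + sharp s + "e"
        System.out.println("German:   " + blocksOf("Gr\u00FC\u00DFe"));
        // Polish sample: letters from Latin-1 Supplement and Latin Extended-A
        System.out.println("Polish:   " + blocksOf("\u017C\u00F3\u0142\u0107"));
        // Bopomofo sample: two letters from the dedicated Bopomofo block
        System.out.println("Bopomofo: " + blocksOf("\u3105\u3106"));
    }
}
```

The Bopomofo sample maps to a single dedicated block, while the German and Polish samples each span several blocks
and share some of them, so a block cannot be traced back to one language.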

Mike
--
www.soft-gems.net



 