[antlr-interest] Problem with this antlr grammar.
Martin Probst
mail at martin-probst.com
Wed Mar 2 02:32:22 PST 2005
Hi,
> OK. Is there a way of specifying "all characters in UTF-8" or UTF-16
> etc. or do I have to do it by hand?
That depends on your target language. In Java, all characters are UCS-2
characters, and the same holds for ANTLR-generated Java code. So in Java
you can use a UCS-2 charVocabulary.
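As a rough sketch of what that looks like in an ANTLR 2 lexer (the class
name MyLexer is made up for illustration, and the exact range is an
assumption you may want to narrow):

```antlr
class MyLexer extends Lexer;
options {
    // Accept the whole 16-bit range a Java char can hold.
    // The range stops at '\uFFFE' on the assumption that ANTLR 2
    // reserves '\uFFFF' as its end-of-file character.
    charVocabulary = '\u0000'..'\uFFFE';
}
```

For this to work, the lexer has to read from a Reader that decodes the
input, so that the UTF-8 bytes have already become chars before the
lexer sees them.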
In C++ there is some support for Unicode, but I haven't tried it. It
works well, though, if you specify the characters in a rule of their
own, e.g.
UTF8LETTER:
    /* European, Arabic, Hebrew */
      '\u00C2'..'\u00DF' '\u0080'..'\u00BF'
    /* Indic, Thai, CJK, some symbols */
    | '\u00E0' '\u00A0'..'\u00BF' '\u0080'..'\u00BF'
    | '\u00E1'..'\u00EC' '\u0080'..'\u00BF' '\u0080'..'\u00BF'
    | '\u00ED' '\u0080'..'\u009F' '\u0080'..'\u00BF'
    | '\u00EF' '\u00A4'..'\u00BF' '\u0080'..'\u00BF'
    /* Custom Area #1 */
    | '\u00EE' '\u0080'..'\u00BF' '\u0080'..'\u00BF'
    | '\u00EF' '\u0080'..'\u00A3' '\u0080'..'\u00BF'
    /* Supplementary characters: more CJK, historical, math, musical */
    | '\u00F0' '\u0090'..'\u00BF' '\u0080'..'\u00BF' '\u0080'..'\u00BF'
    | '\u00F1'..'\u00F2' '\u0080'..'\u00BF' '\u0080'..'\u00BF' '\u0080'..'\u00BF'
    | '\u00F3' '\u0080'..'\u00AF' '\u0080'..'\u00BF' '\u0080'..'\u00BF'
    /* Custom Area #2 */
    | '\u00F3' '\u00B0'..'\u00BF' '\u0080'..'\u00BF' '\u0080'..'\u00BF'
    | '\u00F4' '\u0080'..'\u008F' '\u0080'..'\u00BF' '\u0080'..'\u00BF'
    ;
This is not the perfect way of doing it: in many cases you have to
make sure you don't mix up base chars and ideographic chars. But if you
simply have to match UTF-8, this should work. The rules don't include
plain ASCII (bytes below 0x80) btw; the lowest lead byte they match is
0xc2.
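To cover ASCII as well, UTF8LETTER can be combined with ordinary
character ranges. A minimal sketch, assuming a byte-oriented lexer: the
names Utf8Lexer and IDENT are invented here, and k = 2 is my guess,
based on the two alternatives starting with '\u00EF' (and the two
starting with '\u00F3') differing only in their second byte.

```antlr
class Utf8Lexer extends Lexer;
options {
    // Assumption: two chars of lookahead to separate the alternatives
    // of UTF8LETTER that share a lead byte ('\u00EF', '\u00F3').
    k = 2;
    // Raw bytes, not decoded characters.
    charVocabulary = '\u0000'..'\u00FF';
}

// An identifier built from ASCII letters, digits and underscore,
// plus any multi-byte sequence matched by UTF8LETTER.
IDENT
    : ( 'a'..'z' | 'A'..'Z' | '_' | UTF8LETTER )
      ( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' | UTF8LETTER )*
    ;
```

Here UTF8LETTER is the rule given above. If it only serves as a helper
like this, it is probably best declared protected so the lexer does not
emit it as a token of its own.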
Regards,
Martin