[antlr-interest] Problem with this antlr grammar.

Martin Probst mail at martin-probst.com
Wed Mar 2 02:32:22 PST 2005


Hi,

> OK. Is there a way of specifying "all characters in UTF-8" or UTF-16
> etc. or do I have to do it by hand?

That depends on your target language. In Java every char is a 16-bit
(UCS-2) character, and the same holds for the Java code that ANTLR
generates. So in Java you can use UCS-2 charVocabularies.

In C++ there is some support for Unicode, but I haven't tried it. It
works well, though, if you spell out the UTF-8 byte sequences in a rule
of their own, e.g.

UTF8LETTER:
        /* European, Arabic, Hebrew */
          '\u00C2'..'\u00DF'  '\u0080'..'\u00BF'
        /* Indic, Thai, CJK, some symbols */
        | '\u00E0'  '\u00A0'..'\u00BF'  '\u0080'..'\u00BF'
        | '\u00E1'..'\u00EC'  '\u0080'..'\u00BF'  '\u0080'..'\u00BF'
        | '\u00ED'  '\u0080'..'\u009F'  '\u0080'..'\u00BF'
        | '\u00EF'  '\u00A4'..'\u00BF'  '\u0080'..'\u00BF'
        /* Custom Area #1 */
        | '\u00EE'  '\u0080'..'\u00BF'  '\u0080'..'\u00BF'
        | '\u00EF'  '\u0080'..'\u00A3'  '\u0080'..'\u00BF'
        /* Supplementary characters: more CJK, historical, math, musical */
        | '\u00F0'  '\u0090'..'\u00BF'  '\u0080'..'\u00BF'  '\u0080'..'\u00BF'
        | '\u00F1'..'\u00F2'  '\u0080'..'\u00BF'  '\u0080'..'\u00BF'  '\u0080'..'\u00BF'
        | '\u00F3'  '\u0080'..'\u00AF'  '\u0080'..'\u00BF'  '\u0080'..'\u00BF'
        /* Custom Area #2 */
        | '\u00F3'  '\u00B0'..'\u00BF'  '\u0080'..'\u00BF'  '\u0080'..'\u00BF'
        | '\u00F4'  '\u0080'..'\u008F'  '\u0080'..'\u00BF'  '\u0080'..'\u00BF'
        ;

This is not the perfect way of doing it: in many cases you have to
make sure you don't mix up base characters and ideographic characters.
But if you simply have to match UTF-8, this should work. Note that the
rule doesn't include plain ASCII (bytes below 0x80), btw. Bytes 0x80 to
0xc1 never start a valid UTF-8 sequence, which is why the first
alternative begins at 0xc2.
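
To sanity-check the byte ranges outside of ANTLR, here is a rough
Python sketch that mirrors the alternatives of the rule as a table of
inclusive byte ranges. The names ALTERNATIVES and matches_utf8letter
are made up for this illustration and are not part of ANTLR; the sketch
only tests whether a byte string is exactly one sequence the rule would
accept.

```python
# Hypothetical helper, not part of ANTLR: each entry mirrors one
# alternative of the UTF8LETTER rule as inclusive (low, high) byte ranges.
ALTERNATIVES = [
    # European, Arabic, Hebrew
    [(0xC2, 0xDF), (0x80, 0xBF)],
    # Indic, Thai, CJK, some symbols
    [(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    [(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
    [(0xEF, 0xEF), (0xA4, 0xBF), (0x80, 0xBF)],
    # Custom Area #1
    [(0xEE, 0xEE), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xEF, 0xEF), (0x80, 0xA3), (0x80, 0xBF)],
    # Supplementary characters
    [(0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF1, 0xF2), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF3, 0xF3), (0x80, 0xAF), (0x80, 0xBF), (0x80, 0xBF)],
    # Custom Area #2
    [(0xF3, 0xF3), (0xB0, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)],
]


def matches_utf8letter(data: bytes) -> bool:
    """Return True if data is exactly one byte sequence the rule accepts."""
    for alternative in ALTERNATIVES:
        if len(alternative) != len(data):
            continue
        # Every byte must fall inside the range at the same position.
        if all(lo <= b <= hi for b, (lo, hi) in zip(data, alternative)):
            return True
    return False


if __name__ == "__main__":
    for text in ["\u00e9", "\u20ac", "\U0001d11e", "A"]:
        encoded = text.encode("utf-8")
        print(encoded.hex(), matches_utf8letter(encoded))
```

Plain ASCII such as "A" is rejected here, as are overlong forms like
C0 80 and encoded surrogates like ED A0 80, because none of the
alternatives covers those bytes.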

Regards,
Martin


