[antlr-interest] Re: Problem With Special Chars - Detailed
Martin Probst
mail at martin-probst.com
Mon Jul 25 14:48:56 PDT 2005
Hi,
> These are the lines from my Vocab file. Is this the one you meant or
> some other?
I see, you don't know about the charVocabulary option I was referring
to. Here is a quote from a webpage that explains exactly your problem
and its solution:
> charVocabulary: Setting the lexer character vocabulary
> ANTLR processes Unicode. Because of this, ANTLR cannot make any
> assumptions about the character set in use, else it would wind up
> generating huge lexers. Instead ANTLR assumes that the character
> literals, string literals, and character ranges used in the lexer
> constitute the entire character set of interest. For example, in this
> lexer:
>
> class L extends Lexer;
> A : 'a';
> B : 'b';
> DIGIT : '0' .. '9';
>
> The implied character set is { 'a', 'b', '0', '1', '2', '3', '4', '5',
> '6', '7', '8', '9' }. This can produce unexpected results if you assume
> that the normal ASCII character set is always used. For example, in:
>
> class L extends Lexer;
> A : 'a';
> B : 'b';
> DIGIT : '0' .. '9';
> STRING: '"' (~'"')* '"';
>
> The lexer rule STRING will only match strings containing 'a', 'b' and
> the digits, which is usually not what you want. To control the
> character set used by the lexer, use the charVocabulary option. This
> example will use a general eight-bit character set.
>
> class L extends Lexer;
> options {
>     charVocabulary = '\3'..'\377';
> }
>
> ...
>
> This example uses the ASCII character set in conjunction with some
> values from the extended Unicode character set:
>
> class L extends Lexer;
> options {
>     charVocabulary = '\3'..'\377' | '\u1000'..'\u1fff';
> }
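
If it helps to see why the STRING rule behaves that way, here is a small Python sketch. It is not ANTLR code and not how ANTLR is implemented; it is only a toy model under the assumption that ~'"' means "every character in the vocabulary except the double quote". The names complement, implied and explicit are made up for this illustration.

```python
# Toy model (not ANTLR itself): ~'"' is taken relative to the lexer vocabulary.

def complement(vocabulary, excluded):
    """Return the vocabulary characters that are not in `excluded`."""
    return set(vocabulary) - set(excluded)

# Implied vocabulary: only the characters that appear in the lexer rules
# A, B, DIGIT and STRING (the double quote comes from the STRING rule).
implied = set('ab"') | set("0123456789")

# Explicit vocabulary, modelling charVocabulary = '\3'..'\377'
# (octal escapes, i.e. code points 3 through 255).
explicit = {chr(c) for c in range(0o3, 0o377 + 1)}

# What the (~'"')* part of STRING may consume under each vocabulary.
string_body_implied = complement(implied, '"')
string_body_explicit = complement(explicit, '"')

print(sorted(string_body_implied))
print('x' in string_body_implied, 'x' in string_body_explicit)
```

With the implied vocabulary the body of a string can only contain 'a', 'b' and the digits, so a character like 'x' is rejected. With the explicit eight-bit range it is accepted, which is the effect you want for your special characters.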