[antlr-interest] Re: Problem With Special Chars - Detailed

Martin Probst mail at martin-probst.com
Mon Jul 25 14:48:56 PDT 2005


Hi,

> These are the lines from my Vocab file. Is this the one you meant or
> some other?

I see, you don't know about the charVocabulary option I was referring
to. Here is a quote from a webpage that explains exactly your problem
and its solution:

> charVocabulary: Setting the lexer character vocabulary 
> ANTLR processes Unicode. Because of this, ANTLR cannot make any
> assumptions about the character set in use, else it would wind up
> generating huge lexers. Instead ANTLR assumes that the character
> literals, string literals, and character ranges used in the lexer
> constitute the entire character set of interest. For example, in this
> lexer:  
> 
> class L extends Lexer;
> A : 'a';
> B : 'b';
> DIGIT : '0' .. '9';   
> 
> The implied character set is { 'a', 'b', '0', '1', '2', '3', '4', '5',
> '6', '7', '8', '9' }. This can produce unexpected results if you assume
> that the normal ASCII character set is always used. For example, in:
> 
> class L extends Lexer;
> A : 'a';
> B : 'b';
> DIGIT : '0' .. '9';
> STRING: '"' (~'"')* '"';
> 
> The lexer rule STRING will only match strings containing 'a', 'b' and
> the digits, which is usually not what you want. To control the
> character set used by the lexer, use the charVocabulary option. This
> example will use a general eight-bit character set. 
> 
> class L extends Lexer;
> options {
> 	charVocabulary = '\3'..'\377';
> }
> ...
> 
> This example uses the ASCII character set in conjunction with some
> values from the extended Unicode character set:  
> 
> 
> class L extends Lexer;
> options {
> 	charVocabulary = '\3'..'\377' | '\u1000'..'\u1fff';
> }
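
If it helps to see why ~'"' behaves that way, here is a rough Java
sketch of the idea. This is not ANTLR's generated code, only a model I
made up for illustration: the class VocabDemo and its methods are
hypothetical names, and it assumes that ~c simply means "every
character in the vocabulary except c".

```java
import java.util.BitSet;

public class VocabDemo {
    // Build a vocabulary from inclusive ranges given as pairs: lo, hi, lo, hi, ...
    public static BitSet vocabulary(char... bounds) {
        BitSet set = new BitSet();
        for (int i = 0; i + 1 < bounds.length; i += 2) {
            // BitSet.set(from, to) excludes 'to', hence the + 1
            set.set(bounds[i], bounds[i + 1] + 1);
        }
        return set;
    }

    // Model of ~c: the vocabulary minus the single character c.
    public static BitSet complement(BitSet vocab, char c) {
        BitSet result = (BitSet) vocab.clone();
        result.clear(c);
        return result;
    }

    // Model of the rule  STRING: '"' (~'"')* '"';
    public static boolean matchesString(BitSet vocab, String input) {
        int last = input.length() - 1;
        if (last < 1 || input.charAt(0) != '"' || input.charAt(last) != '"') {
            return false;
        }
        BitSet notQuote = complement(vocab, '"');
        for (int i = 1; i < last; i++) {
            if (!notQuote.get(input.charAt(i))) {
                return false; // character lies outside the vocabulary
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Implied vocabulary of the quoted lexer: 'a', 'b', '0'..'9' and '"'
        BitSet implied = vocabulary('a', 'a', 'b', 'b', '0', '9', '"', '"');
        // Vocabulary after setting charVocabulary = '\3'..'\377'
        BitSet eightBit = vocabulary('\3', '\377');

        System.out.println(matchesString(implied, "\"ab12\""));
        System.out.println(matchesString(implied, "\"hello\""));
        System.out.println(matchesString(eightBit, "\"hello\""));
    }
}
```

With the implied vocabulary, "hello" is rejected because 'h', 'e', 'l'
and 'o' never appear in the lexer. With the eight-bit range it is
accepted.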
