[antlr-interest] Lexing C-style strings - problems matching characters not in vocab

Chris Seaton chris at chrisseaton.com
Sat Feb 25 06:44:31 PST 2006


Hello,

I'm writing a lexer that needs to recognise bog-standard C-style strings:

"like this"

At the moment I'm using the following, which seems to be how most of
the grammars on the site also work:

STRING :
     '"' (~('\r' | '\n' | '"' | '\\') | '\\' '"')* '"'
     ;

Looking at the generated code, though, I can see that this won't work:
the ~ operator doesn't match every character apart from the ones
specified. It seems to match a set of basic characters minus the
ones I've negated.
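
To illustrate what I think is happening, here is a minimal Java sketch of
the idea. It is not the actual generated code: the class name, the method
name and the 0..127 "basic" vocabulary are assumptions of mine, purely for
illustration.

```java
import java.util.BitSet;

public class VocabComplement {
    // Builds the set matched by ~(excluded) when the lexer's vocabulary is
    // ASSUMED to be the code range 0..vocabMax (inclusive).
    static BitSet complement(int vocabMax, char... excluded) {
        BitSet set = new BitSet(vocabMax + 1);
        set.set(0, vocabMax + 1);      // start from the whole vocabulary
        for (char c : excluded) {
            set.clear(c);              // then remove the negated characters
        }
        return set;
    }

    public static void main(String[] args) {
        BitSet basic = complement(127, '\r', '\n', '"', '\\');
        System.out.println(basic.get('a'));       // in the vocabulary, not negated
        System.out.println(basic.get('"'));       // negated
        System.out.println(basic.get('\u00A3'));  // pound sign, outside the assumed vocabulary
    }
}
```

If that is roughly right, a character is only matched by the negated set
when it was in the vocabulary to begin with, which is why I'm worried
about anything outside the basic range.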

I don't think my STRING rule will match characters such as £, ©, ¼
and so on. What do I do about this? Add them explicitly to the
expression? I can't go through the entire Unicode spec adding every
character to my rule - it would be huge.

I looked at Scanning Unicode Characters in the docs, but this only
refers to 16-bit Unicode characters - what do I do for characters
outside this arbitrary limit?
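
To make the 16-bit limit concrete, here is a small Java sketch (it needs
the Java 5 Character API; the class and method names are made up for
illustration). It shows that a character above U+FFFF does not fit in a
single char and is stored as two 16-bit units instead.

```java
public class BeyondSixteenBits {
    // Number of 16-bit char units Java needs to store one code point.
    static int unitsFor(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        System.out.println(unitsFor(0x00A3));   // pound sign, fits in one unit
        System.out.println(unitsFor(0x1D11E));  // musical G clef, above U+FFFF
    }
}
```

So a lexer that works one 16-bit char at a time would presumably see such
a character as two separate inputs rather than one.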

Thanks very much.

Chris Seaton

