[antlr-interest] Lexing C-style strings - problems matching
characters not in vocab
Chris Seaton
chris at chrisseaton.com
Sat Feb 25 06:44:31 PST 2006
Hello,
I'm writing a lexer that needs to recognise bog standard C-style strings
"like this"
At the moment I'm using the following, which seems to be how most of
the grammars on the site also work:
STRING :
'"' (~('\r' | '\n' | '"' | '\\') | '\\' '"')* '"'
;
Looking at the generated code, though, I can see that this won't work:
the ~ operator doesn't match every character apart from the ones
specified. Instead, it seems to match a fixed set of basic characters
minus the ones I've negated.
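To make the concern concrete, here is a minimal Python sketch of the behaviour as I understand it. It is not ANTLR's generated code: the names VOCAB, EXCLUDED and matches_negated_set are made up for illustration, and the 7-bit vocabulary is only an assumption about what the "basic characters" are.

```python
# Assumed vocabulary for illustration only: the 7-bit ASCII code points.
VOCAB = frozenset(range(0x00, 0x80))

# The characters negated in the STRING rule: CR, LF, double quote, backslash.
EXCLUDED = frozenset(ord(c) for c in '\r\n"\\')

# The complement is taken relative to the vocabulary, not to all of Unicode.
NEGATED_SET = VOCAB - EXCLUDED


def matches_negated_set(ch):
    """Return True if ch would be accepted by the negated set."""
    return ord(ch) in NEGATED_SET


for ch in ['a', '"', '£', '©', '¼']:
    print(repr(ch), matches_negated_set(ch))
```

Under that assumption only 'a' is accepted: the quote is rejected because it is negated, and £, ©, ¼ are rejected simply because they were never in the vocabulary.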
I don't think my STRING rule will match characters such as £, ©, ¼
and so on. What do I do about this? Add them explicitly to the
expression? I can't go through the entire Unicode specification adding
every character to my rule - it would be huge.
I looked at Scanning Unicode Characters in the docs, but this only
refers to 16-bit Unicode characters - what do I do for characters
outside this arbitrary limit?
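For reference, this is what I mean by characters outside the 16-bit limit: in UTF-16, a code point above U+FFFF takes two 16-bit code units (a surrogate pair), so a range of 16-bit characters cannot name it as a single character. A small Python sketch; the helper name utf16_units is hypothetical.

```python
def utf16_units(ch):
    """Return the 16-bit UTF-16 code units that encode the string ch."""
    data = ch.encode('utf-16-be')  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], 'big') for i in range(0, len(data), 2)]


# U+00A3 POUND SIGN fits in a single 16-bit unit.
print([hex(u) for u in utf16_units('\u00a3')])

# U+1D11E MUSICAL SYMBOL G CLEF needs a surrogate pair.
print([hex(u) for u in utf16_units('\U0001D11E')])
```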
Thanks very much.
Chris Seaton