[antlr-interest] All except...

Tue Jun 5 04:52:00 PDT 2007

At 11:26 5/06/2007, Phil Oliver wrote:
 >how does one create an ANTLR v3 rule (either lexer or parser)
 >that easily matches any character EXCEPT a set of other
 >characters? e.g. let's say I have:
 >
 >Char : '\u0009' | '\u000A' | '\u000D' | '\u0020'..'\uD7FF' |
 >'\uE000'..'\uFFFD';
 >
 >and I want to define a rule that matches any Char except another
 >list of characters. In the EBNF grammar used in the XQuery spec,
 >for example, it would be:
 >
 >Char2: Char - ('<' | '>');
 >
 >which would cause Char2 to match any character in Char except for
 >'<' or '>'.  But that operator isn't part of ANTLR (evidently). I've
 >looked at the ~ unary operator but that doesn't handle this job,
 >unless I'm overlooking something.

ANTLR doesn't currently support set subtraction, though it does 
support set addition (through the | operator) and negation (with 
~).  Since your Char definition is already a "just about 
everything" set, you should first redeclare it as such:

Char: ~('\u0000'..'\u0008' | '\u000B' | '\u000C' | 
'\u000E'..'\u001F' | '\uD800'..'\uDFFF' | '\uFFFE' | '\uFFFF');

(If you need to support code points above U+FFFF then the upper 
bound might be a bit different.)
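
As a sanity check on the excluded set, here is a small Python sketch (nothing to do with ANTLR itself; the helper names char_codes, to_ranges and complement_ranges are made up for illustration). It computes the complement of the original Char rule, assuming the universe is the 16-bit range U+0000..U+FFFF. Note that the computed complement starts at U+0000, since the original Char never matches that character.

```python
def char_codes():
    """Code points matched by the original Char rule from the question."""
    codes = {0x0009, 0x000A, 0x000D}
    codes.update(range(0x0020, 0xD7FF + 1))
    codes.update(range(0xE000, 0xFFFD + 1))
    return codes


def to_ranges(codes):
    """Collapse a set of code points into sorted, inclusive (low, high) ranges."""
    ranges = []
    for c in sorted(codes):
        if ranges and c == ranges[-1][1] + 1:
            # Extend the current run by one code point.
            ranges[-1] = (ranges[-1][0], c)
        else:
            # Start a new run.
            ranges.append((c, c))
    return ranges


def complement_ranges():
    """Everything in U+0000..U+FFFF (assumed universe) that Char does NOT match."""
    universe = set(range(0x10000))
    return to_ranges(universe - char_codes())


if __name__ == "__main__":
    for low, high in complement_ranges():
        print(f"U+{low:04X}..U+{high:04X}")
```

Each range printed is one alternative to put inside the ~( ... ) set.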

Then declare a fragment rule that also excludes the characters you 
want left out of Char2, and define both Char and Char2 in terms of it:

fragment Char0: ~('\u0000'..'\u0008' | '\u000B' | '\u000C' | 
'\u000E'..'\u001F' | '<' | '>' | '\uD800'..'\uDFFF' | '\uFFFE' | 
'\uFFFF');
Char: Char0 | '<' | '>';
Char2: Char0;
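
The trick above is just the identity A - B = ~(~A | B). A minimal Python sketch of that identity on plain sets of code points follows. It again assumes a 16-bit universe, and the names UNIVERSE, code_range and subtract_via_negation are hypothetical, chosen only for this illustration.

```python
# Assumed universe: all 16-bit code points, U+0000..U+FFFF.
UNIVERSE = set(range(0x10000))


def code_range(low, high):
    """Inclusive range of code points, like 'low'..'high' in a grammar."""
    return set(range(low, high + 1))


# The original Char rule from the question.
CHAR = {0x0009, 0x000A, 0x000D} | code_range(0x0020, 0xD7FF) | code_range(0xE000, 0xFFFD)

# The characters Char2 should not match.
REMOVED = {ord("<"), ord(">")}


def subtract_via_negation(base, removed):
    """Emulate base - removed using only union and negation: ~(~base | removed)."""
    return UNIVERSE - ((UNIVERSE - base) | removed)


# Plays the role of the Char0 fragment (and therefore of Char2).
CHAR0 = subtract_via_negation(CHAR, REMOVED)

if __name__ == "__main__":
    print(CHAR0 == CHAR - REMOVED)   # Char0 is Char minus '<' and '>'
    print(CHAR0 | REMOVED == CHAR)   # Char0 | '<' | '>' rebuilds Char
```

Both comparisons hold only because '<' and '>' are themselves members of Char; for a character outside Char, the second one would fail.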