[antlr-interest] Match any unicode character

Basil Shkara bshkara at gmail.com
Fri Nov 30 18:14:12 PST 2007


Thanks Harald for your suggestions.  They were very helpful.

I'll document my issues and the solutions I came across in case  
someone else is stuck with the same problem.

I have now rewritten my grammar and it behaves more or less as
expected.  The issue before was that I had not properly taken into
account how the lexer consumes input for the tokens I specified.  I
was also declaring very broad tokens such as ~(WS | NEWLINE)+ in the
midst of other token declarations, which made the tokens declared
below them unreachable.

The current implementation of my grammar no longer relies upon  
negating alternates like the token above.  Instead I define all my  
tokens and then have a final catch-all token:
OTHER	:	.;
which captures everything not already matched by a token above.
Terence's book provided the insight for this behaviour.
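
To illustrate the ordering behaviour described above, here is a small
Python model of a lexer that tries its rules in declaration order and
falls back to a catch-all.  It is only a sketch of the idea, not how
ANTLR is implemented; the names RULES, SHADOWED, ANY and tokenize are
made up for this example.

```python
import re

# Hypothetical token table. Order matters: the first rule that matches wins.
RULES = [
    ("COLON", re.compile(r":")),
    ("DOUBLEQUOTE", re.compile(r'"')),
    ("SINGLEQUOTE", re.compile(r"'")),
    ("OTHER", re.compile(r".", re.DOTALL)),  # catch-all, like OTHER : . ;
]

# Same table with a broad rule placed first, similar in spirit to
# ~(WS | NEWLINE)+ (here: anything except space and newline).
SHADOWED = [("ANY", re.compile(r"[^ \n]+"))] + RULES


def tokenize(text, rules=RULES):
    """Return (name, lexeme) pairs, trying the rules in order at each position."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in rules:
            m = pattern.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return tokens
```

With RULES, tokenize(':é"') yields a COLON, an OTHER and a DOUBLEQUOTE
token.  With SHADOWED, the input ":a:" comes back as a single ANY
token, so the COLON rule below it never gets a chance to match.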

Then, in the parser rules, it becomes simple to specify which tokens
are acceptable as text, for example:
text	:	OTHER | DOUBLEQUOTE | SINGLEQUOTE;

This then allows me to use this rule in another rule defining the  
specific text I want recognised to add to my AST or perform an action:
recognised	:	COLON text+ COLON;
Because COLON is defined above OTHER, the lexer has already turned
each ':' into a COLON token, so colons are never matched by the rule
text.
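
As a rough illustration of how the two rules fit together, the
following Python sketch tokenises a string and then checks it against
COLON text+ COLON.  It is a hand-written model under my own
assumptions, not code generated by ANTLR; TOKEN_RE, token_names and
match_recognised are hypothetical names.

```python
import re

# Alternatives are tried left to right, so COLON and the quote tokens
# take priority over the catch-all OTHER.
TOKEN_RE = re.compile(
    r"(?P<COLON>:)"
    r'|(?P<DOUBLEQUOTE>")'
    r"|(?P<SINGLEQUOTE>')"
    r"|(?P<OTHER>.)",
    re.DOTALL,
)

# Token types accepted by the parser rule text.
TEXT_TOKENS = {"OTHER", "DOUBLEQUOTE", "SINGLEQUOTE"}


def token_names(text):
    """Return the token type name for each character of the input."""
    return [m.lastgroup for m in TOKEN_RE.finditer(text)]


def match_recognised(names, start=0):
    """Match COLON text+ COLON at index start.

    Returns the index just past the closing COLON, or None if the
    token sequence does not match there.
    """
    i = start
    if i >= len(names) or names[i] != "COLON":
        return None
    i += 1
    count = 0
    while i < len(names) and names[i] in TEXT_TOKENS:
        i += 1
        count += 1
    if count == 0 or i >= len(names) or names[i] != "COLON":
        return None
    return i + 1
```

Since a ':' always becomes a COLON token, the loop for text+ stops at
the first colon it meets, which is what closes the recognised rule.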

So by approaching the grammar design this way, the parser is able to
accept any valid Unicode character within the constraints of the
rules.

Hope this helps someone!
Baz.

