[antlr-interest] Match any unicode character
Basil Shkara
bshkara at gmail.com
Fri Nov 30 18:14:12 PST 2007
Thanks Harald for your suggestions. They were very helpful.
I'll document my issues and the solutions I came across in case
someone else is stuck with the same problem.
I have now rewritten my grammar and it mostly behaves as expected.
The issue before was that I had not properly accounted for the way
the lexer consumes input for the tokens I specified. I was also
specifying huge token ranges like ~(WS | NEWLINE)+ in the midst of
other token declarations, which made the tokens declared below them
unreachable.
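A rough way to picture the unreachable-token problem is the small
Python model below. It is not ANTLR, only a sketch that assumes the
first listed rule wins whenever more than one rule can match at the
current position. The tokenize helper, the rule names and the regular
expressions are all made up for illustration; ANYTHING stands in for
~(WS | NEWLINE)+.

```python
import re

def tokenize(rules, text):
    """Emit (name, lexeme) pairs; at each position the first listed rule that matches wins."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    pos, tokens = 0, []
    while pos < len(text):
        for name, regex in compiled:
            m = regex.match(text, pos)
            if m and m.end() > pos:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return tokens

# Hypothetical rule set: a broad rule sits above a more specific one.
BROAD_FIRST = [
    ("WS", r"[ \t]+"),
    ("NEWLINE", r"\r?\n"),
    ("ANYTHING", r"[^ \t\r\n]+"),  # stands in for ~(WS | NEWLINE)+
    ("COLON", r":"),               # never reached: ANYTHING already matches ':'
]

tokens = tokenize(BROAD_FIRST, ":smile: hi")
print(tokens)
# [('ANYTHING', ':smile:'), ('WS', ' '), ('ANYTHING', 'hi')]
```

No COLON token is ever produced here, because the broad rule has
already swallowed the colons along with everything else that is not
whitespace.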
The current implementation of my grammar no longer relies upon
negated alternatives like the token above. Instead I define all my
specific tokens first and then finish with a catch-all token:
OTHER : .;
which matches any single character not already matched by one of the
tokens defined above it.
Terence's book provided the insight for this behaviour.
Then, when writing parser rules, it becomes simple to specify which
tokens are acceptable in a given position, such as:
text : OTHER | DOUBLEQUOTE | SINGLEQUOTE;
This then allows me to use this rule in another rule defining the
specific text I want recognised to add to my AST or perform an action:
recognised : COLON text+ COLON;
Because COLON is defined above OTHER, a ':' is always tokenised as
COLON rather than OTHER, and so it can never be matched by the
rule: text.
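The interaction between the token order and the two parser rules can
be sketched in Python as follows. This is only a model of the idea,
not ANTLR's implementation: it assumes the first listed token rule
wins at each position, it leaves out WS and NEWLINE for brevity (so a
space is just OTHER here), and the helper names tokenize and
recognised are invented for the example.

```python
import re

# Hypothetical token table: specific tokens first, catch-all last.
RULES = [
    ("COLON", re.compile(r":")),
    ("DOUBLEQUOTE", re.compile(r'"')),
    ("SINGLEQUOTE", re.compile(r"'")),
    ("OTHER", re.compile(r".", re.DOTALL)),  # any single character, ASCII or not
]

def tokenize(text):
    """At each position, emit the first listed rule that matches one character."""
    tokens, pos = [], 0
    while pos < len(text):
        for name, regex in RULES:
            m = regex.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
    return tokens

# text : OTHER | DOUBLEQUOTE | SINGLEQUOTE;
TEXT_TOKENS = {"OTHER", "DOUBLEQUOTE", "SINGLEQUOTE"}

def recognised(tokens):
    """Model of: recognised : COLON text+ COLON; (all tokens must be consumed)."""
    names = [name for name, _ in tokens]
    return (
        len(names) >= 3
        and names[0] == "COLON"
        and names[-1] == "COLON"
        and all(n in TEXT_TOKENS for n in names[1:-1])
    )

print(recognised(tokenize(':héllo "世界":')))  # True
print(recognised(tokenize(":a:b:")))          # False, the inner ':' is a COLON, not text
```

The second input fails because the colon in the middle arrives as a
COLON token, which the text rule does not accept.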
So by approaching the grammar design this way, the parser is able to
accept any valid Unicode character within the constraints of the
rules.
Hope this helps someone!
Baz.