[antlr-interest] Debugging: how? (Why do I get MismatchedTokenException or UnwantedTokenException?) Unhelpful error messages.

John B. Brodie jbb at acm.org
Thu Oct 30 09:08:03 PDT 2008


>...snipped....
>> ...... but I still have no solution to my
> problem: how can I make the variable in my label rule be anything?  That
> is, I would think anything except whitespace and braces and control
> characters would be fine.  In particular, it definitely has to accept
> any word in any script, along with some punctuation characters such as .
> - _ $ and probably more.

The best solution is to redefine your language such that a LABEL is a quoted 
string or something similar that the Lexer can identify.

But you probably don't have control over your language definition, otherwise 
you would have done the redefinition already and moved on to more important 
stuff.

So anyway....

An alternative - that has lots of problems - I hesitate to mention it - is to 
make LABEL a parser rule, like this (just 1 way, other ways are possible, 
probably involving syntactic predicates):

// label formulas
label : labelHead VARIABLE labelTail ;

labelHead : FUNCTION | CATEGORY | WORD | LEMMA | MORPHOLOGY ;
labelTail : (~CLOSE)+ ;

Now this also means that a Lexer rule such as:

ANY : . ;

must be added as the VERY LAST rule in the Lexer. This ensures that any 
character not recognized as a token by the other Lexer rules gets identified 
as an ANY token - note that ANY is intentionally not `.+` as that would trip 
across the greedy nature of Antlr's Lexing strategy and consume all 
characters. ANY could be tweaked to exclude control characters and perhaps 
other characters.
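
To illustrate why ANY is a single `.` rather than `.+`, here is a small Python 
model of a longest-match lexer in which ties go to the rule listed first. It is 
only a sketch: it is not ANTLR's actual prediction algorithm, and the CLOSE, 
VARIABLE and WS patterns are assumed stand-ins for the real grammar's rules 
(only ANY comes from the text above).

```python
import re

def make_rules(any_pattern):
    """Build an ordered rule list; ANY is deliberately the last entry."""
    return [
        ("CLOSE", re.compile(r"\)")),              # assumed definition
        ("VARIABLE", re.compile(r"[A-Za-z]+")),    # assumed: ASCII letters only
        ("WS", re.compile(r"[ \t\r\n]+")),         # assumed definition
        ("ANY", re.compile(any_pattern, re.DOTALL)),
    ]

def tokenize(text, rules):
    """Longest match wins; on a tie the rule listed first is kept."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in rules:
            m = pattern.match(text, pos)
            if m and m.end() > best_end:   # strictly longer, so earlier rules win ties
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError("no rule matches at offset %d" % pos)
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

sample = "Einf\u00fchrung)"   # "Einführung)" with a precomposed u-umlaut

# ANY : . ;  only picks up the one character nothing else recognizes.
print(tokenize(sample, make_rules(r".")))

# ANY : .+ ; is always the longest match, so it swallows the whole input.
print(tokenize(sample, make_rules(r".+")))
```

In this model the single-character ANY never beats a more specific rule, 
because it can at best tie with it and it is listed last.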

So anyway...

Under the above Parser rules a labelTail will match any non-empty sequence of 
tokens up to but not including a CLOSE token.

But it will also match whitespace and comments - since those tokens are on the 
HIDDEN channel and not seen by the parser.

Not sure if that is what you want; it is, I believe, the same functionality as 
your original LABEL : ~(')')+; rule which also happily ate whitespace and 
comments.....
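
The hidden-channel effect can be sketched in Python as follows. This is a toy 
model, not ANTLR code: tokens are plain (type, text, channel) tuples, and the 
names parser_view and label_tail are made up for the illustration. Only the 
HIDDEN channel name and the shape of the labelTail rule come from the text 
above.

```python
DEFAULT, HIDDEN = "DEFAULT", "HIDDEN"

def parser_view(tokens):
    """Return only the tokens a parser would see (default channel)."""
    return [tok for tok in tokens if tok[2] == DEFAULT]

def label_tail(visible, start=0):
    """Model of  labelTail : (~CLOSE)+ ;  collect tokens up to, not including, CLOSE."""
    tail = []
    i = start
    while i < len(visible) and visible[i][0] != "CLOSE":
        tail.append(visible[i])
        i += 1
    if not tail:
        raise ValueError("labelTail needs at least one token")
    return tail

# Hypothetical token stream for the input "foo bar)".
stream = [
    ("VARIABLE", "foo", DEFAULT),
    ("WS", " ", HIDDEN),          # whitespace was lexed, but sent off-channel
    ("VARIABLE", "bar", DEFAULT),
    ("CLOSE", ")", DEFAULT),
]

tail = label_tail(parser_view(stream))
print([text for _, text, _ in tail])   # ['foo', 'bar']
```

The whitespace between the two words is accepted without complaint, but it 
never shows up in the list of tokens the rule matched.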

Another bad part of the above labelTail rule is that it is now a list of 
tokens rather than a single token. So whatever processing you are performing 
upon the parsed result - eventually an AST perhaps? - will be much more 
complicated. And further, the list of tokens may seem goofy: for your 
original example of test input, "(word x Einführung)", the labelTail will be a 
list of 3 tokens: VARIABLE, ANY, VARIABLE. That is, it will be the tokens for 
"Einf" a VARIABLE, "ü" an ANY, and "hrung" a VARIABLE. I do not speak your 
language so I am only marginally bothered by this, but your mileage may vary.
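
One possible way to soften the list-of-tokens problem is to remember where the 
first and last tokens sit in the input and slice the original text once. The 
Python sketch below shows the three-way split and that rejoining step. It 
assumes VARIABLE matches ASCII letters only (a guess that is consistent with 
the split described above); lex and tail_text are hypothetical helper names, 
not part of any ANTLR API.

```python
import re

# Assumed stand-in lexer: alternatives are tried in order, ANY is last.
TOKEN_RE = re.compile(
    r"(?P<CLOSE>\))|(?P<VARIABLE>[A-Za-z]+)|(?P<ANY>.)",
    re.DOTALL,
)

def lex(text):
    """Return (type, start, end) for each token found in text."""
    return [(m.lastgroup, m.start(), m.end()) for m in TOKEN_RE.finditer(text)]

def tail_text(text):
    """Collect tokens up to CLOSE, then slice the source once to rejoin them."""
    tail = []
    for tok in lex(text):
        if tok[0] == "CLOSE":
            break
        tail.append(tok)
    if not tail:
        raise ValueError("labelTail needs at least one token")
    first_start = tail[0][1]
    last_end = tail[-1][2]
    return [tok[0] for tok in tail], text[first_start:last_end]

types, joined = tail_text("Einf\u00fchrung)")   # "Einführung)"
print(types)    # ['VARIABLE', 'ANY', 'VARIABLE']
print(joined)   # Einführung
```

Slicing by offsets rather than concatenating token texts also keeps anything 
that sat between the tokens, which may or may not be what you want.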

Hope this helps...
   -jbb
