[antlr-interest] Debugging: how? (Why do I get MismatchedTokenException or UnwantedTokenException?) Unhelpful error messages.

Thu Oct 30 10:54:53 PDT 2008

Jim Idle wrote:
> On Thu, 2008-10-30 at 15:28 +0100, Hendrik Maryns wrote:
>> John B. Brodie wrote:
>> > Greetings!
>> > 
>> > Hendrik Maryns asked:
>> > 
>> >> I showed you my grammar yesterday.  Now trying it out on some simple
>> >> inputs blows me away right away: it doesn’t even parse anything.
>>
>> > Your problem seems to be with your Lexer rule for LABEL which is :
>> > 
>> > LABEL : ~(')')+ ;
>> > 
>> > this means that any sequence of characters that is not a ')' must be a
>> > LABEL.
>>
>> I am starting to understand the difference between lexer and parser now.
>>  I was thinking of it as some sort of regular expression parser, but
>> since the lexer does not know anything about the parser, it doesn’t care
>> about it.
>>
>> > another problem is that ')' is not matched by any Lexer rule. did you
>> > want OPEN and CLOSE to be parens?
>>
>> Yes, sorry, a relic of debugging.
>>
>> >> Grateful for any suggestions,
>> > 
>> >>.....remainder of message snipped....
>> > 
>> > Hope this helps
>>
>> It did, in that I know what is wrong, but I still have no solution to my
>> problem: how can I make the variable in my label rule be anything?  That
>> is, I would think anything except whitespace and braces and control
>> characters would be fine.  In particular, it definitely has to accept
>> any word in any script, along with some punctuation characters such as .
>> - _ $ and probably more.
> 
> There are a couple of solutions, but you don't say what the lexical
> significance of your labels is, or whether this is a language you are
> inventing (in which case don't do that), or one you are following a spec
> for.

I like your suggestion: don’t do that!

Well, I am following a spec, but I am free to change it.  Still, I
cannot believe this would be impossible: I simply want a lisp-like
grammar that takes whatever is there.  See my other posts.

Expected input: (word x whatever), where whatever can be really
anything, in particular, any word in any human language, so also
Chinese, etc.  And additionally, some punctuation should be allowed.
The ‘whatever’ is clearly defined though: it starts after the space and
ends before the closing parenthesis.  It would be a piece of cake to
write this as a regex: /[^ ][^)]*/, but unfortunately, as John pointed
out, if I made a lexer rule out of this, it would eat everything,
including the (word, which of course should not be matched.  I think the
lexer rules are stupid: they should simply be applied in order of
appearance.  I see absolutely no reason for this ‘rule which eats most
wins’ system.
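
To make the problem John described concrete, here is a minimal Java sketch. It uses plain java.util.regex as a stand-in for the LABEL rule, not ANTLR itself, so it only illustrates what the pattern is willing to accept. The class name, the helper names, and the narrower "atom" rule (stop at whitespace and at either parenthesis) are assumptions made for illustration, not part of the MonaSearch spec.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyLabelDemo {
    // Regex stand-in for the lexer rule LABEL : ~(')')+ ;
    static final Pattern GREEDY_LABEL = Pattern.compile("[^)]+");

    // What the greedy rule claims when it is tried at the start of the input.
    static String greedyPrefix(String input) {
        Matcher m = GREEDY_LABEL.matcher(input);
        return m.lookingAt() ? m.group() : "";
    }

    // Assumed delimiters: whitespace and both parentheses.
    static boolean isDelimiter(char c) {
        return c == '(' || c == ')' || Character.isWhitespace(c);
    }

    // Hand-written scanner with the narrower atom rule:
    // an atom ends at whitespace or at a parenthesis.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (Character.isWhitespace(c)) {
                pos++;                               // skip separators
            } else if (c == '(' || c == ')') {
                tokens.add(String.valueOf(c));       // OPEN / CLOSE
                pos++;
            } else {
                int start = pos;
                while (pos < input.length() && !isDelimiter(input.charAt(pos))) {
                    pos++;
                }
                tokens.add(input.substring(start, pos));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "(word x whatever)";
        System.out.println(greedyPrefix(input)); // everything up to the ')'
        System.out.println(tokenize(input));     // separate tokens
    }
}
```

The greedy pattern claims the opening parenthesis and the keyword along with the label, while the narrower rule leaves them as separate tokens. The price is that a label can then no longer contain spaces, which the regex above would have allowed.
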

> In general, such labels tend to be valid in certain places only, such as
> the start of a line/statement, only following goto and so on. If this is
> the case, then you use a semantic predicate to check if you are at the
> first character position in a line, then consume everything up to
> whitespace and return LABEL. After goto and gosub, then consume the
> label spec within the definitions of such keywords, make the text of the
> token be the label, and extract the label from the token in the parser.
> You just have to think creatively about the trigger points that indicate
> a label is/could be, next.

This seems like the way to go.  Could you write this down in newbie
words please?  While I can make some sense of it, it is too abstract to
be able to implement it yet.

> What language is this? This knowledge may help people help you.

I describe it at
http://tcl.sfs.uni-tuebingen.de/MonaSearch/doc/#formula-syntax, but note
that I can change that if need be.  I would prefer not to, since it
would break existing formulas, but there are not so many of them.

> If there are no lexical points that trigger a label interpretation, then
> the next best thing is to construct a parser rule that accumulates label
> components:
> 
> label : WORD ( { checkNoSpace() }?=> labelstuff )* ;

I have been wondering what this => in some grammars is.  Where can I
read about it?

> labelstuff
>        : WORD | DOT | UNDERSCORE | BANG | keywords ... ;
> 
> Then build the text of the label from the text of the individual tokens
> and rewrite as a LABEL for the AST.
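
One possible reading of the checkNoSpace() idea above, as a minimal Java sketch outside ANTLR: two tokens belong to the same label only if the second starts exactly where the first ended. The Tok class, its offset fields, and the adjacent/joinLabel helpers are hypothetical names for illustration; they are not ANTLR's token API and not necessarily what Jim's checkNoSpace() does.

```java
import java.util.Arrays;
import java.util.List;

class LabelJoinDemo {
    // Hypothetical token: its text plus inclusive start/stop character offsets.
    static final class Tok {
        final String text;
        final int start;
        final int stop;

        Tok(String text, int start, int stop) {
            this.text = text;
            this.start = start;
            this.stop = stop;
        }
    }

    // Plays the role of checkNoSpace(): true when b starts right after a ends.
    static boolean adjacent(Tok a, Tok b) {
        return a.stop + 1 == b.start;
    }

    // Concatenates the first token with every directly adjacent follower,
    // stopping at the first gap (for example a space) in the input.
    static String joinLabel(List<Tok> toks) {
        if (toks.isEmpty()) {
            return "";
        }
        StringBuilder label = new StringBuilder(toks.get(0).text);
        for (int i = 1; i < toks.size() && adjacent(toks.get(i - 1), toks.get(i)); i++) {
            label.append(toks.get(i).text);
        }
        return label.toString();
    }

    public static void main(String[] args) {
        // Tokens for the input "foo.bar baz"
        List<Tok> toks = Arrays.asList(
            new Tok("foo", 0, 2), new Tok(".", 3, 3),
            new Tok("bar", 4, 6), new Tok("baz", 8, 10));
        System.out.println(joinLabel(toks));
    }
}
```
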
> 
> Can't be any more specific without knowing what you are trying to parse.
> You usually have to look for specific solutions for your DSL when you
> get into this stuff, as usually it means the language design was weak in
> the first place.

I suppose it is.  I think I should start using quoted strings.  But it
is also a very educational discussion which, to me, is exposing some
of ANTLR’s weaknesses (such as no \p{alpha} classes).
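
Although the grammar notation has no \p{alpha}-style classes, the Java side does know about Unicode. The sketch below expresses the requirement stated earlier in the thread (anything except whitespace, parentheses and control characters) as a plain Java check. LabelChars, isLabelChar and isValidLabel are hypothetical names, and how such a helper would be wired into a grammar action or predicate is left open here.

```java
class LabelChars {
    // A code point may appear in a label unless it is a parenthesis,
    // whitespace, or an ISO control character (assumed reading of the spec).
    static boolean isLabelChar(int codePoint) {
        return codePoint != '(' && codePoint != ')'
            && !Character.isWhitespace(codePoint)
            && !Character.isISOControl(codePoint);
    }

    // A label is valid if it is non-empty and every code point is allowed.
    static boolean isValidLabel(String s) {
        return !s.isEmpty() && s.codePoints().allMatch(LabelChars::isLabelChar);
    }

    public static void main(String[] args) {
        System.out.println(isValidLabel("\u4f60\u597d"));  // a Chinese word
        System.out.println(isValidLabel("foo.bar-baz_$")); // punctuation from the thread
        System.out.println(isValidLabel("a b"));           // contains a space
    }
}
```

Working on code points rather than chars keeps characters outside the Basic Multilingual Plane intact.
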

H.
-- 
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
