[antlr-interest] Debugging: how? (Why do I get MismatchedTokenException or UnwantedTokenException?) Unhelpful error messages.

Mon Nov 10 07:24:16 PST 2008

Hendrik Maryns schreef:
> Jim Idle schreef:
>> On Thu, 2008-10-30 at 15:28 +0100, Hendrik Maryns wrote:
>>> John B. Brodie schreef:
>>>> Greetings!
>>>>
>>>> Hendrik Maryns asked:
>>>>
>>>>> I showed you my grammar yesterday.  Now trying it out on some simple
>>>>> inputs blows me away right away: it doesn’t even parse anything.
>>>> Your problem seems to be with your Lexer rule for LABEL which is :
>>>>
>>>> LABEL : ~(')')+ ;
>>>>
>>>> this means that any sequence of characters that is not a ')' must be a
>>>> LABEL.
>>> I am starting to understand the difference between lexer and parser now.
>>>  I was thinking of it as some sort of regular expression parser, but
>>> since the lexer does not know anything about the parser, it doesn’t care
>>> about it.
>>>
>>>> another problem is that ')' is not matched by any Lexer rule. did you
>>>> want OPEN and CLOSE to be parens?
>>> Yes, sorry, a relic of debugging.
>>>
>>>>> Grateful for any suggestions,
>>>>> .....remainder of message snipped....
>>>> Hope this helps
>>> It did, in that I know what is wrong, but I still have no solution to my
>>> problem: how can I make the variable in my label rule be anything?  That
>>> is, I would think anything except whitespace, parentheses, and control
>>> characters would be fine.  In particular, it definitely has to accept
>>> any word in any script, along with some punctuation characters such as .
>>> - _ $ and probably more.
>> There are a couple of solutions, but you don't say what the lexical
>> significance of your labels is, or whether this is a language you are
>> inventing (in which case don't do that), or one you are following a spec
>> for.
> 
> I like your suggestion: don’t do that!
> 
> Well, I am following a spec, but I am free to change it.  Although I
> cannot believe this wouldn’t be possible: I simply want a Lisp-like
> grammar that takes whatever is there.  See my other posts.
> 
> Expected input: (word x whatever), where whatever can be really
> anything, in particular, any word in any human language, so also
> Chinese, etc.  And additionally, some punctuation should be allowed.
> The ‘whatever’ is clearly defined though: it starts after the space and
> ends before the closing parenthesis.  It would be a piece of cake to
> write this as a regex: /[^ ][^)]*/, but unfortunately, as John pointed
> out, if I made a lexer rule out of this, it would eat everything,
> including the (word, which of course should not be matched.  I think the
> lexer rules are stupid: the lexer should simply apply the rules in order
> of appearance.  I see absolutely no reason for this ‘rule which eats
> most wins’ system.
> 
>> In general, such labels tend to be valid in certain places only, such as
>> the start of a line/statement, only following goto and so on. If this is
>> the case, then you use a semantic predicate to check if you are at the
>> first character position in a line, then consume everything up to
>> whitespace and return LABEL. After goto and gosub, then consume the
>> label spec within the definitions of such keywords, make the text of the
>> token be the label, and extract the label from the token in the parser.
>> You just have to think creatively about the trigger points that indicate
>> a label is, or could be, next.
> 
> This seems like the way to go.  Could you write this down in newbie
> words please?  While I can make some sense of it, it is too abstract to
> be able to implement it yet.
> 
>> What language is this? This knowledge may help people help you.
> 
> I describe it at
> http://tcl.sfs.uni-tuebingen.de/MonaSearch/doc/#formula-syntax, but note
> that I can change that if need be.  I would prefer not to, since it
> would break existing formulas, but there are not so many of them.
> 
>> If there are no lexical points that trigger a label interpretation, then
>> the next best thing is to construct a parser rule that accumulates label
>> components:
>>
>> label : WORD ( { checkNoSpace() }?=> labelstuff )* ;
> 
> I have been wondering what this => in some grammars is.  Where can I
> read about it?
> 
>> labelstuff
>>        : WORD | DOT | UNDERSCORE | BANG | keywords ... ;
>>
>> Then build the text of the label from the text of the individual tokens
>> and rewrite as a LABEL for the AST.
>>
>> Can't be any more specific without knowing what you are trying to parse.
>> You usually have to look for specific solutions for your DSL when you
>> get into this stuff, as it usually means the language design was weak in
>> the first place.
> 
> I suppose it is.  I think I should start using quoted strings.  But it
> is also a very educational discussion which, to me, is showing off some
> of ANTLR’s weaknesses (such as no \p{alpha} classes).

Maybe someone can give some general suggestion on how to tackle this
issue?  I think I should make a more general ID lexer rule and indeed
check whether it is a proper first or second order variable in the
parser rules instead of having separate lexer rules for that.
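
A minimal sketch of that idea, in plain Python rather than ANTLR, might
look like the following.  It only illustrates "lex generically, classify
later".  The case-based convention (lowercase start = first order,
uppercase start = second order) and the names tokenize and
classify_variable are assumptions made for illustration, not taken from
the MonaSearch spec.

```python
import re

# One permissive token rule: a parenthesis, or any run of characters that
# is neither whitespace nor a parenthesis. Python str patterns are
# Unicode-aware, so words in any script are accepted.
TOKEN = re.compile(r"[()]|[^\s()]+")


def tokenize(text):
    """Split text into '(' / ')' and generic ID-like chunks."""
    return TOKEN.findall(text)


def classify_variable(name):
    """Decide the variable kind after lexing.

    Hypothetical convention, assumed for this sketch only:
    lowercase first letter -> first order, uppercase -> second order.
    """
    first = name[:1]
    if first.islower():
        return "first-order"
    if first.isupper():
        return "second-order"
    raise ValueError(f"not a variable: {name!r}")


tokens = tokenize("(word x 你好)")
kind = classify_variable(tokens[2])
```

The lexer side stays trivial because it no longer has to tell the kinds
of identifier apart.  That decision moves to the place where the
surrounding context is known.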

More suggestions?  And what is => really?
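
For comparison, here is a rough sketch of why the "anything up to the
closing parenthesis" label is unproblematic once the position inside the
form is known, which is roughly what the trigger-point suggestion
amounts to.  Again this is Python, not ANTLR.  The pattern FORM and the
function scan_form are hypothetical names, and the sketch assumes
exactly one space between the three parts.

```python
import re

# '(' keyword ' ' variable ' ' label ')'
# The label reuses the regex from the discussion, [^ ][^)]*, except that
# its first character may not be ')' either, so an empty label is rejected.
FORM = re.compile(r"\(([^\s()]+) ([^\s()]+) ([^ )][^)]*)\)")


def scan_form(text):
    """Return (keyword, variable, label) for one '(keyword variable label)'."""
    match = FORM.fullmatch(text)
    if match is None:
        raise ValueError(f"expected (keyword variable label), got {text!r}")
    return match.groups()


keyword, variable, label = scan_form("(word x a-b_c.$ d)")
```

Because the label pattern is only tried after the keyword and the
variable have been consumed, it cannot swallow the leading (word the
way a context-free LABEL lexer rule does.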

H.
-- 
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
