[antlr-interest] Question about lexer/parser boundaries

Jim Idle jimi at temporal-wave.com
Mon Jun 4 14:48:17 PDT 2007


The following grammars use the same simple Extended Backus-Naur Form
(EBNF) notation as [XML 1.0] with the following minor differences. The
notation "< ... >" is used to indicate a grouping of terminals that
together may help disambiguate the individual symbols. 
...

A concrete example of one of the rules is:

[5]  SimpleForClause ::= <"for" "$"> VarName "in" ExprSingle
                         ("," "$" VarName "in" ExprSingle)*

(The VarName and ExprSingle productions are defined in the spec at
http://www.w3.org/TR/xquery-xpath-parsing/ .)

<"for" "$"> absolutely does not mean literal character left and right
angle brackets, they denote that 'for $' should be treated as a single
unit effectively (and thus I define a lexer rule that matches occurances
of 'for' '$' with a single token.)
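Concretely, the kind of lexer rule I mean is something like the sketch
below (the token name FOR_DOLLAR is just my own invention, and the
whitespace handling is an assumption about how it would have to be
written):

    // Sketch of a combined lexer rule: 'for' and '$' matched as one token.
    // Note it has to consume any whitespace between the two pieces itself.
    FOR_DOLLAR : 'for' (' ' | '\t' | '\r' | '\n')* '$' ;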

OK. I see what they are getting at, but I don't think that would be
necessary for ANTLR; ANTLR probably doesn't care too much, and I don't
think you should worry about it. I think the <> is really indicating a
place where it might be easiest in one tool to return a single token,
and easier in another to use a predicate of some sort. This isn't really
to do with LL(*) or anything like that; it is just disambiguation. For
your 'for' above, if you want it to be a single token then your lexer
rule has to deal with whitespace and so on between the two pieces. I
suspect that those 'hints' are actually leading you astray a bit, to be
honest, and you might want to start by NOT doing that, then see if you
can make any improvements with such tricks AFTER you have it parsing. I
think they are not telling you to treat 'for' '$' as a single token, but
that 'for' followed by '$' will always mean a simple for clause.
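Just to sketch what I mean by keeping them separate (the rule names are
mine, and varName / exprSingle stand in for whatever rules you end up
with; this assumes a combined grammar where the literals become implicit
tokens), the parser rule can mirror production [5] directly:

    // Sketch only: 'for' and '$' stay separate tokens; the parser rule
    // does the combining that the <"for" "$"> hint is pointing at.
    simpleForClause
        : 'for' '$' varName 'in' exprSingle
          (',' '$' varName 'in' exprSingle)*
        ;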

Thanks for your assessment that the other examples are fine. Can you or
Terence comment on some definite hard-and-fast technique for assessing
whether a rule should be a lexer rule or a parser rule?

I don't think there is one, really. Parsing is quite often a case of
what works best with the tool you are using, because virtually no
languages are 'easy'. But I usually find that the best thing to do is to
keep the lexer as simple as possible, with no compound tokens for things
that I can do in the parser.
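As a generic illustration of that (nothing to do with XQuery
specifically, and the rule names are just my example), I would rather
have the lexer hand back plain identifier tokens and let a parser rule
assemble the compound form:

    // Lexer stays simple...
    ID : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')* ;

    // ...and the compound construct lives in the parser.
    qualifiedName : ID ('.' ID)* ;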

The most frustrating thing is that I have a grammar that the latest
ANTLRWorks declares as having "no grammar errors", but it blows up with
an out-of-memory error anyway.

Well - there might not be any grammar errors, but that does not mean you
have a CORRECT recognizer :-)

It is frustrating that ANTLR/ANTLRWorks doesn't appear to flag all
possible problems, which makes it very hard to debug a complex grammar
translation.

Well, it is a difficult thing; the set of all possible problems may well
be infinite. I think you might post your full grammar and lexer to the
list so we can give some hints and see where you are going. Also, this
version of ANTLR 3 is written in ANTLR 2, which means Ter did not spend
ages of time on error reporting (though really that means syntax errors
in the grammar itself). But a tool such as ANTLRWorks will let you debug
the grammar even though it cannot tell you whether it is suitable for
recognizing the input set you had in mind, so you should be able to find
out where it is going wrong by debugging it, should you not?

Perhaps you are expecting too much from the analysis of your grammar
(and perhaps you are not), but we need to see your grammar to see what
is happening, I guess.

Jim


