[antlr-interest] Struggling with recursion error

Sat Dec 12 16:51:40 PST 2009

On Sat, Dec 12, 2009 at 7:32 PM, Jim Idle <jimi at temporal-wave.com> wrote:
>
>
>> -----Original Message-----
>> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
>> bounces at antlr.org] On Behalf Of André Næss
>> Sent: Saturday, December 12, 2009 8:23 AM
>> To: antlr-interest at antlr.org
>> Subject: [antlr-interest] Struggling with recursion error
>>
>> Hi
>>
>> I've been trying for quite some time now to put together a minimal toy
>> parser, but I'm completely stuck on this recursion error that I don't
>> understand.
>>
>> The parser I'm trying to write is a simple parser for a language that
>> has only let-bindings, if-then-else, integers, variables, and function
>> application. The problem is the latter. A function application
>> typically looks like this:
>>
>>    foo a b 10
>>
>> And I need this to parse as (foo a b 10), not (foo (a (b 10))) (I want
>> to be able to pass functions as values).
>>
>> After some back and forth I figured that by making the topmost expr
>> match either a function application or any other expression, where
>> function applications all begin with ID, I should get the correct
>> parse. And indeed, I do, with backtrack on. If I turn off backtrack,
>> the whole thing fails with
>
> Yes - this isn't the way to do it, you need to combine your ID and function rules in some way. I think that what you are lacking is understanding of LL(n) parsing and how the ANTLR rule layout relates to this. The things in your top level expression have the lowest precedence and the things down at the end of the chain (your atoms) have the highest. When seeing an ID your current rules will pick the higher precedence, which is your atom. So, you need to left factor those as they are ambiguous.
>
> However a bigger problem is that you have a completely ambiguous language, which would only work if this was a runtime interpreted script as without parenthesis to delimit the function arguments, you cannot know what is an ID and what is a function. So, we can make it parse as foo a b 10, but once you see ID, you will consume everything else as arguments to your function up to the ';'. Are you trying to design a language or write something to parse an existing language? The nearest I can see to something like this is VBScript, which was horrendous to write a parser for and the only way to do it was using global backtracking. However, even VBScript is not this ambiguous.

I'm playing around with my own language, but right now it's just
sketches. The "program" rule with semi-colon separate expressions is
just for testing. My language will not have statements (like most
functional languages), and thus all the magic happens in expressions,
and therefore I'm currently focusing on them. And, making function
application greedy for arguments is exactly the behavior I want.

> Designing a language is a difficult thing and you will need a lot more experience before being able to do this well. However, as there are so many horrible languages out there already now, then why not one that is completely logically invalid ;-)

Hehe, it's hard, that's for sure. All the more fun :)

> So, if we rework your example then you will find you get:
>
> [10:21:11] error(211): T.g:25:12: [fatal] rule atomExpr has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2.  Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
> [10:21:11] error(211): T.g:25:8: [fatal] rule atomExpr has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2.  Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
>
> Because there is no way to decide between continuing to consume arguments or ending the list of them. Generally the dynamic languages that think it is good to lose things like ',' and '(' ')' just create huge ambiguities. I have never really understood why people think it such a pain to type ';' at the end of the statement - productivity gained through compiler errors and warnings far outweighs any 'pain' of typing ';'.

Well, now I just have to tell a little story of how semicolons can
cause a lot of pain, it happened to me recently. It begins with the
horror that is T-SQL, which is one of my day-job languages. The other
day I started working with Service Broker, a queue technology built
into SQL Server. It's great stuff, but patching it into T-SQL
apparently wasn't entirely painless. To send messages to the queue you
use SEND, e.g.:

SEND ON CONVERSATION ...

Now, I had written a simple procedure to send a message, and I wanted
to try it. But it didn't compile, complaining about some error at a
line that completely didn't make sense.

I spent god knows how much time on this, and then accidentally bumped
into a blog post where I noticed something strange. The writer always
used a semicolons before his SEND statements, as in ";SEND ON
CONVERSATION". And then I noticed a comment saying that the ; was
there because of some issues with the T-SQL parser. I guess that in
T-SQL semicolons are usually optional, but in the case of SEND,
whatever comes before it must apparently end with a semicolon.
Wonderful.

Anyway, I care about syntax, otherwise I would have chosen
S-Expressions as the "syntax" for my language, and therefore I don't
want to pollute the syntax with unnecessary parens and commas and
semi-colons. It may not be a big deal in practice, I guess it's one of
those areas where programmers have very different views, but it's
certainly useful in learning Antlr because it seems to give me a few
extra challenges that require digging a little deeper.

But back to my problem. With a little guidance from you and John it
looks like I've nailed it. I just had to get over the idea of having
completely arbitrary expressions as arguments, and instead only
allowing atom expressions like literals or IDs as arguments, anything
else must be wrapped in parens. Makes more sense in practice anyway.
Also, as you noticed I didn't quite understand the precedence, leading
me to pull the function application all the way up to the root
expression, I've now moved it down so it has the highest precedence.
The final grammar is included below. The only thing I'm not perfectly
happy with is the semantic predicate in the funcAppl rule. It is
necessary to stop Antlr from emitting a warning, but it does seem like
Antlr does exactly what I want without the predicate. I understand the
warning, as Antlr can match both the ID at the start of funcAppl and
at the start of atomExpr. But since it all works just fine without the
predicate it's tempting to lose it and and presumably save some
resources, but I'd like to know for sure.

Thanks for your help :)

grammar Exprs;

options {
    output=AST;
	//backtrack=true;
}

tokens {
	FUNC_APPL;
	VAR_REF;
	LET;
	IFELSE;
}

program
	:	expr (';' expr)*
	;

expr:	ifElseExpr
	|	letExpr
	|	addExpr
	;

addExpr
	:	mulExpr (('+'|'-')^ mulExpr)*
	;

mulExpr
	:	funcAppl (('*'|'/')^ funcAppl)*
	;

funcAppl
	:	(ID atomExpr) => ID atomExpr* -> ^(FUNC_APPL ID atomExpr*)
	|	atomExpr
	;

atomExpr
	:	INT
	|	ID -> ^(VAR_REF ID)
	|	'('! expr^ ')'!
	;

letExpr
	:	'let' letBinding (',' letBinding)* 'in' expr -> ^(LET letBinding+ expr)
	;

letBinding
	:	(ID^ '='! expr)
	;

ifElseExpr
	:	'if' cond=expr 'then' thenPart=expr 'else' elsePart=expr ->
^(IFELSE $cond $thenPart $elsePart)
	;