[antlr-interest] Noob Question

Tue Jan 12 18:32:50 PST 2010

JOHN!

THANK YOU! You don't know how long I've been struggling with this - and now
that you explain it, it makes perfect sense!

I will heed your warning about * and ? - I see how they match empty strings
now.

Thanks,
Nik

On Tue, Jan 12, 2010 at 9:21 PM, John B. Brodie <jbb at acm.org> wrote:

> Greetings!
>
> Your WS lexer rule can recognize the empty string, and this is VERY bad.
>
> Because WS can recognize the empty string, your lexer will enter an
> infinite loop when it encounters a character it cannot deal with, like
> the '_' in your example. You have no lexer rule that can handle a '_'.
>
> More below...
>
> On Tue, 2010-01-12 at 20:52 -0500, Nik Molnar wrote:
> > Hello all,
> >
> > I am rather new to ANTLR and seem to be running into a small issue I
> > can't figure out.
> >
> > I'm writing a very simple grammar based on many tutorials online, the
> > calculator.
> >
> > This grammar generates C# code that compiles perfectly, and works for the
> > most part in ANTLRWorks Interpreter, Debugger and in a sample app I made
> > in .NET to call the generated Parser/Lexer.
> >
> > The problem I run into is when I put in invalid syntax, expecting an
> > error. Output like so:
> >
> > Valid Syntax: "3+3" => Works in interpreter, debugger and compiled .NET
> > code.
> > Invalid Syntax: "3+/3" => Gives error in interpreter, debugger and
> > compiled .NET code, as expected.
> > Invalid Syntax: "3_3" => The interpreter shows nothing, the debugger
> > cannot connect and the .NET code hangs for a while then throws an out of
> > memory exception.
>
> Your lexer will correctly identify the first '3' as an INT. Next your
> lexer will see the '_' which it is unable to deal with. BUT since your
> WS rule says that the empty string - the non-stuff between the first '3'
> and the '_' - is legal, your lexer accepts that empty string as a WS
> token and deposits it into the HIDDEN channel. Now the lexer is still
> looking at the '_' which it is unable to deal with. BUT since your WS
> rule says that the empty string - the non-stuff between the first '3'
> and the '_' - is legal, your lexer accepts that empty string as a WS
> token and deposits it into the HIDDEN channel. Now the lexer is still
> looking at the '_' which it is unable to deal with. BUT since your WS
> rule says that the empty string - the non-stuff between the first '3'
> and the '_' - is legal, your lexer accepts that empty string as a WS
> token and deposits it into the HIDDEN channel. Now the lexer is still
> looking at the '_' .... and so nothing good results.
>
> Your .NET app runs out of memory because the infinite sequence of empty
> WS tokens appended onto the HIDDEN channel just gobbles up all memory.
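
To make the loop described above concrete, here is a rough Python sketch
of a toy lexer. It is not the code ANTLR generates. The names make_rules
and tokenize, the first-match rule order, and the max_tokens cap are all
assumptions made for illustration; the cap exists only so the demo stops
instead of eating memory the way the real lexer does.

```python
import re

# Toy token rules as (name, regex) pairs. These are hypothetical stand-ins
# for the grammar's lexer rules, not ANTLR's generated code.
def make_rules(ws_pattern):
    return [
        ("INT", re.compile(r"[0-9]+")),
        ("PLUS", re.compile(r"\+")),
        ("DASH", re.compile(r"-")),
        ("STAR", re.compile(r"\*")),
        ("SLASH", re.compile(r"/")),
        ("WS", re.compile(ws_pattern)),
    ]

def tokenize(text, rules, max_tokens=20):
    """Return (tokens, status).

    max_tokens is an artificial guard so this demo terminates; the lexer
    discussed in the thread has no such guard and keeps allocating tokens.
    """
    tokens, pos = [], 0
    while pos < len(text):
        if len(tokens) >= max_tokens:
            return tokens, "stuck"
        for name, rx in rules:
            m = rx.match(text, pos)
            if m is not None:
                # An empty match still counts as a match, but pos stays put.
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            # No rule matched at all: report the offending character.
            return tokens, f"error at {pos}: {text[pos]!r}"
    return tokens, "ok"

bad_ws = make_rules(r"[ \t\r\n]*")   # WS written with *, as in the grammar
good_ws = make_rules(r"[ \t\r\n]+")  # WS written with +

print(tokenize("3_3", bad_ws)[1])    # stuck
print(tokenize("3_3", good_ws)[1])   # error at 1: '_'
```

With the * version, every token after the first INT is an empty WS token
and the position never moves past the '_'. With the + version, no rule
matches at the '_' and the toy lexer can report an error instead.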
>
> The debugger cannot connect because the connection happens after the
> lexer has finished tokenizing the input text. Your lexer never finishes,
> so the debugger won't connect. I bet if you waited long enough you would
> eventually run out of memory in this case too.
>
> Same drill for the interpreter....
>
> >
> > I'm sure I'm doing something wrong in my grammar but don't know what.
> >
> > I've included it below. Please help me!
> >
> > Thanks,
> >
> > grammar Test;
> >
> > /*options
> > {
> > language = 'CSharp2';
> > }*/
> >
> > expression
> >     : amExpression;
> >
> > amExpression
> >     :mdExpression ((PLUS|DASH) mdExpression)*
> >     ;
> >
> > mdExpression
> >     :INT ((STAR|SLASH) INT)*
> >     ;
> >
> > DASH
> >     :'-'
> >     ;
> >
> > SLASH
> >     :'/'
> >     ;
> >
> > WS
> >     : (' '
> >     | '\t'
> >     | '\n'
> >     | '\r')*
> >     { $channel = HIDDEN; }
> >     ;
>
> The * above should really be a +.
>
> Be VERY careful with rules that can recognize the empty string, e.g.
> rules that have just a * or ? operator.
>
> I have NEVER found an instance where a lexer rule that accepts nothing
> (the empty string) does anything that helps.
>
> On RARE occasions, a parser rule that accepts the empty string can be
> appropriate, but needs to be examined VERY closely.
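
One quick way to spot rules like this is to ask whether the rule's
pattern accepts the empty string. The sketch below does that in Python,
using regular expressions as stand-ins for lexer rules. The helper
matches_empty and the rule names in the table (including SIGN_opt) are
hypothetical and are not part of ANTLR or of the grammar in this thread.

```python
import re

def matches_empty(pattern):
    """True if the regex (a stand-in for a lexer rule) accepts ''."""
    return re.fullmatch(pattern, "") is not None

# Hypothetical rule table: regex approximations of lexer rules.
rules = {
    "WS_star": r"[ \t\r\n]*",   # body under just a * operator
    "WS_plus": r"[ \t\r\n]+",   # same rule written with +
    "SIGN_opt": r"-?",          # body under just a ? operator
    "INT": r"[0-9]+",
}

nullable = sorted(name for name, rx in rules.items() if matches_empty(rx))
print(nullable)  # ['SIGN_opt', 'WS_star']
```

Only the rules written with just * or ? are flagged; the + versions
require at least one character and so can never produce an empty token.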
>
> >
> > STAR
> >     : '*'
> >     ;
> >
> > PLUS
> >     : '+'
> >     ;
> >
> > fragment DIGIT
> >     : '0'..'9'
> >     ;
> >
> > INT
> >     : (DIGIT)+
> >     ;
>
> Hope this helps...
>    -jbb