[antlr-interest] Noob Question

John B. Brodie jbb at acm.org
Tue Jan 12 18:21:03 PST 2010


Greetings!

Your WS lexer rule can recognize the empty string; this is VERY bad.

Because WS can recognize the empty string, your lexer will enter an
infinite loop when it encounters a character it cannot deal with - like
the '_' in your example. You have no lexer rule that can handle a '_'.

More below...

On Tue, 2010-01-12 at 20:52 -0500, Nik Molnar wrote:
> Hello all,
> 
> I am rather new to ANTLR and seem to be running into a small issue I can't
> figure out.
> 
> I'm writing a very simple grammar based on many tutorials online, the
> calculator.
> 
> This grammar generates C# code that compiles perfectly, and works for the
> most part in ANTLRWorks Interpreter, Debugger and in a sample app I made in
> .NET to call the generated Parser/Lexer.
> 
> The problem I run into is when I put in invalid syntax, expecting an error.
> Output like so:
> 
> Valid Syntax: "3+3" => Works in interpreter, debugger and compiled .net
> code.
> Invalid Syntax: "3+/3" => Gives error in interpreter, debugger and compiled
> .net code, as expected.
> Invalid Syntax: "3_3" => The interpreter shows nothing, the debugger cannot
> connect and the .net code hangs for a while then throws an out of memory
> exception.

Your lexer will correctly identify the first '3' as an INT. Next your
lexer will see the '_' which it is unable to deal with. BUT since your
WS rule says that the empty string - the non-stuff between the first '3'
and the '_' - is legal, your lexer accepts that empty string as a WS
token and deposits it into the HIDDEN channel. Now the lexer is still
looking at the '_' which it is unable to deal with. BUT since your WS
rule says that the empty string - the non-stuff between the first '3'
and the '_' - is legal, your lexer accepts that empty string as a WS
token and deposits it into the HIDDEN channel. Now the lexer is still
looking at the '_' which it is unable to deal with. BUT since your WS
rule says that the empty string - the non-stuff between the first '3'
and the '_' - is legal, your lexer accepts that empty string as a WS
token and deposits it into the HIDDEN channel. Now the lexer is still
looking at the '_' .... and so nothing good results.

Your .NET app runs out of memory because the infinite sequence of empty
WS tokens appended onto the HIDDEN channel just gobbles up all memory.

The debugger cannot connect because the connection happens after the
lexer has finished tokenizing the input text. Your lexer never finishes,
so the debugger never connects. I bet if you waited long enough you would
eventually run out of memory in this case too.

Same drill for the interpreter....

> 
> I'm sure I'm doing something wrong in my grammar but don't know what.
> 
> I've included it below. Please help me!
> 
> Thanks,
> 
> grammar Test;
> 
> /*options
> {
> language = 'CSharp2';
> }*/
> 
> expression
>     : amExpression;
> 
> amExpression
>     :mdExpression ((PLUS|DASH) mdExpression)*
>     ;
> 
> mdExpression
>     :INT ((STAR|SLASH) INT)*
>     ;
> 
> DASH
>     :'-'
>     ;
> 
> SLASH
>     :'/'
>     ;
> 
> WS
>     : (' '
>     | '\t'
>     | '\n'
>     | '\r')*
>     { $channel = HIDDEN; }
>     ;

the * above should really be a +
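
For reference, a minimal sketch of the WS rule with that one-character
change, assuming the same ANTLR 3 syntax as the rest of your grammar.
Everything except the operator is copied from the rule quoted above:

```antlr
// WS must now consume at least one whitespace character,
// so it can no longer match the empty string in front of a '_'.
WS
    : (' '
    | '\t'
    | '\n'
    | '\r')+
    { $channel = HIDDEN; }
    ;
```

With + the rule cannot succeed without consuming input, so the lexer can
no longer produce an endless run of empty WS tokens.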

Be VERY careful with rules that can recognize the empty string, e.g.
rules whose whole body sits under just a * or ? operator.

I have NEVER found an instance where a lexer rule that accepts nothing
(the empty string) does anything that helps.

On RARE occasions, a parser rule that accepts the empty string can be
appropriate, but needs to be examined VERY closely.
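
As an illustration of that rare case, here is a hypothetical parser-rule
sketch. The rule names optionalSign and signedInt are made up for this
example and are not part of your grammar; only the PLUS, DASH and INT
tokens are:

```antlr
// Hypothetical: optionalSign may match no tokens at all.
optionalSign
    : (PLUS | DASH)?
    ;

// The caller always consumes an INT after the possibly empty sign,
// so the parser still advances through the token stream.
signedInt
    : optionalSign INT
    ;
```

Even here it is often simpler to move the ? to the call site, e.g.
(PLUS | DASH)? INT, and keep the rule itself non-empty.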

> 
> STAR
>     : '*'
>     ;
> 
> PLUS
>     : '+'
>     ;
> 
> fragment DIGIT
>     : '0'..'9'
>     ;
> 
> INT
>     : (DIGIT)+
>     ;

Hope this helps...
   -jbb



