[antlr-interest] Full error reporting with the simplest grammar

Thu Jan 10 12:09:32 PST 2008

Problem #1 Added

http://www.antlr.org:8888/browse/ANTLR-209

On Nov 7, 2007, at 5:29 PM, Curtis Clauson wrote:
> ----------
> Problem #2
>
> I originally did not have the "Invalid" token. I quickly discovered  
> that CommonTokenStream parses *ALL* of the tokens in the source on  
> the first call for a token. I had expected that it would only buffer  
> the consumed and look-ahead tokens so that any lexer exception would  
> be caught by the parser and could be reported in parser context. I  
> do not understand why CommonTokenStream does this, and its purpose  
> is undocumented.

Note: some of what you describe as undocumented is actually in the  
book ;)

> I also found that the lexer is not capable of propagating a  
> RecognitionException since nextToken has a catch hard-coded in.  
> There seems to be no way in the grammar to configure this outside of  
> overriding the nextToken() method.

That is by design, though obviously not what some of you want.  Just  
have it throw an Error instead of an exception, or some other kind of  
exception that the catch in nextToken() does not handle.
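
As a rough illustration of that suggestion, here is a minimal sketch using  
stand-in classes rather than the real ANTLR runtime. ToyLexer,  
ToyRecognitionException, LexerFailure and StrictLexer are all hypothetical  
names; the sketch only assumes that nextToken() catches the recognition  
exception and hands it to an overridable reportError() hook.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LexerErrorSketch {
    /** Stand-in for a checked recognition exception (hypothetical, not the ANTLR class). */
    static class ToyRecognitionException extends Exception {
        ToyRecognitionException(int position, char c) {
            super("unexpected character '" + c + "' at " + position);
        }
    }

    /** Unchecked Error: the catch in nextToken() below does not intercept it. */
    static class LexerFailure extends Error {
        LexerFailure(ToyRecognitionException cause) {
            super(cause.getMessage(), cause);
        }
    }

    /** Toy lexer: every letter is a token, anything else is a recognition error. */
    static class ToyLexer {
        private final String input;
        private int pos = 0;
        final List<String> reported = new ArrayList<>();

        ToyLexer(String input) { this.input = input; }

        /** Returns the next one-letter token, or null at end of input. */
        String nextToken() {
            while (pos < input.length()) {
                try {
                    return matchLetter();
                } catch (ToyRecognitionException e) {
                    reportError(e); // hard-coded catch: the error is reported...
                    pos++;          // ...and the offending character is skipped
                }
            }
            return null;
        }

        private String matchLetter() throws ToyRecognitionException {
            char c = input.charAt(pos);
            if (!Character.isLetter(c)) throw new ToyRecognitionException(pos, c);
            pos++;
            return String.valueOf(c);
        }

        /** Default behaviour: record the message and carry on. */
        void reportError(ToyRecognitionException e) { reported.add(e.getMessage()); }
    }

    /** Variant that escalates every lexer error to an unchecked Error. */
    static class StrictLexer extends ToyLexer {
        StrictLexer(String input) { super(input); }

        @Override
        void reportError(ToyRecognitionException e) { throw new LexerFailure(e); }
    }

    /** Drains a lexer into a list of token texts. */
    static List<String> lexAll(ToyLexer lexer) {
        List<String> out = new ArrayList<>();
        for (String t = lexer.nextToken(); t != null; t = lexer.nextToken()) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lexAll(new ToyLexer("a?b"))); // the '?' is silently skipped
        try {
            lexAll(new StrictLexer("a?b"));
        } catch (LexerFailure f) {
            System.out.println("propagated: " + f.getMessage());
        }
    }
}
```

With the default reportError() the bad character disappears from the token  
list; with the strict variant the failure reaches whoever asked for the  
token, which is where it could be reported in parser context.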

> The idea is to use ANTLR "as is" as much as possible, so  
> implementing a new TokenStream that only buffers consumed or  
> look-ahead tokens is out.

There is also an unbuffered stream if you take a look. There is  
absolutely no limitation in ANTLR that says it must buffer the input;  
I was just being efficient.
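
To make the difference concrete, here is a small sketch of a stream that  
pulls tokens only when look-ahead demands them. It is not the ANTLR  
unbuffered stream; LazyTokenStream and CountingSource are hypothetical  
stand-ins, and tokens are plain strings for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class LazyLookaheadSketch {
    /** Token source that counts how many tokens have been requested from it. */
    static class CountingSource implements Iterator<String> {
        private final Iterator<String> it;
        int pulled = 0;

        CountingSource(List<String> tokens) { this.it = tokens.iterator(); }

        @Override public boolean hasNext() { return it.hasNext(); }

        @Override public String next() { pulled++; return it.next(); }
    }

    /** Holds only the tokens that look-ahead has asked for and not yet consumed. */
    static class LazyTokenStream {
        static final String EOF = "<EOF>";
        private final Iterator<String> source;
        private final List<String> lookahead = new ArrayList<>();

        LazyTokenStream(Iterator<String> source) { this.source = source; }

        /** Returns the k-th token of look-ahead (1-based), pulling only as needed. */
        String LT(int k) {
            while (lookahead.size() < k && source.hasNext()) {
                lookahead.add(source.next());
            }
            return k <= lookahead.size() ? lookahead.get(k - 1) : EOF;
        }

        /** Discards the current token, if there is one. */
        void consume() {
            LT(1); // make sure the current token has been fetched
            if (!lookahead.isEmpty()) lookahead.remove(0);
        }
    }

    public static void main(String[] args) {
        CountingSource src = new CountingSource(Arrays.asList("a", "b", "c", "d"));
        LazyTokenStream stream = new LazyTokenStream(src);
        System.out.println(stream.LT(1) + " pulled=" + src.pulled);
        System.out.println(stream.LT(2) + " pulled=" + src.pulled);
        stream.consume();
        System.out.println(stream.LT(1) + " pulled=" + src.pulled);
    }
}
```

Because the source is only asked for a token when LT() needs one, an error  
raised by the source would surface at the point in the parse where that  
token is first looked at, rather than on the very first call.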

> It seems that the only solution is to record lexer errors in the  
> token stream as <invalid> tokens. This could be done by defining  
> <invalid> tokens in the grammar with a type assigning action, or  
> overriding the nextToken() method and altering the exception handler.
>
> Both of these solutions work for my simple example. Also, since they  
> both eliminate the possibility of a RecognitionException being  
> thrown, they mask problem #1.
>
> I chose the grammar solution for this example, but it might not be  
> possible for a more complex lexer, and I do not know if this is a  
> solution that is consistent with the intended design.
>
> ----------
> Problem #3
>
> For this problem, I do not have a solution, and it is a show-stopper.
>
> The parser, when it sees an unexpected token, reports the error and  
> tries two kinds of recovery. If the following token is of the  
> expected type, it consumes the current token and returns a match,  
> skipping the unexpected token. If the current token can follow the  
> expected token, it returns a match and does not consume a token,  
> continuing with the current token as if it had seen the missing token.
>
> This means that match() might match the next or a non-existing token  
> instead of the current. However, when a rule action uses a token  
> reference, like in my grammar where the action in singleCharacter  
> that follows Character uses the $Character.text field to report the  
> string that was matched, that reference is obtained by ANTLR before  
> the call to match() from the input stream like so:
>    Character1=(Token)input.LT(1);
>    match(input,Character,FOLLOW_Character_in_singleCharacter33);
>    out.println("Parsed token Character '" + Character1.getText() + "'");
>
> If the call to match() performs error recovery, that token reference  
> will *NOT* be the one that matched.

That is correct and a known issue.  For v3.1 I'm thinking about turning  
off this feature because, while it looks great in journal papers and  
shows off the recovery, actions are always screwed up by this error  
recovery in particular.  Any error recovery can cause trouble, but  
this one is particularly vexing. I believe you can simply override  
match() and so on, but I am looking at this myself at the moment.
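
For anyone following along, here is a toy model of the problem and of the  
"override match" workaround. It does not use the ANTLR runtime: ToyParser,  
StrictParser and Mismatch are hypothetical names, a token's type is simply  
its text, and only the single-token-deletion recovery is modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MatchRecoverySketch {
    /** Thrown when a token cannot be matched (hypothetical stand-in). */
    static class Mismatch extends RuntimeException {
        Mismatch(String found, String expected) {
            super("unexpected " + found + ", expecting " + expected);
        }
    }

    /** Toy parser whose match() recovers by deleting one unexpected token. */
    static class ToyParser {
        protected final List<String> tokens;
        protected int p = 0;
        final List<String> errors = new ArrayList<>();

        ToyParser(List<String> tokens) { this.tokens = tokens; }

        /** Returns the k-th token of look-ahead (1-based), or "<EOF>" past the end. */
        String LT(int k) {
            int i = p + k - 1;
            return i < tokens.size() ? tokens.get(i) : "<EOF>";
        }

        void match(String expected) {
            if (LT(1).equals(expected)) { p++; return; }
            errors.add("unexpected " + LT(1) + ", expecting " + expected);
            if (LT(2).equals(expected)) { p += 2; return; } // skip the bad token, match the next one
            throw new Mismatch(LT(1), expected);
        }

        /** Mirrors the quoted generated code: the reference is taken before match(). */
        String singleCharacter(String expected) {
            String ref = LT(1);
            match(expected);
            return ref;
        }
    }

    /** Same parser with recovery switched off: any mismatch is thrown immediately. */
    static class StrictParser extends ToyParser {
        StrictParser(List<String> tokens) { super(tokens); }

        @Override
        void match(String expected) {
            if (!LT(1).equals(expected)) throw new Mismatch(LT(1), expected);
            p++;
        }
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("x", "c");
        // With recovery, the action sees the skipped token, not the matched one.
        System.out.println("action saw: " + new ToyParser(input).singleCharacter("c"));
        try {
            new StrictParser(input).singleCharacter("c");
        } catch (Mismatch m) {
            System.out.println("strict: " + m.getMessage());
        }
    }
}
```

In the recovering parser the action receives "x" even though "c" is what  
was matched. The strict variant gives up recovery entirely, so an action  
never runs with a stale reference.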

Ter