[antlr-interest] Full error reporting with the simplest grammar

Fri Nov 9 15:44:57 PST 2007

Now I know you AntLR implementers and designers are out there because 
I've seen you submit replies while I've been hanging out trying to help 
with other questions.

I seriously need help with these three issues. Your assistance, or at 
least acknowledgment, would be greatly appreciated.
<big brown puppy-eyes>

-- Curtis

Curtis Clauson wrote:
> I am trying to create what should be the simplest grammar possible (with 
> EOF) that provides full error reporting, since a grammar without full 
> error reporting is worthless. I have solved some undocumented 
> peculiarities, but I do not know if I have done so in a manner 
> consistent with the intended design of AntLR. I describe 3 problems 
> below, the first two with possible solutions and the third still unsolved.
> 
> What I need to know is:
>   1 Are the solutions I used for problems 1 and 2 consistent with the 
> intended design of AntLR, or are there better ways?
>   2 How do I deal with problem 3?
> 
> The grammar is:
> ----------
> grammar SingleCharacter;
> 
> @header {
> import static java.lang.System.out;
> }
> 
> 
> /* Parser Rules */
> singleCharacter
> returns [boolean succeeded = false]
>     :   Character {
>             out.println(
>                 "Parsed token Character '" + $Character.text + "'"
>             );
>         }
>         EOF {
>             out.println("Parsed token EOF");
>         }
>         {
>             out.println("Parsed singleCharacter");
>             $succeeded = true;
>         }
>     ;
> 
> /* Lexer Rules */
> Character: 'a';
> // Invalid added for problem #2
> Invalid  :  .  {$type = Token.INVALID_TOKEN_TYPE;};
> 
> ----------
> Problem #1
> 
> There is a serious bug in the lexer that causes it, during error 
> recovery, to skip two characters instead of just the unexpected 
> character. When a character is not matched, match() creates an exception 
> object 'mte', calls recover(mte) which consumes the unexpected 
> character, and then throws the exception. However, nextToken() catches 
> that exception, reports the error, and then calls recover(mte) again, 
> erroneously consuming the character that follows the already consumed 
> unexpected character.
> 
> In my simple example, the source "ba" produces the following results:
> Note: The Invalid token did not exist at this point.
> <<
> Parsing: "ba"
> Token stream
>     <No tokens>
> 
> Parser output
> line 1:0 mismatched character 'b' expecting 'a'
> BR.recoverFromMismatchedToken
> line 0:-1 mismatched input '<EOF>' expecting Character
> 
> Returned false
> >>
> 
> The lexer does not provide any tokens to the token stream.
> This is what actually happens in the lexer:
>   1 A call to match('a') is made by mCharacter()
>   2 The first character seen is an invalid 'b'
>   3 A MismatchedTokenException object is created
>   4 The recover(mte) method is called, which consumes the current 
> invalid character 'b'
>   5 The exception object is thrown
>   6 The exception is caught by nextToken()
>   7 A call to reportError(re) is made that displays the lexer error
>   8 Another call to recover(re) is made that consumes the next 
> character, the valid character 'a'
>   9 nextToken() loops back to try to get a token again, sees EOF, and 
> returns Token.EOF_TOKEN
>  10 The parser has no tokens in the stream and reports it saw <EOF>
> 
> Given the flow of the code and the use of nextToken(), it seems the 
> solution is to eliminate the call to recover(re) in the exception 
> handler of nextToken(). It works fine for my simple example, but I'm not 
> sure if this is consistent with the intended design of AntLR.
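
The double consumption described in the steps above can be reproduced in a small stand-alone model. This is only a sketch of the control flow as the post describes it, not ANTLR code: the class LexerRecoveryDemo, the method tokenize() and the recoverTwice flag are hypothetical names invented for the illustration, and 'a' is assumed to be the only valid character, as in the grammar.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the lexer loop described in the post; not ANTLR code. */
final class LexerRecoveryDemo {

    /**
     * Tokenizes input in which 'a' is the only valid character.
     * recoverTwice == true models the reported behavior (one recover in
     * match(), a second one in the nextToken() catch block);
     * recoverTwice == false models the proposed fix.
     */
    static List<String> tokenize(String input, boolean recoverTwice) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == 'a') {
                tokens.add("Character(a)");
                pos++;
                continue;
            }
            // Models recover() inside match(): skip the unexpected character.
            pos++;
            // Models the second recover() in nextToken(): skip one more.
            if (recoverTwice && pos < input.length()) {
                pos++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("ba", true));   // prints []
        System.out.println(tokenize("ba", false));  // prints [Character(a)]
    }
}
```

With the second recovery step the valid 'a' is lost, matching the empty token stream in the output quoted above. Without it the model yields the single Character token. Whether removing that call is safe for other grammars is a separate question that the model does not answer.
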
> 
> ----------
> Problem #2
> 
> I originally did not have the "Invalid" token. I quickly discovered that 
> CommonTokenStream parses *ALL* of the tokens in the source on the first 
> call for a token. I had expected that it would only buffer the consumed 
> and look-ahead tokens so that any lexer exception would be caught by the 
> parser and could be reported in parser context. I do not understand why 
> CommonTokenStream does this, and its purpose is undocumented.
> 
> I also found that the lexer cannot propagate a RecognitionException, 
> since nextToken() has a hard-coded catch block. There seems to be no way 
> to configure this in the grammar short of overriding the nextToken() 
> method.
> 
> The idea is to use AntLR "as is" as much as possible, so implementing a 
> new TokenStream that only buffers consumed or look-ahead tokens is out. 
> It seems that the only solution is to record lexer errors in the token 
> stream as <invalid> tokens. This could be done by defining <invalid> 
> tokens in the grammar with a type assigning action, or overriding the 
> nextToken() method and altering the exception handler.
> 
> Both of these solutions work for my simple example. Also, since they 
> both eliminate the possibility of a RecognitionException being thrown, 
> they mask problem #1.
> 
> I chose the grammar solution for this example, but it might not be 
> possible for a more complex lexer, and I do not know if this is a 
> solution that is consistent with the intended design.
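
The idea of recording lexer errors in the token stream can be sketched without ANTLR as follows. This is a toy model under stated assumptions: InvalidTokenDemo, Tok and tokenize() are hypothetical names, INVALID = 0 merely stands in for Token.INVALID_TOKEN_TYPE, and CHARACTER = 4 is an arbitrary type number chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model: every unmatched character becomes an <invalid> token. */
final class InvalidTokenDemo {
    static final int INVALID = 0;    // stand-in for Token.INVALID_TOKEN_TYPE
    static final int CHARACTER = 4;  // hypothetical token type number

    /** Minimal token: type, text, and zero-based column on line 1. */
    static final class Tok {
        final int type;
        final String text;
        final int column;

        Tok(int type, String text, int column) {
            this.type = type;
            this.text = text;
            this.column = column;
        }
    }

    /** Emits one token per character and never throws on bad input. */
    static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            int type = (c == 'a') ? CHARACTER : INVALID;
            out.add(new Tok(type, String.valueOf(c), i));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : tokenize("ba")) {
            String name = (t.type == INVALID) ? "<invalid>" : "Character";
            System.out.println(name + ", \"" + t.text + "\" at 1:" + t.column);
        }
    }
}
```

Because the bad character survives as a token, the error is left for the parser to report in its own context, which is the effect the Invalid rule in the grammar is meant to achieve. The model says nothing about how well this scales to a lexer with multi-character tokens.
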
> 
> ----------
> Problem #3
> 
> For this problem, I do not have a solution, and it is a show-stopper.
> 
> The parser, when it sees an unexpected token, reports the error and 
> tries two kinds of recovery. If the following token is of the expected 
> type, it consumes the current token and returns a match, skipping the 
> unexpected token. If the current token can follow the expected token, it 
> returns a match and does not consume a token, continuing with the 
> current token as if it had seen the missing token.
> 
> This means that match() might match the next token, or no token at all, 
> instead of the current one. However, when a rule action uses a token 
> reference, AntLR obtains that reference from the input stream before the 
> call to match(). In my grammar, the action that follows Character in 
> singleCharacter uses $Character.text to report the matched string, and 
> the generated code looks like this:
>     Character1=(Token)input.LT(1);
>     match(input,Character,FOLLOW_Character_in_singleCharacter33);
>     out.println("Parsed token Character '" + Character1.getText() + "'");
> 
> If the call to match() performs error recovery, that token reference 
> will *NOT* be the one that matched. The matched token might be the next 
> token if the current one was skipped, or might be no token if the 
> current one was a valid follow token.
> 
> Since the code that gets the reference is automatically generated by 
> AntLR, and since match() does not record or return the matched token, I 
> see no way to tell AntLR to handle this correctly.
> 
> In my simple example, if the source text is "ba", I get the following 
> result:
> 
> <<
> Parsing: "ba"
> Token stream
>     0; channel[0] <invalid>, "b" at 1:0
>     1; channel[0] Character, "a" at 1:1
> 
> Parser output
> BR.recoverFromMismatchedToken
> line 1:0 mismatched input 'b' expecting Character
> Parsed token Character 'b'
> Parsed token EOF
> Parsed singleCharacter
> 
> Returned true
> >>
> 
> The problem is demonstrated by the first action output:
>   1 The <invalid> 'b' token is parsed
>   2 Error recovery is performed that skips the current token since the 
> next is the expected Character 'a'
>   3 The Character match() returns successfully
>   4 The Character action prints the result
> 
> However, the $Character reference is still the skipped <invalid> 'b' 
> token instead of the now current Character 'a' token. So the message 
> says it "Parsed token Character 'b'" instead of "Parsed token Character 
> 'a'".
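
The stale reference can be reproduced in a stand-alone model of a token stream. This is a sketch, not ANTLR code: StaleTokenRefDemo, parse() and the simplified match() are hypothetical, tokens are represented by their text only, and match() implements just the skip-one-token recovery described above. The LT(-1) lookback is modeled here by hand; whether the real token stream supports it in the same way is an assumption that would need checking.

```java
import java.util.Arrays;
import java.util.List;

/** Toy token stream showing a reference captured before match(). */
final class StaleTokenRefDemo {
    private final List<String> tokens;
    private int p = 0;  // index of the current token

    StaleTokenRefDemo(String... tokens) {
        this.tokens = Arrays.asList(tokens);
    }

    /** LT(1) is the current token, LT(-1) the most recently consumed one. */
    String LT(int k) {
        int i = (k > 0) ? p + k - 1 : p + k;
        return (i >= 0 && i < tokens.size()) ? tokens.get(i) : "<EOF>";
    }

    /** Matches expected, skipping one unexpected token if the next one fits. */
    void match(String expected) {
        if (LT(1).equals(expected)) {
            p++;
            return;
        }
        if (LT(2).equals(expected)) {
            p += 2;  // drop the unexpected token, then consume the expected one
            return;
        }
        throw new IllegalStateException("mismatched input " + LT(1));
    }

    /** Returns { reference captured before match(), token consumed last }. */
    static String[] parse(String... input) {
        StaleTokenRefDemo s = new StaleTokenRefDemo(input);
        String before = s.LT(1);  // what the generated code captures
        s.match("a");
        String after = s.LT(-1);  // what match() actually consumed last
        return new String[] { before, after };
    }

    public static void main(String[] args) {
        String[] r = parse("b", "a");
        System.out.println("captured before match(): " + r[0]);  // b
        System.out.println("consumed by match():     " + r[1]);  // a
    }
}
```

Reading the token after match() gives the right answer only for the skip-one-token case. When recovery pretends a missing token was present, nothing is consumed, so the lookback would return an unrelated earlier token. The sketch therefore illustrates the problem rather than a general fix.
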
> 
> As this problem creates incorrect data within the action, it is data 
> corruption and a show-stopper in any environment. Is there a way to deal 
> with this in AntLR, or is this an unresolved bug?