[antlr-interest] Full error reporting with the simplest grammar
Curtis Clauson
NOSPAM at TheSnakePitDev.com
Wed Nov 7 17:29:46 PST 2007
I am trying to create what should be the simplest grammar possible (with
EOF) that provides full error reporting, since a grammar without full
error reporting is worthless. I have solved some undocumented
peculiarities, but I do not know if I have done so in a manner
consistent with the intended design of AntLR. I describe 3 problems
below, the first two with possible solutions and the third still unsolved.
What I need to know is:
1 Are the solutions I used for problems 1 and 2 consistent with the
intended design of AntLR, or are there better ways?
2 How do I deal with problem 3?
The grammar is:
----------
grammar SingleCharacter;
@header {
import static java.lang.System.out;
}
/* Parser Rules */
singleCharacter
returns [boolean succeeded = false]
: Character {
out.println(
"Parsed token Character '" + $Character.text + "'"
);
}
EOF {
out.println("Parsed token EOF");
}
{
out.println("Parsed singleCharacter");
$succeeded = true;
}
;
/* Lexer Rules */
Character: 'a';
// Invalid added for problem #2
Invalid : . {$type = Token.INVALID_TOKEN_TYPE;};
----------
Problem #1
There is a serious bug in the lexer that causes it, during error
recovery, to skip two characters instead of just the unexpected
character. When a character is not matched, match() creates an exception
object 'mte', calls recover(mte) which consumes the unexpected
character, and then throws the exception. However, nextToken() catches
that exception, reports the error, and then calls recover(mte) again,
erroneously consuming the token after the already consumed unexpected token.
In my simple example, the source "ba" produces the following results:
Note: The Invalid token did not exist at this point.
<<
Parsing: "ba"
Token stream
<No tokens>
Parser output
line 1:0 mismatched character 'b' expecting 'a'
BR.recoverFromMismatchedToken
line 0:-1 mismatched input '<EOF>' expecting Character
Returned false
>>
The lexer does not provide any tokens to the token stream.
This is what actually happens in the lexer:
1 A call to match('a') is is made by mCharacter()
2 The first character seen is an invalid 'b'
2 A MismatchedTokenException object is created
4 The recover(mte) method is called that consumes the current invalid
character 'b'
4 The exception object is thrown
5 The exception is caught by nextToken()
6 A call to reportError(re) is made that displays the lexer error
7 Another call to recover(re) is made that consumes the next
character, the valid character 'a'
8 nextToken() loops back to try to get a token again, sees EOF, and
returns Token.EOF_TOKEN
9 The parser has no tokens in the stream and reports it saw <EOF>
Given the flow of the code and the use of nextToken(), it seems the
solution is to eliminate the call to recover(re) in the exception
handler of nextToken(). It works fine for my simple example, but I'm not
sure if this is consistent with the intended design of AntLR.
----------
Problem #2
I originally did not have the "Invalid" token. I quickly discovered that
CommonTokenStream parses *ALL* of the tokens in the source on the first
call for a token. I had expected that it would only buffer the consumed
and look-ahead tokens so that any lexer exception would be caught by the
parser and could be reported in parser context. I do not understand and
it is undocumented for what purpose CommonTokenStream does this.
I also found that the lexer is not capable of propagating a
RecognitionException since nextToken has a catch hard-coded in. There
seems to be no way in the grammar to configure this outside of
overriding the nextToken() method.
The idea is to use AntLR "as is" as much as possible, so implementing a
new TokenStream that only buffers consumed or look-ahead tokens is out.
It seems that the only solution is to record lexer errors in the token
stream as <invalid> tokens. This could be done by defining <invalid>
tokens in the grammar with a type assigning action, or overriding the
nextToken() method and altering the exception handler.
Both of these solutions work for my simple example. Also, since they
both eliminate the possibility of a RecognitionException being thrown,
they mask problem #1.
I chose the grammar solution for this example, but it might not be
possible for a more complex lexer, and I do not know if this is a
solution that is consistent with the intended design.
----------
Problem #3
For this problem, I do not have a solution, and it is a show-stopper.
The parser, when it sees an unexpected token, reports the error and
tries two kinds of recovery. If the following token is of the expected
type, it consumes the current token and returns a match, skipping the
unexpected token. If the current token can follow the expected token, it
returns a match and does not consume a token, continuing with the
current token as if it had seen the missing token.
This means that match() might match the next or a non-existing token
instead of the current. However, when a rule action uses a token
reference, like in my grammar where the action in singleCharacter that
follows Character uses the $Character.text field to report the string
that was matched, that reference is obtained by AntLR before the call to
match() from the input stream like so:
Character1=(Token)input.LT(1);
match(input,Character,FOLLOW_Character_in_singleCharacter33);
out.println("Parsed token Character '" + Character1.getText() + "'");
If the call to match() performs error recovery, that token reference
will *NOT* be the one that matched. The matched token might be the next
token if the current one was skipped, or might be no token if the
current one was a valid follow token.
Since the code that gets the reference is automatically generated by
AntLR, and since match() does not record or return the matched token, I
see no way to tell AntLR to handle this correctly.
In my simple example, if the source text is "ba", I get the following
result:
<<
Parsing: "ba"
Token stream
0; channel[0] <invalid>, "b" at 1:0
1; channel[0] Character, "a" at 1:1
Parser output
BR.recoverFromMismatchedToken
line 1:0 mismatched input 'b' expecting Character
Parsed token Character 'b'
Parsed token EOF
Parsed singleCharacter
Returned true
>>
The problem is demonstrated by the first action output:
1 The <invalid> 'b' token is parsed
2 Error recovery is performed that skips the current token since the
next is the expected Character 'a'
3 The Character match() returns successfully
4 The Character action prints the result
However, the $Character reference is still the skipped <invalid> 'b'
token instead of the now current Character 'a' token. So the message
says it "Parsed token Character 'b'" instead of "Parsed token Character
'a'".
As this problem creates incorrect data within the action, it is data
corruption and a show-stopper in any environment. Is there a way to deal
with this in AntLR, or is this an unresolved bug?
More information about the antlr-interest
mailing list