[antlr-interest] Ignoring syntax errors

Fri Nov 24 11:42:51 PST 2006

>
> > INVALID_CHARACTER: '\u0001'..'\uFFFE';
> >
> Isn't the range of INVALID_CHARACTER too big? Doesn't it cover almost
> every Unicode character? AFAIK ASCII ranges from '\u0000' to '\u007F'.
> When INVALID_CHARACTER and the alphabet of your tokens overlap, that
> should cause nondeterminism.

That's right. I include the entire character set because some ASCII
characters are, in fact, invalid in my grammar. I don't think I should
have to spell out every invalid ASCII character explicitly (what if I
forget one?). Because INVALID_CHARACTER is the last rule, it catches all
invalid characters (but no valid characters).

Anyway, I have a solution. I found out that when an invalid character is
encountered, the Lexer throws an exception that the parser does not catch,
so I just have to catch it outside the parser.
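
For anyone reading this in the archive, here is a minimal sketch of that
arrangement. It does not use ANTLR at all: the Lexer, Parser,
LexerException and SyntaxException classes below are hypothetical
stand-ins, and the toy grammar (numbers separated by '+') is only an
assumption for illustration. If I remember correctly, in ANTLR 2 the
corresponding types are TokenStreamException for the lexer and
RecognitionException for the parser, but treat that as unverified. The
point is only that the parser handles its own syntax errors while the
lexer's exception passes straight through to the caller.

```java
import java.util.ArrayList;
import java.util.List;

public class CatchOutsideParser {
    // Hypothetical exception types, standing in for the lexer's and parser's own.
    static class LexerException extends Exception {
        LexerException(String msg) { super(msg); }
    }
    static class SyntaxException extends Exception {
        SyntaxException(String msg) { super(msg); }
    }

    // Toy lexer: digits and '+' are valid, spaces are skipped, anything else is invalid.
    static class Lexer {
        private final String input;
        private int pos = 0;
        Lexer(String input) { this.input = input; }

        // Returns "+", a run of digits, or null at end of input.
        String nextToken() throws LexerException {
            while (pos < input.length() && input.charAt(pos) == ' ') pos++;
            if (pos >= input.length()) return null;
            char c = input.charAt(pos);
            if (c == '+') { pos++; return "+"; }
            if (c >= '0' && c <= '9') {
                int start = pos;
                while (pos < input.length()
                        && input.charAt(pos) >= '0' && input.charAt(pos) <= '9') pos++;
                return input.substring(start, pos);
            }
            throw new LexerException("invalid character at offset " + pos);
        }
    }

    // Toy parser for: sum : NUMBER ('+' NUMBER)* ;
    static class Parser {
        private final Lexer lexer;
        final List<String> errors = new ArrayList<>();
        Parser(Lexer lexer) { this.lexer = lexer; }

        // Syntax errors are caught and recorded here; LexerException is NOT caught.
        void sum() throws LexerException {
            try {
                number(lexer.nextToken());
                String tok;
                while ((tok = lexer.nextToken()) != null) {
                    if (!tok.equals("+")) throw new SyntaxException("expected '+'");
                    number(lexer.nextToken());
                }
            } catch (SyntaxException e) {
                errors.add(e.getMessage());
            }
        }

        private static void number(String tok) throws SyntaxException {
            if (tok == null || tok.equals("+")) throw new SyntaxException("expected a number");
        }
    }

    // The caller is the only place where the lexer's exception is handled.
    static String run(String input) {
        Parser parser = new Parser(new Lexer(input));
        try {
            parser.sum();
        } catch (LexerException e) {
            return "lexer error: " + e.getMessage();
        }
        return parser.errors.isEmpty() ? "ok" : "syntax error: " + parser.errors.get(0);
    }

    public static void main(String[] args) {
        System.out.println(run("1 + 2"));
        System.out.println(run("1 + + 2"));
        System.out.println(run("1 + $"));
    }
}
```

The three calls in main exercise the valid case, a syntax error handled
inside the parser, and an invalid character caught outside it.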

> Hello David,
>
> when your input is not syntactically correct according to your grammar,
> you will get a RecognitionException thrown by your parser anyway, so why
> not use this mechanism? If you find semantic errors, you can just throw
> your own SemanticException.

This is just FYI because I no longer need help.

The challenge is that semantic errors are detected in parallel with
parsing. For example, an "unknown symbol" error might occur when the first
token is examined, and this halts parsing: the AST is more or less designed
so that it cannot be semantically invalid, so parsing could not continue
unless I added extra code everywhere to handle a special "error" state.
Unfortunately, if parsing is halted, the parser might never discover that
there is also a syntax error. That is why I pre-recognise it:

>> real_starting_rule returns[AST u=null]:
>>     (    (starting_rule)=> u=starting_rule
>>     |   (.)*
>>     );
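
The same idea can be sketched outside ANTLR as two passes: a syntax-only
pass with no semantic actions, followed by the real pass that may halt on
a semantic error. Everything below is a hypothetical illustration, not my
actual grammar: the PreRecognize class, the KNOWN symbol table and the toy
rule (names separated by commas) are all assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PreRecognize {
    // Hypothetical symbol table for the semantic pass.
    static final Set<String> KNOWN = new HashSet<>(Arrays.asList("a", "b"));

    // Pass 1: syntax only, no semantic actions.
    // Toy grammar: list : NAME (',' NAME)* ;
    static boolean isSyntacticallyValid(String... tokens) {
        if (tokens.length % 2 == 0) return false; // empty, or ends on a comma
        for (int i = 0; i < tokens.length; i++) {
            boolean expectName = (i % 2 == 0);
            boolean ok = expectName ? tokens[i].matches("[A-Za-z_]+")
                                    : tokens[i].equals(",");
            if (!ok) return false;
        }
        return true;
    }

    // Pass 2: semantic checks; halts at the first unknown symbol.
    static void build(String... tokens) {
        for (int i = 0; i < tokens.length; i += 2) {
            if (!KNOWN.contains(tokens[i])) {
                throw new IllegalStateException("unknown symbol: " + tokens[i]);
            }
        }
    }

    // Syntax is checked over the whole input before any semantic work starts,
    // so an early semantic error cannot hide a later syntax error.
    static String parse(String... tokens) {
        if (!isSyntacticallyValid(tokens)) return "syntax error";
        try {
            build(tokens);
            return "ok";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("a", ",", "b"));
        System.out.println(parse("x", ",", "a"));
        System.out.println(parse("x", ",", ","));
    }
}
```

In the last call the first token is an unknown symbol, but the syntax
error further along is what gets reported, which is the behaviour the
predicate is meant to give.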