[antlr-interest] how to force unexpected token error

Thu Nov 13 05:04:18 PST 2003

Hi,

On Sun, Nov 09, 2003 at 03:01:33AM -0000, hawkwall wrote:
> Input:
> THREAT.CLASSES.110
> NUMBER.OF.SURFACE.TO.AIR.THREAT.CLASSES:  3
> end of Input:
>
> Parser:
> startSACLASS : (rules)+ ;
>
> rules : threatclass
>         | sathreatclass
>         ;
>
> threatclass: THREAT_CLASSES ONETEN;

Better to use NUMBER instead of ONETEN and then check here whether it's
"110". If it isn't, throw a SemanticException or whatever exception you deem
more appropriate (RecognitionException?).
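
A rough sketch of that check, in case it helps. In the grammar the action
would sit after something like "THREAT_CLASSES n:NUMBER" and look at
n.getText(). To keep the sketch compilable without antlr.jar, it uses a
stand-in exception class and a hypothetical helper (SketchSemanticException,
requireOneTen); neither name comes from ANTLR or from your grammar. In a real
action you would throw antlr.SemanticException, which is a checked exception,
unlike the stand-in here.

```java
// Stand-in for antlr.SemanticException so the sketch compiles on its own.
// Assumption: unchecked here for brevity; the ANTLR class is checked.
class SketchSemanticException extends RuntimeException {
    SketchSemanticException(String message) {
        super(message);
    }
}

class ThreatClassCheck {
    // Hypothetical helper: the body is what a parser action placed after
    // "THREAT_CLASSES n:NUMBER" could do with n.getText().
    static void requireOneTen(String numberText) {
        if (!"110".equals(numberText)) {
            throw new SketchSemanticException(
                "expected 110 but found " + numberText);
        }
    }

    public static void main(String[] args) {
        for (String text : new String[] {"110", "112"}) {
            try {
                requireOneTen(text);
                System.out.println(text + ": accepted");
            } catch (SketchSemanticException e) {
                System.out.println(text + ": rejected (" + e.getMessage() + ")");
            }
        }
    }
}
```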

> sathreatclass: SURFACE_TO_AIR COLON  NUMBER
> 	{System.out.println("Got Here");}
> 	;
> end of Parser:
>
> Lexer:
> options {
> 	k=5; // character lookahead

Not really necessary to have this much lookahead. (OK, ANTLR optimizes
excess checks away, but with bigger grammars it makes running ANTLR slower.)

> 	testLiterals=false;
> }
>
> tokens
> {
> 	THREAT_CLASSES="THREAT.CLASSES.";
> 	SURFACE_TO_AIR="NUMBER.OF.SURFACE.TO.AIR.THREAT.CLASSES";
> }
>
> IDENTIFIER options { testLiterals=true;} : (LETTER | '.')+;
> NUMBER : (DIGIT)+;
> DOT : '.' ;
> COLON : ':';
> ONETEN : ("110") => "110" ; // predicate is an attempt to remove
> // nondeterminism with NUMBER, but didn't work

I'd remove the ONETEN rule and deal with it in the parser instead. At the
very least it's kinda ugly like this ;) The rule may also interfere with
other valid uses of 110 as a number: everywhere the parser accepts a NUMBER
token, you would need an extra ONETEN alternative.

So the choice is between one NUMBER rule (and no ONETEN) with a check for
110 in a few spots in the parser, or both a NUMBER and a ONETEN rule with
(NUMBER|ONETEN) at every NUMBER occurrence in the parser. If NUMBER is
common in the rest of the grammar, the choice is obvious.

> private DIGIT : ('0'..'9') ;
> private LETTER : ('A'..'Z');
>
...
> end of Lexer:
>
> I need the parser to catch it if the input is mispelled.
> The parser complains if I change the first line to
> THREAT.CASSES.110 or THREAT.CLASSES.112
>
> It doesn't fail when I correct the first line and change the second
> line to something like
> NUMBER.OF.USRFACE.TO.AIR.THREAT.CLASSES: 3
>
> I turned on the trace, and with the incorrect input on the second
> line, it matches IDENTIFIER and
> then finished normally.  The action is never executed.  What is the
> difference?

Because there's no EOF check, the parser just concluded that the input up to
that point was valid and that it could exit (at least that's my guess).
IDENTIFIER is a valid token in your lexer, but your parser never uses it. As
a result, any misspelled keyword lexes as IDENTIFIER, and since the parser
does not require any more tokens, it simply stops once it has seen some
valid input. Having EOF at the end of the start rule is very good practice
in general (although in some rarer cases you don't want it).
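
To make that concrete, here is a small hand-written model of the loop ANTLR
generates for (rules)+, with and without a trailing EOF match. It is only a
sketch: the class and method names (StartRuleSketch, accepts, la1) are made
up, threatclass is simplified to THREAT_CLASSES NUMBER, and the token list is
typed in by hand rather than coming from a real lexer.

```java
import java.util.Arrays;
import java.util.List;

class StartRuleSketch {
    // Token types named after the grammar in this thread.
    enum Tok { THREAT_CLASSES, SURFACE_TO_AIR, IDENTIFIER, COLON, NUMBER, EOF }

    private final List<Tok> tokens;
    private int pos = 0;

    StartRuleSketch(List<Tok> tokens) {
        this.tokens = tokens;
    }

    // One token of lookahead; past the end of the list we report EOF.
    private Tok la1() {
        return pos < tokens.size() ? tokens.get(pos) : Tok.EOF;
    }

    private void match(Tok expected) {
        if (la1() != expected) {
            throw new IllegalStateException("unexpected token: " + la1());
        }
        pos++;
    }

    // rules : threatclass | sathreatclass ;
    private void rules() {
        if (la1() == Tok.THREAT_CLASSES) {
            match(Tok.THREAT_CLASSES);
            match(Tok.NUMBER);
        } else {
            match(Tok.SURFACE_TO_AIR);
            match(Tok.COLON);
            match(Tok.NUMBER);
        }
    }

    // startSACLASS : (rules)+ ;   or, with requireEof, (rules)+ EOF ;
    void startSACLASS(boolean requireEof) {
        int cnt = 0;
        while (la1() == Tok.THREAT_CLASSES || la1() == Tok.SURFACE_TO_AIR) {
            rules();
            cnt++;
        }
        if (cnt < 1) {
            throw new IllegalStateException("no viable alternative: " + la1());
        }
        // Without this match the rule returns quietly on any other token.
        if (requireEof) {
            match(Tok.EOF);
        }
    }

    // Returns true when the start rule finishes without an error.
    static boolean accepts(List<Tok> input, boolean requireEof) {
        try {
            new StartRuleSketch(input).startSACLASS(requireEof);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Second line misspelled: the lexer returns IDENTIFIER for it.
        List<Tok> misspelled = Arrays.asList(
            Tok.THREAT_CLASSES, Tok.NUMBER,
            Tok.IDENTIFIER, Tok.COLON, Tok.NUMBER);
        System.out.println("without EOF: " + accepts(misspelled, false));
        System.out.println("with EOF:    " + accepts(misspelled, true));
    }
}
```

In this model the misspelled input is accepted without the EOF match and
rejected with it, because the loop exits on IDENTIFIER after one successful
iteration and only the EOF match notices the leftover tokens.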

> Why is unexpected token given
> in the first case but not the other.  I tried setting
> defaultErrorHandler=false, but it didn't fix my problem.

defaultErrorHandler only controls whether an exception falls through to the
caller of the parser or gets caught in the rule that throws it. Just look at
the differences in the generated code.

> I tried
> putting EOF at the end of my start rule, but to no avail.  I tried to
> factor
> out the THREAT.CLASSES from the end of SURFACE_TO_AIR, also removing
> the final dot from the THREAT_CLASSES token.
> I changed the threatclass rule to :
> THREAT_CLASSES DOT ONETEN
> and then got a
> line 1:1: unexpected token: THREAT.CLASSES.
> error.
>
> I see why it is happening in the parser.  Here is the relevent java:
> public final void startSACLASS() throws RecognitionException,
> TokenStreamException {
>
> 		try {      // for error handling
....
> The problem is the if ( _cnt1421>=1).  If I remove that and make the
> code look like this:

The >= 1 is from the ()+ in the rule.

> 				if ((LA(1)==THREAT_CLASSES||LA(1)==SURFACE_TO_AIR)) {
> 					rules();
> 				}
> 				else {
> 					throw new NoViableAltException(LT(1), getFilename());
> 				}

What happens if you change the start rule to:

startSACLASS : (rules)* EOF ;

Or maybe even better remove these from the tokens section:

> 	THREAT_CLASSES="THREAT.CLASSES.";
> 	SURFACE_TO_AIR="NUMBER.OF.SURFACE.TO.AIR.THREAT.CLASSES";

And delete the IDENTIFIER rule. This is the one that makes every token
consisting of letters and dots valid, unless you catch invalid identifiers
in the parser. (That depends on the rest of your grammar; looking at the
_cnt1421, I suspect your grammar is bigger in reality than this snippet.)

and add rules:

THREAT_CLASSES : "THREAT.CLASSES.";
SURFACE_TO_AIR : "NUMBER.OF.SURFACE.TO.AIR.THREAT.CLASSES";

Then anything not matching these and the other lexer rules will bomb out.
Another option is to keep the IDENTIFIER rule and add some extra checks for
invalid IDENTIFIER tokens in the parser, or maybe override the literals
testing method of the lexer and have it bomb out (throw an exception) if no
literal is matched. Again, this depends on your complete grammar; either
solution has its advantages and its drawbacks.
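
A sketch of the "bomb out if no literal is matched" idea, written as a
stand-alone model rather than as a real lexer subclass. The class name, the
method names (lenientLookup, strictLookup) and the token type numbers are all
invented for illustration. In a generated ANTLR 2 lexer I believe the method
to override is testLiteralsTable, but check the generated code before relying
on that.

```java
import java.util.HashMap;
import java.util.Map;

class StrictLiteralsSketch {
    // Arbitrary token type numbers, chosen only for this sketch.
    static final int IDENTIFIER = 4;
    static final int THREAT_CLASSES = 5;
    static final int SURFACE_TO_AIR = 6;

    // Plays the role of the lexer's literals table.
    private final Map<String, Integer> literals = new HashMap<>();

    StrictLiteralsSketch() {
        literals.put("THREAT.CLASSES.", THREAT_CLASSES);
        literals.put("NUMBER.OF.SURFACE.TO.AIR.THREAT.CLASSES", SURFACE_TO_AIR);
    }

    // Lenient behaviour: text that is not a literal keeps the token type
    // it came in with, so a misspelled keyword stays an IDENTIFIER.
    int lenientLookup(String text, int ttype) {
        return literals.getOrDefault(text, ttype);
    }

    // Strict behaviour: text that is not a literal is an error.
    int strictLookup(String text) {
        Integer literalType = literals.get(text);
        if (literalType == null) {
            throw new IllegalArgumentException("unknown keyword: " + text);
        }
        return literalType;
    }

    public static void main(String[] args) {
        StrictLiteralsSketch table = new StrictLiteralsSketch();
        String misspelled = "NUMBER.OF.USRFACE.TO.AIR.THREAT.CLASSES";
        System.out.println("lenient type: "
            + table.lenientLookup(misspelled, IDENTIFIER));
        try {
            table.strictLookup(misspelled);
        } catch (IllegalArgumentException e) {
            System.out.println("strict: " + e.getMessage());
        }
    }
}
```

The strict variant only makes sense if every IDENTIFIER in the grammar is
supposed to be a keyword; if free-form identifiers are legal elsewhere, the
check belongs in the parser instead.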

Just for kicks, make a little executable around the lexer that calls
nextToken on it and dumps each token to stdout. Then look at the tokens the
lexer returns: it should give you more of a feel for what your parser sees,
and for which errors are generated by the parser and which ones by the lexer.
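
Something along these lines, as a sketch. Since I can't compile against your
generated lexer here, the code below uses a small hand-written tokenizer that
only approximates the IDENTIFIER (with literal testing), NUMBER and COLON
rules from your mail; DOT and ONETEN are left out and whitespace is simply
skipped, which is an assumption about the elided part of your lexer. The
names (TokenDumpSketch, tokenize, typeOfWord) are made up. With the real
lexer, the driver would instead loop on nextToken() and print each token
until its type is Token.EOF_TYPE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenDumpSketch {
    // Optional whitespace, then one alternative per modelled lexer rule:
    // IDENTIFIER (letters and dots), NUMBER (digits), COLON.
    private static final Pattern TOKEN =
        Pattern.compile("\\s*(?:([A-Z.]+)|([0-9]+)|(:))");

    // Mimics testLiterals=true: known keywords get their own token type,
    // everything else made of letters and dots stays an IDENTIFIER.
    static String typeOfWord(String text) {
        if (text.equals("THREAT.CLASSES.")) {
            return "THREAT_CLASSES";
        }
        if (text.equals("NUMBER.OF.SURFACE.TO.AIR.THREAT.CLASSES")) {
            return "SURFACE_TO_AIR";
        }
        return "IDENTIFIER";
    }

    // Returns one "TYPE[text]" entry per token, followed by "EOF".
    static List<String> tokenize(String line) {
        String input = line.trim();
        List<String> dump = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException(
                    "lexer error at offset " + pos);
            }
            if (m.group(1) != null) {
                dump.add(typeOfWord(m.group(1)) + "[" + m.group(1) + "]");
            } else if (m.group(2) != null) {
                dump.add("NUMBER[" + m.group(2) + "]");
            } else {
                dump.add("COLON[:]");
            }
            pos = m.end();
        }
        dump.add("EOF");
        return dump;
    }

    public static void main(String[] args) {
        String[] lines = {
            "THREAT.CLASSES.110",
            "NUMBER.OF.SURFACE.TO.AIR.THREAT.CLASSES:  3",
            "NUMBER.OF.USRFACE.TO.AIR.THREAT.CLASSES: 3"
        };
        for (String line : lines) {
            System.out.println(tokenize(line));
        }
    }
}
```

In this model the misspelled line comes out as IDENTIFIER, COLON, NUMBER, so
the lexer is perfectly happy with it and any complaint has to come from the
parser.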

Hope this helps,

Ric
--
-----+++++*****************************************************+++++++++-------
    ---- Ric Klaren ----- j.klaren at utwente.nl ----- +31 53 4893722  ----
-----+++++*****************************************************+++++++++-------
  Quidquid latine dictum sit, altum viditur.
                 (Whatever is said in Latin sounds profound.)
