[antlr-interest] Newbie question:using lexer grammar

Johannes Luber jaluber at gmx.de
Sun Aug 26 06:40:24 PDT 2007


Mauro Pellicioli wrote:
> For example, I don't know why this grammar:
> 
> 
> document:	DOCUMENT;
> 
> DOCUMENT: '<' (options {greedy=false;} : .)* ALTERNATIVE (options
> {greedy=false;} : .)* '</html>' WS*;
> 
> ALTERNATIVE:	'Destination not found' {System.out.println("OK 1");}
>                 | 'Please make your choice by clicking on the destination
> name below' {System.out.println("OK 2");}
> 		| 'Hotels found' {System.out.println("OK 3");};
>      
> WS : ' ' | '\r' | '\n' |'\t' ;	
> 
> on this html page:
> 
> http://www.booking.com/searchresults.html?return_url=http%3A%2F%2Fwww.booking.com%2Fsearchresults.html&found_addresses=&error_url=http%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dshort-index.htmlerrorc_search_in_invalid%253Dsi%3Bsid%3D19361acb03c3fe7f1def5374b839e6a9%3B&label=short-index.htmlerrorc_search_in_invalid%3Dsi&sid=19361acb03c3fe7f1def5374b839e6a9&order=&addressAddress=&addressCity=&addressZIP=&addressCountry=&si=ai%2Cco%2Cci%2Cre&ss=xyz&checkin_monthday=25&checkin_year_month=2007-8&checkout_monthday=26&checkout_year_month=2007-8&radius=
> 
> gives me the error:
> 
> line 1:0 mismatched input '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0
> Transitional//EN"\n  "http://www.w3.org/T.....
> 
> even if the test prints out the correct kind of page. 
> 
> Is there an error in the grammar?

Testing your grammar in the debugger on the saved page doesn't result in
any error. It is possible that your saved copy of the file begins with
something other than '<'. If you saved it as UTF-16, there will certainly
be a BOM (Byte Order Mark) at the beginning, although the Java file reader
may already swallow that. Or there may be other leading whitespace. ANTLR's
general error recovery strategy is to delete superfluous tokens and to
report an error message like the one you have seen.
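
To check what the saved file really starts with, something like the
following could be used. It is only a rough sketch and not part of the
grammar: the class name LeadingBytesCheck and the method describeStart are
made up for illustration, and it only looks at the raw bytes, so it does not
tell you what the reader used by your test program makes of them.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: reports what the raw bytes of a saved page begin with.
public class LeadingBytesCheck {

    /** Returns a short description of the first bytes of the input. */
    static String describeStart(byte[] data) {
        if (data.length >= 3 && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB && (data[2] & 0xFF) == 0xBF) {
            return "UTF-8 BOM";
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
            return "UTF-16 BE BOM";
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
            return "UTF-16 LE BOM";
        }
        if (data.length > 0 && Character.isWhitespace((char) (data[0] & 0xFF))) {
            return "leading whitespace";
        }
        if (data.length > 0 && data[0] == '<') {
            return "starts with '<'";
        }
        return "other";
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("usage: java LeadingBytesCheck <saved-page>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(describeStart(data));
    }
}
```

If this reports anything other than "starts with '<'", the first character
the lexer sees is probably not the one the DOCUMENT rule expects.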

The best method is to remove the '<', as it isn't interesting to you:

DOCUMENT: (options {greedy=false;} : .)* ALTERNATIVE (options
{greedy=false;} : .)* '</html>' WS*;

Removing the trailing part of the rule, however, causes a mismatched
exception with EOF, which is a strange error. Another point I noticed is
the hardcoding of the search text. Any change to the text displayed on the
webpage can cause your scanner to fail. How about searching for HTML
classes like '<h1 class="sorth1">' near the displayed text? Those
constructs won't change as often as the displayed text.
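
Outside of ANTLR, the idea of keying on the markup could be sketched in
plain Java roughly as below. This assumes the interesting text sits directly
inside an h1 element carrying class="sorth1", which I have not verified for
all three kinds of page; the class name HeadingFinder and the method
headingText are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the text out of the first <h1 class="sorth1"> element.
public class HeadingFinder {

    // Assumption: the heading is written as <h1 class="sorth1" ...>text</h1>.
    private static final Pattern HEADING = Pattern.compile(
            "<h1\\s+class=\"sorth1\"[^>]*>(.*?)</h1>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the trimmed heading text, or null if no such heading exists. */
    static String headingText(String page) {
        Matcher m = HEADING.matcher(page);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String sample = "<html><body><h1 class=\"sorth1\"> Hotels found </h1></body></html>";
        System.out.println(headingText(sample));
    }
}
```

The scanner then depends on the markup instead of on the exact wording, and
the extracted text can still be compared or logged afterwards.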

In any case, your goal is well suited to the filter option. The filter
option causes the lexer to skip any input that doesn't match a token.
In your case, you would need three tokens, one for each alternative, and
a parser rule that matches one occurrence of those tokens.

grammar MauroTest;

options {
	filter=true;
}

ALTERNATIVE1:	'Destination not found';
ALTERNATIVE2:	'Please make your choice by clicking on the destination name below';
ALTERNATIVE3:	'Hotels found';

document
	:	ALTERNATIVE1 {System.out.println("OK 1");}
	|	ALTERNATIVE2 {System.out.println("OK 2");}
	|	ALTERNATIVE3 {System.out.println("OK 3");}
	;

For whatever reason, there was no output of "OK 1", despite ALTERNATIVE1
being recognized. Maybe one has to use separate lexer and parser grammars...
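
As a cross-check that does not depend on ANTLR at all, the effect the filter
option is meant to have here (ignore everything except the three phrases,
report the first one found) can be imitated in plain Java. This is only a
sketch of the intended behaviour, not of how the generated lexer works
internally; the class name PageClassifier and the method classify are made
up for illustration.

```java
// Hypothetical stand-in for the filtering lexer: finds the earliest known phrase.
public class PageClassifier {

    // The three phrases from the grammar, in the order of ALTERNATIVE1..3.
    static final String[] PHRASES = {
        "Destination not found",
        "Please make your choice by clicking on the destination name below",
        "Hotels found"
    };

    /** Returns 1, 2 or 3 for the phrase occurring earliest in the page, or 0 if none occurs. */
    static int classify(String page) {
        int best = 0;
        int bestPos = Integer.MAX_VALUE;
        for (int i = 0; i < PHRASES.length; i++) {
            int pos = page.indexOf(PHRASES[i]);
            if (pos >= 0 && pos < bestPos) {
                bestPos = pos;
                best = i + 1;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String sample = "<html><body><h1>Hotels found</h1></body></html>";
        int kind = classify(sample);
        System.out.println(kind == 0 ? "no match" : "OK " + kind);
    }
}
```

Running it on the saved page text should give the same "OK n" that the
grammar is expected to print, which makes it easier to tell a grammar
problem from an input problem.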

Best regards,
Johannes Luber

