[antlr-interest] Simple lexer grammar doesn't match '\''

Mauro Pellicioli nightwolf at email.it
Wed Aug 29 03:40:55 PDT 2007


Hi, I have this simple grammar (you don't need to read all of it; the
problem should be confined to a single rule):


lexer grammar BookingList;

options {
	filter=true;
}

@lexer::header {
package grammatiche;
}

@lexer::members {
String str;
}


TABLE_RESULTS: '<table class="hotellist" cellspacing="0">' (WS|TR)+
'</table>';

fragment
TR: '<tr>' WS+ FIRST_TD WS+ SEC_TD WS+ '</tr>';

fragment
FIRST_TD: '<td>' WS+ '<a class="hotel"' (options {greedy=false;} : .)*
'</a>' WS+ '</td>';

fragment
SEC_TD: '<td>' (options {greedy=false;} : .)* '<h3>' WS+ LINK (options
{greedy=false;} : .)* '</h3>' WS+ ADDR WS+ DESCRIPTION WS+ '</td>';

fragment
ADDR:'<p class="address">' STRING {str=$STRING.text;} (STRONG WS '(')?
{System.out.println("Address: "+str+"\n");} LINK_GEN  ')</p>';

fragment       
STRONG:	'<strong>' STRING {str+=$STRING.text;} '</strong>';

fragment
DESCRIPTION: '<p>' (options {greedy=false;} : .)* '</p>';

fragment
LINK:'<a href="' STRING_LINK {System.out.println("Link: "+$STRING_LINK.text);}
'">' STRING {System.out.println("Hotel: "+$STRING.text);} '</a>';

fragment
LINK_GEN: '<a href='(options {greedy=false;} : .)* '</a>'; 

fragment
DIV_REVIEW: '<div class="reviewFloater">' (options {greedy=false;} : .)*
'</div>';
	
fragment
STRING: ( ('\u0020'..'\u003B') | '\u003D' | ('\u003F'..'\u007E')
|('\u0080'..'\u017F') )+;

fragment
STRING_LINK:	('a'..'z'|'A'..'Z'|'0'..'9'|'/'|'.'|'?'|'='|'_'|'%'|';'|'-')+;

fragment
INT:  ('0'..'9')+;

WS : ' ' | '\r' | '\n' |'\t' ;


The rule to focus on is this lexer rule:

fragment
LINK:'<a href="' STRING_LINK {System.out.println("Link: "+$STRING_LINK.text);}
'">' STRING {System.out.println("Hotel: "+$STRING.text);} '</a>';

which produces the wrong output when it encounters this input:

<a href="/hotel/us/enfant-plaza.html?sid=b02d5b4438247c402f4a43539dfc9d8c">L’Enfant Plaza Hotel</a>

Output:

Link:
/hotel/us/enfant-plaza.html?label=short-index.htmlerrorc_search_in_invalid%3Dsi;sid=1892815e8db2e96caca618e2377948d8
Hotel: L

Instead of:

Link:/hotel/us/enfant-plaza.html?sid=b02d5b4438247c402f4a43539dfc9d8c
Hotel:L’Enfant Plaza Hotel 
Address:480 L'Enfant Plaza, SW, Washington (Washington DC) 


It seems that the STRING rule fails when it encounters a ' character (hex
value 0x27), even though 0x27 lies inside the first range of STRING
('\u0020'..'\u003B').
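
One way to check the claim above is to test individual code points against
the ranges that STRING lists. The following is only a minimal sketch: the
class and method names are hypothetical and not part of the grammar. It also
tests U+2019 (right single quotation mark), on the assumption that the ’
shown in the sample input might be that character rather than 0x27.

```java
class StringRangeCheck {
    // Mirrors the four alternatives of the STRING fragment rule:
    // 0x20..0x3B, 0x3D, 0x3F..0x7E and 0x80..0x17F.
    static boolean inStringRule(int cp) {
        return (cp >= 0x0020 && cp <= 0x003B)
            || cp == 0x003D
            || (cp >= 0x003F && cp <= 0x007E)
            || (cp >= 0x0080 && cp <= 0x017F);
    }

    public static void main(String[] args) {
        // ASCII apostrophe, U+0027: inside the first range
        System.out.println(inStringRule('\''));   // prints true
        // Right single quotation mark, U+2019 (assumed, see above):
        // above the last range, which ends at U+017F
        System.out.println(inStringRule(0x2019)); // prints false
    }
}
```

If the page really contains U+2019, STRING would stop right after the "L",
which would be consistent with the "Hotel: L" output shown above. Checking
the actual bytes of the page and the encoding used to read it would confirm
or rule this out.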

The entire page on which I run the code is:
http://www.booking.com/searchresults.html?sid=b02d5b4438247c402f4a43539dfc9d8c;checkin_monthday=29;checkin_year_month=2007-8;checkout_monthday=30;checkout_year_month=2007-8;class_interval=1;offset=0;si=ai%2Cco%2Cci%2Cre;ss_all=0;city=20056368;radius=24

Thanks for your help,
regards

PS: I wanted to thank Johannes and Gavin for their help with a previous post,
but I didn't want to flood the mailing list with a new post each time. How
should I reply? :-)