[antlr-interest] Simple lexer grammar doesn't match '\''
Mauro Pellicioli
nightwolf at email.it
Wed Aug 29 03:40:55 PDT 2007
Hi, I have this simple lexer grammar (you don't need to read all of it; the
problem should be in just one rule):
lexer grammar BookingList;
options {
filter=true;
}
@lexer::header {
package grammatiche;
}
@lexer::members {
String str;
}
TABLE_RESULTS: '<table class="hotellist" cellspacing="0">' (WS|TR)+
'</table>';
fragment
TR: '<tr>' WS+ FIRST_TD WS+ SEC_TD WS+ '</tr>';
fragment
FIRST_TD: '<td>' WS+ '<a class="hotel"' (options {greedy=false;} : .)*
'</a>' WS+ '</td>';
fragment
SEC_TD: '<td>' (options {greedy=false;} : .)* '<h3>' WS+ LINK (options
{greedy=false;} : .)* '</h3>' WS+ ADDR WS+ DESCRIPTION WS+ '</td>';
fragment
ADDR:'<p class="address">' STRING {str=$STRING.text;} (STRONG WS '(')?
{System.out.println("Address: "+str+"\n");} LINK_GEN ')</p>';
fragment
STRONG: '<strong>' STRING {str+=$STRING.text;} '</strong>';
fragment
DESCRIPTION: '<p>' (options {greedy=false;} : .)* '</p>';
fragment
LINK:'<a href="' STRING_LINK {System.out.println("Link: "+$STRING_LINK.text);}
'">' STRING {System.out.println("Hotel: "+$STRING.text);} '</a>';
fragment
LINK_GEN: '<a href='(options {greedy=false;} : .)* '</a>';
fragment
DIV_REVIEW: '<div class="reviewFloater">' (options {greedy=false;} : .)*
'</div>';
fragment
STRING: ( ('\u0020'..'\u003B') | '\u003D' | ('\u003F'..'\u007E')
|('\u0080'..'\u017F') )+;
fragment
STRING_LINK: ('a'..'z'|'A'..'Z'|'0'..'9'|'/'|'.'|'?'|'='|'_'|'%'|';'|'-')+;
fragment
INT: ('0'..'9')+;
WS : ' ' | '\r' | '\n' |'\t' ;
The problem is in this lexer rule:
fragment
LINK:'<a href="' STRING_LINK {System.out.println("Link: "+$STRING_LINK.text);}
'">' STRING {System.out.println("Hotel: "+$STRING.text);} '</a>';
which gives the wrong output when it encounters this input:
<a href="/hotel/us/enfant-plaza.html?sid=b02d5b4438247c402f4a43539dfc9d8c">LEnfant Plaza Hotel</a>
Output:
Link: /hotel/us/enfant-plaza.html?label=short-index.htmlerrorc_search_in_invalid%3Dsi;sid=1892815e8db2e96caca618e2377948d8
Hotel: L
Instead of:
Link:/hotel/us/enfant-plaza.html?sid=b02d5b4438247c402f4a43539dfc9d8c
Hotel:LEnfant Plaza Hotel
Address:480 L'Enfant Plaza, SW, Washington (Washington DC)
It seems that the STRING rule fails when it encounters a ' character (hex
value 0x27), even though 0x27 lies inside the first range of STRING
('\u0020'..'\u003B').
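As a quick check of that claim, here is a minimal Java sketch that mirrors the
character ranges of the STRING rule and tests a few characters against them.
The class name StringRangeCheck and the method inStringRange are made up for
this illustration and are not part of the generated lexer. The U+2019 case is
only an assumption: I have not confirmed which apostrophe character the page
actually contains.

```java
public class StringRangeCheck {
    // Hypothetical helper: mirrors the four alternatives of the STRING
    // fragment rule (0x20..0x3B, 0x3D, 0x3F..0x7E, 0x80..0x17F).
    static boolean inStringRange(char c) {
        return (c >= 0x20 && c <= 0x3B)
            || c == 0x3D
            || (c >= 0x3F && c <= 0x7E)
            || (c >= 0x80 && c <= 0x17F);
    }

    public static void main(String[] args) {
        // ASCII apostrophe, the two excluded angle brackets, '=',
        // and a typographic apostrophe (assumed, not confirmed in the input).
        char[] samples = { '\'', '<', '=', '>', (char) 0x2019 };
        for (char c : samples) {
            System.out.printf("U+%04X in STRING range: %b%n",
                              (int) c, inStringRange(c));
        }
    }
}
```

By these ranges 0x27 is accepted, so it may be worth dumping the code point of
the character that follows the L in the real input before blaming STRING.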
The entire page on which I run the code is:
http://www.booking.com/searchresults.html?sid=b02d5b4438247c402f4a43539dfc9d8c;checkin_monthday=29;checkin_year_month=2007-8;checkout_monthday=30;checkout_year_month=2007-8;class_interval=1;offset=0;si=ai%2Cco%2Cci%2Cre;ss_all=0;city=20056368;radius=24
Thanks for your help,
regards
PS I wanted to thank Johannes and Gavin for their help with a previous post,
but I didn't want to flood the mailing list with a new post every time. How
do I reply? :-)