[antlr-interest] Specifying only part of the grammatical structure of an input file
Gerard van de Glind
g.vandeglind at beinformed.nl
Tue Sep 2 07:50:01 PDT 2008
Hi Bill and all,
I guess you could try the following as the last lexer rule in your grammar:
IGNORE : . {$channel=HIDDEN;};
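A rough, untested sketch of where that rule would go, assuming the rest of your lexer stays as posted: only IGNORE is new, and the WS rule is copied from your grammar just to show the position.

```antlr
// ... all other lexer rules from the original grammar stay above this point ...

WS : { tagMode }? => (' '|'\r'|'\t'|'\u000C'|'\n') { $channel=99; } ;

// Catch-all, placed last: a single character that no earlier rule matched
// goes to the hidden channel, so the parser never sees it.
IGNORE : . { $channel=HIDDEN; } ;
```

Keep in mind that this only hides input that no other lexer rule matches. Anything an earlier rule recognizes (a GENERIC_ID inside a tag, for example) is still handed to the parser as a token.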
Regards,
G.J. van de Glind
Software Engineer
Be Informed
Hi,
I want to parse part of an HTML file in order to extract
information. Here is an example input file:
<html>
<head>...</head>
<body>
<all><sorts><of><crazy><tags><and><pcdata>
<tr>
<td>Terence Parr</td>
<td>Knows ANTLR really well!</td>
</tr>
<tr>
<td>Bill Mayfield</td>
<td>Doesn't know ANTLR!</td>
</tr>
<all><sorts><of><crazy><tags><and><pcdata>
</body>
</html>
So I'm only interested in recognizing the individual <tr></tr> rows in
order to extract the <td> labels. I've written something that is
loosely based on
this article -> http://www.antlr.org/wiki/display/ANTLR3/Parsing+XML
Can you tell me how I can make my parser ignore all the crazy tags and
pcdata before the pattern I would like to recognize? This is my
grammar and it's giving me this error: (201) The following
alternatives can never be matched: 1
grammar XMLParser;
@lexer::members {
Boolean tagMode = false;
}
document : .* pattern .* ;
pattern : (otr PCDATA?
             otd PCDATA?
               person=PCDATA
             ctd PCDATA?
             otd PCDATA?
               comment=PCDATA
             ctd PCDATA?
           ctr PCDATA?)* ;
/* BEGIN: specific tags */
otr : TAG_START_OPEN TR (attribute)* TAG_CLOSE;
ctr : TAG_END_OPEN TR (attribute)* TAG_CLOSE;
otd : TAG_START_OPEN TD (attribute)* TAG_CLOSE;
ctd : TAG_END_OPEN TD (attribute)* TAG_CLOSE;
einput : TAG_START_OPEN INPUT (attribute)* TAG_CLOSE;
oa : TAG_START_OPEN A (attribute)* TAG_CLOSE;
ca : TAG_END_OPEN A (attribute)* TAG_CLOSE;
/* END: specific tags */
startTag : TAG_START_OPEN GENERIC_ID (attribute)* TAG_CLOSE ;
endTag : TAG_END_OPEN GENERIC_ID TAG_CLOSE ;
emptyElement: TAG_START_OPEN GENERIC_ID (attribute)* TAG_EMPTY_CLOSE ;
attribute : GENERIC_ID ATTR_EQ ATTR_VALUE ;
/*
LEXER RULES
*/
TAG_START_OPEN : '<' { tagMode = true; } ;
TAG_END_OPEN : '</' { tagMode = true; } ;
TAG_CLOSE : { tagMode }? => '>' { tagMode = false; } ;
TAG_EMPTY_CLOSE : { tagMode }? => '/>' { tagMode = false; } ;
ATTR_EQ : { tagMode }? => '=' ;
ATTR_VALUE : { tagMode }? => ( '"' (~'"')* '"' | '\'' (~'\'')* '\'' ) ;
PCDATA : { !tagMode }? => (~'<')+ ;
/* BEGIN: specific tags */
TR : { tagMode }? => 'tr';
TD : { tagMode }? => 'td';
INPUT : { tagMode }? => 'input';
A : { tagMode }? => 'a';
/* END: Specific tags */
GENERIC_ID : { tagMode }? => ( LETTER | '_' | ':') (NAMECHAR)* ;
fragment NAMECHAR: LETTER | DIGIT | '.' | '-' | '_' | ':' ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
WS : { tagMode }? => (' '|'\r'|'\t'|'\u000C'|'\n') { $channel=99;} ;