[antlr-interest] Specifying only part of the grammatical structure of an input file

Bill Mayfield antlrinterest at gmail.com
Tue Sep 2 07:41:01 PDT 2008


Hi,

I want to parse only part of an HTML file in order to extract
information. Here is an example input file:

<html>
<head>...</head>
<body>
<all><sorts><of><crazy><tags><and><pcdata>

	<tr>
		<td>Terence Parr</td>
		<td>Knows ANTLR really well!</td>
	</tr>
	<tr>
		<td>Bill Mayfield</td>
		<td>Doesn't know ANTLR!</td>
	</tr>


<all><sorts><of><crazy><tags><and><pcdata>
</body>
</html>


So I'm only interested in recognizing the individual <tr></tr> rows in
order to extract the <td> labels. I've written something that is
loosely based on
this article -> http://www.antlr.org/wiki/display/ANTLR3/Parsing+XML


Can you tell me how I can make my parser ignore all the crazy tags and
pcdata before the pattern I would like to recognize? This is my
grammar and it's giving me this error: (201) The following
alternatives can never be matched: 1



grammar XMLParser;

@lexer::members {
    Boolean tagMode = false;
}



document	: 	.* pattern .* ;

pattern		:	(otr PCDATA?
					otd PCDATA?
						person=PCDATA
					ctd PCDATA?
					
					otd PCDATA?
						comment=PCDATA
					ctd PCDATA?
				ctr PCDATA?)* ;
		
	
/* BEGIN: specific tags */
otr			:	TAG_START_OPEN TR (attribute)* TAG_CLOSE;
ctr			:	TAG_END_OPEN TR (attribute)* TAG_CLOSE;
otd			:	TAG_START_OPEN TD (attribute)* TAG_CLOSE;
ctd			:	TAG_END_OPEN TD (attribute)* TAG_CLOSE;
einput		:	TAG_START_OPEN INPUT (attribute)* TAG_CLOSE;
oa			:	TAG_START_OPEN A (attribute)* TAG_CLOSE;
ca			:	TAG_END_OPEN A (attribute)* TAG_CLOSE;
/* END: specific tags */

startTag	:	TAG_START_OPEN GENERIC_ID (attribute)* TAG_CLOSE ;
endTag		:	TAG_END_OPEN GENERIC_ID TAG_CLOSE ;
emptyElement:	TAG_START_OPEN GENERIC_ID  (attribute)* TAG_EMPTY_CLOSE ;
attribute	:	GENERIC_ID ATTR_EQ ATTR_VALUE ;



/*
  LEXER RULES
*/

TAG_START_OPEN 	:	'<' { tagMode = true; } ;
	
TAG_END_OPEN	:	'</' { tagMode = true; } ;
	
TAG_CLOSE		:	{ tagMode }? => '>' { tagMode = false; } ;
	
TAG_EMPTY_CLOSE	:	{ tagMode }?	=> '/>' { tagMode = false; } ;

ATTR_EQ			:	{ tagMode }? => '=' ;

ATTR_VALUE		:	{ tagMode }? =>	( '"' (~'"')* '"' | '\'' (~'\'')* '\'' ) ;

PCDATA			:	{ !tagMode }? => (~'<')+ ;
	
/* BEGIN: specific tags */
TR				:	{ tagMode }? => 'tr';
TD 				:	{ tagMode }? => 'td';
INPUT			:	{ tagMode }? => 'input';
A				:	{ tagMode }? => 'a';
/* END: Specific tags */

GENERIC_ID    	: 	{ tagMode }? => ( LETTER | '_' | ':') (NAMECHAR)* ;

fragment NAMECHAR:	 LETTER | DIGIT | '.' | '-' | '_' | ':' ;

fragment DIGIT	:    	'0'..'9' ;

fragment LETTER	:	'a'..'z' | 'A'..'Z' ;

WS				:	{ tagMode }? => (' '|'\r'|'\t'|'\u000C'|'\n') { $channel=99;}	;
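

To make the goal concrete, here is a rough sketch in plain Java of the
result I am after. It is not ANTLR and not a fix for the grammar above;
it swaps in a regular expression just to show the "skip everything until
a row matches" behaviour. The class RowExtractor and the method
extractRows are hypothetical names made up for this illustration, and the
sketch assumes every row holds exactly two <td> cells containing plain
text only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not generated by ANTLR.
final class RowExtractor {

    // Assumption: a row is <tr> with exactly two <td> cells of plain text.
    // Attributes on the tags are tolerated; nested markup inside a cell is not.
    private static final Pattern ROW = Pattern.compile(
        "<tr\\b[^>]*>\\s*"
        + "<td\\b[^>]*>([^<]*)</td>\\s*"
        + "<td\\b[^>]*>([^<]*)</td>\\s*"
        + "</tr>",
        Pattern.CASE_INSENSITIVE);

    // Returns one {person, comment} pair per matching row, in document order.
    public static List<String[]> extractRows(String html) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        // find() skips over any text that does not start a matching row,
        // so unrelated tags and pcdata are ignored.
        while (m.find()) {
            rows.add(new String[] { m.group(1).trim(), m.group(2).trim() });
        }
        return rows;
    }

    public static void main(String[] args) {
        String html =
            "<html><body><all><sorts><of><crazy><tags><and><pcdata>\n"
            + "<tr>\n<td>Terence Parr</td>\n<td>Knows ANTLR really well!</td>\n</tr>\n"
            + "<tr>\n<td>Bill Mayfield</td>\n<td>Doesn't know ANTLR!</td>\n</tr>\n"
            + "<all><sorts><of><crazy><tags><and><pcdata>\n"
            + "</body></html>";
        for (String[] row : extractRows(html)) {
            System.out.println(row[0] + " -> " + row[1]);
        }
    }
}
```

What I would like is the same effect from the ANTLR grammar: everything
that is not one of these rows gets ignored, and person and comment are
available for each row that is.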

