[antlr-interest] Specifying only part of the grammatical structure of an input file
Bill Mayfield
antlrinterest at gmail.com
Tue Sep 2 07:51:19 PDT 2008
PS: If I change the document rule as shown below, I get no grammar
errors, but the tokens after my pattern are then unexpected and I get
an exception:

document : .* pattern .* ;

changed into

document : .* pattern ;

Kind regards,
Bill
On Tue, Sep 2, 2008 at 4:41 PM, Bill Mayfield <antlrinterest at gmail.com> wrote:
> Hi,
>
> I want to parse a part of an HTML file in order to extract
> information. Take a look at my input file for example:
>
> <html>
> <head>...</head>
> <body>
> <all><sorts><of><crazy><tags><and><pcdata>
>
> <tr>
> <td>Terence Parr</td>
> <td>Knows ANTLR really well!</td>
> </tr>
> <tr>
> <td>Bill Mayfield</td>
> <td>Doesn't know ANTLR!</td>
> </tr>
>
>
> <all><sorts><of><crazy><tags><and><pcdata>
> </body>
> </html>
>
>
> So I'm only interested in recognizing the individual <tr></tr> rows in
> order to extract the <td> labels. I've written something that is
> loosely based on this article:
> http://www.antlr.org/wiki/display/ANTLR3/Parsing+XML
>
>
> Can you tell me how I can make my parser ignore all the crazy tags and
> pcdata before the pattern I would like to recognize? This is my
> grammar and it's giving me this error: (201) The following
> alternatives can never be matched: 1
>
>
>
> grammar XMLParser;
>
> @lexer::members {
>     boolean tagMode = false;
> }
>
>
>
> document : .* pattern .* ;
>
> pattern : (otr PCDATA?
>            otd PCDATA?
>              person=PCDATA
>            ctd PCDATA?
>
>            otd PCDATA?
>              comment=PCDATA
>            ctd PCDATA?
>            ctr PCDATA?)* ;
>
>
> /* BEGIN: specific tags */
> otr : TAG_START_OPEN TR (attribute)* TAG_CLOSE;
> ctr : TAG_END_OPEN TR (attribute)* TAG_CLOSE;
> otd : TAG_START_OPEN TD (attribute)* TAG_CLOSE;
> ctd : TAG_END_OPEN TD (attribute)* TAG_CLOSE;
> einput : TAG_START_OPEN INPUT (attribute)* TAG_CLOSE;
> oa : TAG_START_OPEN A (attribute)* TAG_CLOSE;
> ca : TAG_END_OPEN A (attribute)* TAG_CLOSE;
> /* END: specific tags */
>
> startTag : TAG_START_OPEN GENERIC_ID (attribute)* TAG_CLOSE ;
> endTag : TAG_END_OPEN GENERIC_ID TAG_CLOSE ;
> emptyElement: TAG_START_OPEN GENERIC_ID (attribute)* TAG_EMPTY_CLOSE ;
> attribute : GENERIC_ID ATTR_EQ ATTR_VALUE ;
>
>
>
> /*
> LEXER RULES
> */
>
> TAG_START_OPEN : '<' { tagMode = true; } ;
>
> TAG_END_OPEN : '</' { tagMode = true; } ;
>
> TAG_CLOSE : { tagMode }? => '>' { tagMode = false; } ;
>
> TAG_EMPTY_CLOSE : { tagMode }? => '/>' { tagMode = false; } ;
>
> ATTR_EQ : { tagMode }? => '=' ;
>
> ATTR_VALUE : { tagMode }? => ( '"' (~'"')* '"' | '\'' (~'\'')* '\'' ) ;
>
> PCDATA : { !tagMode }? => (~'<')+ ;
>
> /* BEGIN: specific tags */
> TR : { tagMode }? => 'tr';
> TD : { tagMode }? => 'td';
> INPUT : { tagMode }? => 'input';
> A : { tagMode }? => 'a';
> /* END: Specific tags */
>
> GENERIC_ID : { tagMode }? => ( LETTER | '_' | ':') (NAMECHAR)* ;
>
> fragment NAMECHAR: LETTER | DIGIT | '.' | '-' | '_' | ':' ;
>
> fragment DIGIT : '0'..'9' ;
>
> fragment LETTER : 'a'..'z' | 'A'..'Z' ;
>
> WS : { tagMode }? => (' '|'\r'|'\t'|'\u000C'|'\n') { $channel=99;} ;
>
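
For illustration, the behaviour asked about above (skip arbitrary tags and text, pick out only the <tr> rows) amounts to trying the row pattern at every token position and discarding one token whenever it does not match. The plain-Java sketch below shows that loop written by hand. It is not ANTLR-generated code and does not address error 201. The class RowScan and its methods are hypothetical names, and the sketch simplifies the grammar above: whitespace-only text is dropped, and a row must hold exactly two <td> cells with text in each.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: a hand-written sketch, not ANTLR output. */
final class RowScan {

    /** Splits the input into tag tokens ("<...>") and the text between tags. */
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            if (html.charAt(i) == '<') {
                int close = html.indexOf('>', i);
                if (close < 0) close = html.length() - 1; // unterminated tag: take the rest
                tokens.add(html.substring(i, close + 1));
                i = close + 1;
            } else {
                int open = html.indexOf('<', i);
                if (open < 0) open = html.length();
                String text = html.substring(i, open).trim();
                if (!text.isEmpty()) tokens.add(text); // assumption: ignore whitespace-only text
                i = open;
            }
        }
        return tokens;
    }

    /** True for "<name>" or "<name attr...>" (opening tag, attributes allowed). */
    static boolean isOpen(String tok, String name) {
        return tok.equals("<" + name + ">") || tok.startsWith("<" + name + " ");
    }

    /** Text tokens are the only tokens that do not start with '<'. */
    static boolean isText(String tok) {
        return !tok.startsWith("<");
    }

    /** Returns {person, comment} if a complete row starts at position p, else null. */
    static String[] matchRow(List<String> t, int p) {
        if (p + 8 > t.size()) return null; // a row needs 8 tokens
        boolean ok = isOpen(t.get(p), "tr")
            && isOpen(t.get(p + 1), "td") && isText(t.get(p + 2)) && t.get(p + 3).equals("</td>")
            && isOpen(t.get(p + 4), "td") && isText(t.get(p + 5)) && t.get(p + 6).equals("</td>")
            && t.get(p + 7).equals("</tr>");
        return ok ? new String[] { t.get(p + 2), t.get(p + 5) } : null;
    }

    /** Tries the row pattern at each position; on failure skips a single token. */
    static List<String[]> extractRows(String html) {
        List<String> t = tokenize(html);
        List<String[]> rows = new ArrayList<>();
        int p = 0;
        while (p < t.size()) {
            String[] row = matchRow(t, p);
            if (row != null) {
                rows.add(row);
                p += 8; // consume the whole row
            } else {
                p++;    // not a row here: skip one token and retry
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<body><all><sorts>junk<tr><td>Terence Parr</td>"
            + "<td>Knows ANTLR really well!</td></tr><of><crazy></body>";
        for (String[] row : extractRows(html)) {
            System.out.println(row[0] + ": " + row[1]);
        }
    }
}
```

The key design point is the fallback branch: a failed match consumes exactly one token, so any unknown tag or text before, between, or after the rows is skipped without needing a rule for it.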