[antlr-interest] Specifying only part of the grammatical structure of an input file

Tue Sep 2 07:51:19 PDT 2008

PS: If I change the document rule as shown below, I get no grammar
errors, but the tokens after my pattern are not expected and I get an
exception:

document        :       .* pattern .* ;

changed into

document        :       .* pattern ;

Kind regards,
Bill

On Tue, Sep 2, 2008 at 4:41 PM, Bill Mayfield <antlrinterest at gmail.com> wrote:
> Hi,
>
> I want to parse a part of an HTML file in order to extract
> information. Take a look at my input file for example:
>
> <html>
> <head>...</head>
> <body>
> <all><sorts><of><crazy><tags><and><pcdata>
>
>        <tr>
>                <td>Terence Parr</td>
>                <td>Knows ANTLR really well!</td>
>        </tr>
>        <tr>
>                <td>Bill Mayfield</td>
>                <td>Doesn't know ANTLR!</td>
>        </tr>
>
>
> <all><sorts><of><crazy><tags><and><pcdata>
> </body>
> </html>
>
>
> So I'm only interested in recognizing the individual <tr></tr> rows in
> order to extract the <td> labels. I've written something that is
> loosely based on
> this article -> http://www.antlr.org/wiki/display/ANTLR3/Parsing+XML
>
>
> Can you tell me how I can make my parser ignore all the crazy tags and
> pcdata before the pattern I would like to recognize? This is my
> grammar and it's giving me this error: (201) The following
> alternatives can never be matched: 1
>
>
>
> grammar XMLParser;
>
> @lexer::members {
>    boolean tagMode = false;
> }
>
>
>
> document        :       .* pattern .* ;
>
> pattern         :       (otr PCDATA?
>                                        otd PCDATA?
>                                                person=PCDATA
>                                        ctd PCDATA?
>
>                                        otd PCDATA?
>                                                comment=PCDATA
>                                        ctd PCDATA?
>                                ctr PCDATA?)* ;
>
>
> /* BEGIN: specific tags */
> otr                     :       TAG_START_OPEN TR (attribute)* TAG_CLOSE;
> ctr                     :       TAG_END_OPEN TR (attribute)* TAG_CLOSE;
> otd                     :       TAG_START_OPEN TD (attribute)* TAG_CLOSE;
> ctd                     :       TAG_END_OPEN TD (attribute)* TAG_CLOSE;
> einput          :       TAG_START_OPEN INPUT (attribute)* TAG_CLOSE;
> oa                      :       TAG_START_OPEN A (attribute)* TAG_CLOSE;
> ca                      :       TAG_END_OPEN A (attribute)* TAG_CLOSE;
> /* END: specific tags */
>
> startTag        :       TAG_START_OPEN GENERIC_ID (attribute)* TAG_CLOSE ;
> endTag          :       TAG_END_OPEN GENERIC_ID TAG_CLOSE ;
> emptyElement:   TAG_START_OPEN GENERIC_ID  (attribute)* TAG_EMPTY_CLOSE ;
> attribute       :       GENERIC_ID ATTR_EQ ATTR_VALUE ;
>
>
>
> /*
>  LEXER RULES
> */
>
> TAG_START_OPEN  :       '<' { tagMode = true; } ;
>
> TAG_END_OPEN    :       '</' { tagMode = true; } ;
>
> TAG_CLOSE               :       { tagMode }? => '>' { tagMode = false; } ;
>
> TAG_EMPTY_CLOSE :       { tagMode }?    => '/>' { tagMode = false; } ;
>
> ATTR_EQ                 :       { tagMode }? => '=' ;
>
> ATTR_VALUE              :       { tagMode }? => ( '"' (~'"')* '"' | '\'' (~'\'')* '\'' ) ;
>
> PCDATA                  :       { !tagMode }? => (~'<')+ ;
>
> /* BEGIN: specific tags */
> TR                              :       { tagMode }? => 'tr';
> TD                              :       { tagMode }? => 'td';
> INPUT                   :       { tagMode }? => 'input';
> A                               :       { tagMode }? => 'a';
> /* END: Specific tags */
>
> GENERIC_ID      :       { tagMode }? => ( LETTER | '_' | ':') (NAMECHAR)* ;
>
> fragment NAMECHAR:       LETTER | DIGIT | '.' | '-' | '_' | ':' ;
>
> fragment DIGIT  :       '0'..'9' ;
>
> fragment LETTER :       'a'..'z' | 'A'..'Z' ;
>
> WS                              :       { tagMode }? => (' '|'\r'|'\t'|'\u000C'|'\n') { $channel=99;}   ;
>
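
For comparison, the goal described above (pick out the <tr> rows and
ignore every other tag and all other text) can be sketched outside ANTLR.
This is only a minimal sketch and not a fix for the grammar: it assumes
Python's standard html.parser module instead of ANTLR, and the names
RowExtractor and extract_rows are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Collect the text of every <td> inside every <tr>; ignore the rest.

    Hypothetical helper, not part of ANTLR or of the grammar above.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the <tr> currently open, or None
        self._cell = None   # text pieces of the <td> currently open, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []
        # Any other tag (<all>, <sorts>, ...) is simply skipped.

    def handle_data(self, data):
        # Text is kept only while a <td> inside a <tr> is open.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return a list of rows; each row is a list of <td> texts."""
    parser = RowExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Calling extract_rows on the sample input from the message should give one
list per <tr>, each holding the person and the comment. The design point
is the same one the grammar is after: only the tags of interest change
the state, and everything else falls through untouched.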