[antlr-interest] Specifying only part of the grammatical structure of an input file

Gerard van de Glind g.vandeglind at beinformed.nl
Tue Sep 2 07:50:01 PDT 2008


Hi Bill and all,

I guess you could try the following as the last lexer rule in your grammar:
        IGNORE         : .      {$channel=HIDDEN;};
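
To make the placement concrete, here is a rough, untested sketch of how the
end of your lexer section might look. It assumes the Java target, where
HIDDEN is the predefined constant for channel 99 (the value your WS rule
already uses). As far as I know, ANTLR 3 prefers the rule listed first when
two lexer rules can match the same input, which is why the catch-all goes
last.

```
// ... all of your existing lexer rules (TAG_START_OPEN, PCDATA, TR, TD,
// GENERIC_ID, ...) stay above this point, unchanged ...

WS      : { tagMode }? => (' '|'\r'|'\t'|'\u000C'|'\n') { $channel=HIDDEN; } ;

// Catch-all: matches any single character that no earlier rule accepts
// and sends it to the hidden channel, so the parser never sees it.
IGNORE  : . { $channel=HIDDEN; } ;
```

Note that this only hides characters that no other lexer rule accepts. Tags
such as <all> should still be tokenized as TAG_START_OPEN GENERIC_ID
TAG_CLOSE, so the parser rules would still have to deal with those.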

Regards,

G.J. van de Glind
Software Engineer
Be Informed



Hi,

I want to parse part of an HTML file in order to extract
information. Here is an example of my input file:

<html>
<head>...</head>
<body>
<all><sorts><of><crazy><tags><and><pcdata>

        <tr>
                <td>Terence Parr</td>
                <td>Knows ANTLR really well!</td>
        </tr>
        <tr>
                <td>Bill Mayfield</td>
                <td>Doesn't know ANTLR!</td>
        </tr>


<all><sorts><of><crazy><tags><and><pcdata>
</body>
</html>


So I'm only interested in recognizing the individual <tr></tr> rows in
order to extract the <td> labels. I've written something that is
loosely based on
this article -> http://www.antlr.org/wiki/display/ANTLR3/Parsing+XML


Can you tell me how I can make my parser ignore all the crazy tags and
pcdata before the pattern I would like to recognize? This is my
grammar and it's giving me this error: (201) The following
alternatives can never be matched: 1



grammar XMLParser;

@lexer::members {
    boolean tagMode = false;
}



document        :       .* pattern .* ;

pattern         :       (otr PCDATA?
                                        otd PCDATA?
                                                person=PCDATA
                                        ctd PCDATA?

                                        otd PCDATA?
                                                comment=PCDATA
                                        ctd PCDATA?
                                ctr PCDATA?)* ;


/* BEGIN: specific tags */
otr                     :       TAG_START_OPEN TR (attribute)* TAG_CLOSE;
ctr                     :       TAG_END_OPEN TR (attribute)* TAG_CLOSE;
otd                     :       TAG_START_OPEN TD (attribute)* TAG_CLOSE;
ctd                     :       TAG_END_OPEN TD (attribute)* TAG_CLOSE;
einput          :       TAG_START_OPEN INPUT (attribute)* TAG_CLOSE;
oa                      :       TAG_START_OPEN A (attribute)* TAG_CLOSE;
ca                      :       TAG_END_OPEN A (attribute)* TAG_CLOSE;
/* END: specific tags */

startTag        :       TAG_START_OPEN GENERIC_ID (attribute)* TAG_CLOSE ;
endTag          :       TAG_END_OPEN GENERIC_ID TAG_CLOSE ;
emptyElement:   TAG_START_OPEN GENERIC_ID  (attribute)* TAG_EMPTY_CLOSE ;
attribute       :       GENERIC_ID ATTR_EQ ATTR_VALUE ;



/*
  LEXER RULES
*/

TAG_START_OPEN  :       '<' { tagMode = true; } ;

TAG_END_OPEN    :       '</' { tagMode = true; } ;

TAG_CLOSE               :       { tagMode }? => '>' { tagMode = false; } ;

TAG_EMPTY_CLOSE :       { tagMode }?    => '/>' { tagMode = false; } ;

ATTR_EQ                 :       { tagMode }? => '=' ;

ATTR_VALUE              :       { tagMode }? => ( '"' (~'"')* '"' | '\'' (~'\'')* '\'' ) ;

PCDATA                  :       { !tagMode }? => (~'<')+ ;

/* BEGIN: specific tags */
TR                              :       { tagMode }? => 'tr';
TD                              :       { tagMode }? => 'td';
INPUT                   :       { tagMode }? => 'input';
A                               :       { tagMode }? => 'a';
/* END: Specific tags */

GENERIC_ID      :       { tagMode }? => ( LETTER | '_' | ':') (NAMECHAR)* ;

fragment NAMECHAR:       LETTER | DIGIT | '.' | '-' | '_' | ':' ;

fragment DIGIT  :       '0'..'9' ;

fragment LETTER :       'a'..'z' | 'A'..'Z' ;

WS                              :       { tagMode }? => (' '|'\r'|'\t'|'\u000C'|'\n') { $channel=99;}   ;
