[antlr-interest] Specifying only part of the grammatical structure of an input file

Tue Sep 2 08:03:30 PDT 2008

Hi G.J.,

Thanks for your feedback. If I do what you suggest, I get an
UnwantedTokenException and then some MismatchedTokenExceptions on the
'garbage' that comes before my pattern. The 'garbage' that comes after
my pattern seems to be ignored...

I don't understand how this is a lexical thing though... What I want
to be able to say is:

garbage_tokens     pattern     garbage_tokens

where pattern is a certain well-defined structure in the HTML. I want
to be able to ignore everything before and after the pattern. Maybe
I'm wasting my time trying to do this with ANTLR?
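
To make the "garbage, pattern, garbage" idea concrete, here is a minimal
sketch outside ANTLR. It is a plain regular-expression scan in Python, not
a grammar, and the names extract_rows, ROW_RE and CELL_RE are my own
assumptions for illustration. It only shows the behaviour I am after:
everything that is not a two-cell <tr> row is skipped.

```python
import re

# Hypothetical helpers for illustration only; these names and regexes
# are assumptions and are not part of the ANTLR grammar in this thread.
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr\s*>", re.IGNORECASE | re.DOTALL)
CELL_RE = re.compile(r"<td\b[^>]*>(.*?)</td\s*>", re.IGNORECASE | re.DOTALL)


def extract_rows(html):
    """Return (person, comment) tuples, one per <tr> row with exactly two <td> cells."""
    rows = []
    for row in ROW_RE.finditer(html):
        # Collect the text of every <td> inside this row.
        cells = [cell.strip() for cell in CELL_RE.findall(row.group(1))]
        if len(cells) == 2:
            rows.append((cells[0], cells[1]))
    return rows


SAMPLE = """<html>
<head>...</head>
<body>
<all><sorts><of><crazy><tags><and><pcdata>
        <tr>
                <td>Terence Parr</td>
                <td>Knows ANTLR really well!</td>
        </tr>
        <tr>
                <td>Bill Mayfield</td>
                <td>Doesn't know ANTLR!</td>
        </tr>
<all><sorts><of><crazy><tags><and><pcdata>
</body>
</html>"""

if __name__ == "__main__":
    for person, comment in extract_rows(SAMPLE):
        print(person, "|", comment)
```

The scan never has to describe the surrounding tags at all, which is the
behaviour I would like from the parser as well.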

Thanks for helping though!

Regards,
Bill

On Tue, Sep 2, 2008 at 4:54 PM, Gerard van de Glind
<g.vandeglind at beinformed.nl> wrote:
> Oh yeah,
>
> Change rule document into :
> document        :       pattern ;
>
> Try to separate your lexer and grammar rules.
> (So, don't do things like .* in a grammar rule.)
>
> Cheers, Gerard
>
> G.J. van de Glind
> Software Engineer
> Be Informed
>
>
>
>
>
> PS: If I change rule document I get no grammar errors but the tokens
> after my pattern are not expected and I get an exception...
>
> document        :       .* pattern .* ;
>
> changed into
>
> document        :       .* pattern ;
>
>
> Kind regards,
> Bill
>
>
>
>
> On Tue, Sep 2, 2008 at 4:41 PM, Bill Mayfield <antlrinterest at gmail.com> wrote:
>> Hi,
>>
>> I want to parse a part of an HTML file in order to extract
>> information. Take a look at my input file for example:
>>
>> <html>
>> <head>...</head>
>> <body>
>> <all><sorts><of><crazy><tags><and><pcdata>
>>
>>        <tr>
>>                <td>Terence Parr</td>
>>                <td>Knows ANTLR really well!</td>
>>        </tr>
>>        <tr>
>>                <td>Bill Mayfield</td>
>>                <td>Doesn't know ANTLR!</td>
>>        </tr>
>>
>>
>> <all><sorts><of><crazy><tags><and><pcdata>
>> </body>
>> </html>
>>
>>
>> So I'm only interested in recognizing the individual <tr></tr> rows in
>> order to extract the <td> labels. I've written something that is
>> loosely based on this article:
>> http://www.antlr.org/wiki/display/ANTLR3/Parsing+XML
>>
>>
>> Can you tell me how I can make my parser ignore all the crazy tags and
>> pcdata before the pattern I would like to recognize? This is my
>> grammar and it's giving me this error: (201) The following
>> alternatives can never be matched: 1
>>
>>
>>
>> grammar XMLParser;
>>
>> @lexer::members {
>>    Boolean tagMode = false;
>> }
>>
>>
>>
>> document        :       .* pattern .* ;
>>
>> pattern         :       (otr PCDATA?
>>                                        otd PCDATA?
>>                                                person=PCDATA
>>                                        ctd PCDATA?
>>
>>                                        otd PCDATA?
>>                                                comment=PCDATA
>>                                        ctd PCDATA?
>>                                ctr PCDATA?)* ;
>>
>>
>> /* BEGIN: specific tags */
>> otr                     :       TAG_START_OPEN TR (attribute)* TAG_CLOSE;
>> ctr                     :       TAG_END_OPEN TR (attribute)* TAG_CLOSE;
>> otd                     :       TAG_START_OPEN TD (attribute)* TAG_CLOSE;
>> ctd                     :       TAG_END_OPEN TD (attribute)* TAG_CLOSE;
>> einput          :       TAG_START_OPEN INPUT (attribute)* TAG_CLOSE;
>> oa                      :       TAG_START_OPEN A (attribute)* TAG_CLOSE;
>> ca                      :       TAG_END_OPEN A (attribute)* TAG_CLOSE;
>> /* END: specific tags */
>>
>> startTag        :       TAG_START_OPEN GENERIC_ID (attribute)* TAG_CLOSE ;
>> endTag          :       TAG_END_OPEN GENERIC_ID TAG_CLOSE ;
>> emptyElement:   TAG_START_OPEN GENERIC_ID  (attribute)* TAG_EMPTY_CLOSE ;
>> attribute       :       GENERIC_ID ATTR_EQ ATTR_VALUE ;
>>
>>
>>
>> /*
>>  LEXER RULES
>> */
>>
>> TAG_START_OPEN  :       '<' { tagMode = true; } ;
>>
>> TAG_END_OPEN    :       '</' { tagMode = true; } ;
>>
>> TAG_CLOSE               :       { tagMode }? => '>' { tagMode = false; } ;
>>
>> TAG_EMPTY_CLOSE :       { tagMode }?    => '/>' { tagMode = false; } ;
>>
>> ATTR_EQ                 :       { tagMode }? => '=' ;
>>
>> ATTR_VALUE              :       { tagMode }? => ( '"' (~'"')* '"' | '\'' (~'\'')* '\'' ) ;
>>
>> PCDATA                  :       { !tagMode }? => (~'<')+ ;
>>
>> /* BEGIN: specific tags */
>> TR                              :       { tagMode }? => 'tr';
>> TD                              :       { tagMode }? => 'td';
>> INPUT                   :       { tagMode }? => 'input';
>> A                               :       { tagMode }? => 'a';
>> /* END: Specific tags */
>>
>> GENERIC_ID      :       { tagMode }? => ( LETTER | '_' | ':') (NAMECHAR)* ;
>>
>> fragment NAMECHAR:       LETTER | DIGIT | '.' | '-' | '_' | ':' ;
>>
>> fragment DIGIT  :       '0'..'9' ;
>>
>> fragment LETTER :       'a'..'z' | 'A'..'Z' ;
>>
>> WS                              :       { tagMode }? => (' '|'\r'|'\t'|'\u000C'|'\n') { $channel=99;}   ;
>>
>