[antlr-interest] Specifying only part of the grammatical structure of an input file

Johannes Luber jaluber at gmx.de
Tue Sep 2 08:14:11 PDT 2008


Bill Mayfield wrote:
> Hi G.J.,
> 
> Thanks for your feedback. I get an UnwantedTokenException and then
> some MismatchedTokenExceptions on the 'garbage' that comes before my
> pattern if I do what you are saying. The 'garbage' that comes after my
> pattern seems to be ignored...
> 
> I don't understand how this is a lexical thing though... What I want
> to be able to say is:
> 
> garbage_tokens     pattern     garbage_tokens
> 
> where pattern is a certain well-defined structure in the html. I want
> to be able to ignore everything before and after the pattern. Maybe
> I'm wasting my time trying to do this with ANTLR?

You can use a separate lexer grammar with the option "filter=true;" (the
option won't work in a combined grammar). A filter lexer ignores any input
for which it can't find a matching token definition, so you can at least
skip the tags you are not interested in.
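
A minimal sketch of that setup in ANTLR 3 syntax, as two separate grammar
files. The names RowFilter, RowParser, OTR, CTR, CELL, document and row are
made up for illustration, and the token rules assume lowercase tags and
cells that contain plain text only:

```
// ---- RowFilter.g (hypothetical file name) ----
lexer grammar RowFilter;
options { filter = true; }   // skip any input that matches no rule below

// Only the pieces of interest get token definitions. Everything else
// (other tags, pcdata, whitespace between rows) is skipped by the lexer.
// Caveat: '<tr' also matches the start of any tag name beginning with "tr".
OTR  : '<tr' (~'>')* '>' ;
CTR  : '</tr>' ;
// One token per cell; a cell with nested tags will not match and is skipped.
CELL : '<td' (~'>')* '>' (~'<')* '</td>' ;

// ---- RowParser.g (hypothetical file name) ----
parser grammar RowParser;
options { tokenVocab = RowFilter; }   // reuse the token types of the lexer

document : row* EOF ;
row      : OTR person=CELL comment=CELL CTR ;
```

With this split the parser only ever sees OTR, CELL and CTR tokens, so the
document rule no longer needs ".*" on either side of the pattern.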

Johannes
> 
> Thanks for helping though!
> 
> 
> Regards,
> Bill
> 
> 
> On Tue, Sep 2, 2008 at 4:54 PM, Gerard van de Glind
> <g.vandeglind at beinformed.nl> wrote:
>> Oh yeah,
>>
>> Change rule document into :
>> document        :       pattern ;
>>
>> Try to separate your lexer and grammar rules.
>> (So, don't do things like .* in a grammar rule.)
>>
>> Cheers, Gerard
>>
>> G.J. van de Glind
>> Software Engineer
>> Be Informed
>>
>>
>>
>>
>>
>> PS: If I change rule document I get no grammar errors but the tokens
>> after my pattern are not expected and I get an exception...
>>
>> document        :       .* pattern .* ;
>>
>> changed into
>>
>> document        :       .* pattern ;
>>
>>
>> Kind regards,
>> Bill
>>
>>
>>
>>
>> On Tue, Sep 2, 2008 at 4:41 PM, Bill Mayfield <antlrinterest at gmail.com> wrote:
>>> Hi,
>>>
>>> I want to parse a part of an HTML file in order to extract
>>> information. Take a look at my input file for example:
>>>
>>> <html>
>>> <head>...</head>
>>> <body>
>>> <all><sorts><of><crazy><tags><and><pcdata>
>>>
>>>        <tr>
>>>                <td>Terence Parr</td>
>>>                <td>Knows ANTLR really well!</td>
>>>        </tr>
>>>        <tr>
>>>                <td>Bill Mayfield</td>
>>>                <td>Doesn't know ANTLR!</td>
>>>        </tr>
>>>
>>>
>>> <all><sorts><of><crazy><tags><and><pcdata>
>>> </body>
>>> </html>
>>>
>>>
>>> So I'm only interested in recognizing the individual <tr></tr> rows in
>>> order to extract the <td> labels. I've written something that is
>>> loosely based on
>>> this article -> http://www.antlr.org/wiki/display/ANTLR3/Parsing+XML
>>>
>>>
>>> Can you tell me how I can make my parser ignore all the crazy tags and
>>> pcdata before the pattern I would like to recognize? This is my
>>> grammar and it's giving me this error: (201) The following
>>> alternatives can never be matched: 1
>>>
>>>
>>>
>>> grammar XMLParser;
>>>
>>> @lexer::members {
>>>    Boolean tagMode = false;
>>> }
>>>
>>>
>>>
>>> document        :       .* pattern .* ;
>>>
>>> pattern         :       (otr PCDATA?
>>>                                        otd PCDATA?
>>>                                                person=PCDATA
>>>                                        ctd PCDATA?
>>>
>>>                                        otd PCDATA?
>>>                                                comment=PCDATA
>>>                                        ctd PCDATA?
>>>                                ctr PCDATA?)* ;
>>>
>>>
>>> /* BEGIN: specific tags */
>>> otr                     :       TAG_START_OPEN TR (attribute)* TAG_CLOSE;
>>> ctr                     :       TAG_END_OPEN TR (attribute)* TAG_CLOSE;
>>> otd                     :       TAG_START_OPEN TD (attribute)* TAG_CLOSE;
>>> ctd                     :       TAG_END_OPEN TD (attribute)* TAG_CLOSE;
>>> einput          :       TAG_START_OPEN INPUT (attribute)* TAG_CLOSE;
>>> oa                      :       TAG_START_OPEN A (attribute)* TAG_CLOSE;
>>> ca                      :       TAG_END_OPEN A (attribute)* TAG_CLOSE;
>>> /* END: specific tags */
>>>
>>> startTag        :       TAG_START_OPEN GENERIC_ID (attribute)* TAG_CLOSE ;
>>> endTag          :       TAG_END_OPEN GENERIC_ID TAG_CLOSE ;
>>> emptyElement:   TAG_START_OPEN GENERIC_ID  (attribute)* TAG_EMPTY_CLOSE ;
>>> attribute       :       GENERIC_ID ATTR_EQ ATTR_VALUE ;
>>>
>>>
>>>
>>> /*
>>>  LEXER RULES
>>> */
>>>
>>> TAG_START_OPEN  :       '<' { tagMode = true; } ;
>>>
>>> TAG_END_OPEN    :       '</' { tagMode = true; } ;
>>>
>>> TAG_CLOSE               :       { tagMode }? => '>' { tagMode = false; } ;
>>>
>>> TAG_EMPTY_CLOSE :       { tagMode }?    => '/>' { tagMode = false; } ;
>>>
>>> ATTR_EQ                 :       { tagMode }? => '=' ;
>>>
>>> ATTR_VALUE              :       { tagMode }? => ( '"' (~'"')* '"' | '\'' (~'\'')* '\'' ) ;
>>>
>>> PCDATA                  :       { !tagMode }? => (~'<')+ ;
>>>
>>> /* BEGIN: specific tags */
>>> TR                              :       { tagMode }? => 'tr';
>>> TD                              :       { tagMode }? => 'td';
>>> INPUT                   :       { tagMode }? => 'input';
>>> A                               :       { tagMode }? => 'a';
>>> /* END: Specific tags */
>>>
>>> GENERIC_ID      :       { tagMode }? => ( LETTER | '_' | ':') (NAMECHAR)* ;
>>>
>>> fragment NAMECHAR:       LETTER | DIGIT | '.' | '-' | '_' | ':' ;
>>>
>>> fragment DIGIT  :       '0'..'9' ;
>>>
>>> fragment LETTER :       'a'..'z' | 'A'..'Z' ;
>>>
>>> WS                              :       { tagMode }? => (' '|'\r'|'\t'|'\u000C'|'\n') { $channel=99;}   ;
>>>
>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>>
>>
> 


