[antlr-interest] Specifying only part of the grammatical structure of an input file
Johannes Luber
jaluber at gmx.de
Tue Sep 2 08:14:11 PDT 2008
Bill Mayfield wrote:
> Hi G.J.,
>
> Thanks for your feedback. I get an UnwantedTokenException and then
> some MismatchedTokenExceptions on the 'garbage' that comes before my
> pattern if I do what you are saying. The 'garbage' that comes after my
> pattern seems to be ignored...
>
> I don't understand how this is a lexical thing though... What I want
> to be able to say is:
>
> garbage_tokens pattern garbage_tokens
>
> where pattern is a certain well-defined structure in the html. I want
> to be able to ignore everything before and after the pattern. Maybe
> I'm wasting my time trying to do this with ANTLR?
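You can use a separate lexer in filter mode (options { filter=true; }).
It has to be a standalone lexer grammar, because combined grammars won't
work with this option. A filter lexer skips any input for which it can't
find a matching token definition, so you can at least skip the tags you
are not interested in.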
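
A minimal sketch of what such a filter lexer could look like for ANTLR 3.
The grammar name and the rule names (RowFilter, TR_OPEN, TR_CLOSE, CELL,
WS) are hypothetical and not taken from the grammar quoted below, and the
sketch assumes lowercase tags and cells that contain plain text only:

```
lexer grammar RowFilter;

options { filter=true; }

// In filter mode the rules are tried in the order listed. Input that
// matches none of them is skipped, which is what discards the
// surrounding "garbage" tags and text.

// Opening <tr>, optionally followed by attributes.
TR_OPEN  : '<tr' (WS (~'>')*)? '>' ;

// Closing </tr>.
TR_CLOSE : '</tr' WS* '>' ;

// A whole cell: opening tag, text up to the next '<', closing tag.
// Assumption: no nested markup inside the cell.
CELL     : '<td' (WS (~'>')*)? '>' (~'<')* '</td' WS* '>' ;

// Helper only; fragment rules never produce tokens on their own.
fragment WS : ' ' | '\t' | '\r' | '\n' ;
```

With something along these lines the token stream should contain only
TR_OPEN, CELL and TR_CLOSE tokens, so a parser rule for the rows would
not need any .* around the pattern. The text of each CELL token still
includes the surrounding <td> tags and would have to be trimmed by
whatever consumes the tokens.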
Johannes
>
> Thanks for helping though!
>
>
> Regards,
> Bill
>
>
> On Tue, Sep 2, 2008 at 4:54 PM, Gerard van de Glind
> <g.vandeglind at beinformed.nl> wrote:
>> Oh yeah,
>>
>> Change rule document into :
>> document : pattern ;
>>
>> Try to separate your lexer and grammar rules.
>> (So, don't do things like .* in a grammar rule.)
>>
>> Cheers, Gerard
>>
>> G.J. van de Glind
>> Software Engineer
>> Be Informed
>>
>>
>>
>>
>>
>> PS: If I change rule document I get no grammar errors but the tokens
>> after my pattern are not expected and I get an exception...
>>
>> document : .* pattern .* ;
>>
>> changed into
>>
>> document : .* pattern ;
>>
>>
>> Kind regards,
>> Bill
>>
>>
>>
>>
>> On Tue, Sep 2, 2008 at 4:41 PM, Bill Mayfield <antlrinterest at gmail.com> wrote:
>>> Hi,
>>>
>>> I want to parse a part of an HTML file in order to extract
>>> information. Take a look at my input file for example:
>>>
>>> <html>
>>> <head>...</head>
>>> <body>
>>> <all><sorts><of><crazy><tags><and><pcdata>
>>>
>>> <tr>
>>> <td>Terence Parr</td>
>>> <td>Knows ANTLR really well!</td>
>>> </tr>
>>> <tr>
>>> <td>Bill Mayfield</td>
>>> <td>Doesn't know ANTLR!</td>
>>> </tr>
>>>
>>>
>>> <all><sorts><of><crazy><tags><and><pcdata>
>>> </body>
>>> </html>
>>>
>>>
>>> So I'm only interested in recognizing the individual <tr></tr> rows in
>>> order to extract the <td> labels. I've written something that is
>>> loosely based on
>>> this article -> http://www.antlr.org/wiki/display/ANTLR3/Parsing+XML
>>>
>>>
>>> Can you tell me how I can make my parser ignore all the crazy tags and
>>> pcdata before the pattern I would like to recognize? This is my
>>> grammar and it's giving me this error: (201) The following
>>> alternatives can never be matched: 1
>>>
>>>
>>>
>>> grammar XMLParser;
>>>
>>> @lexer::members {
>>> Boolean tagMode = false;
>>> }
>>>
>>>
>>>
>>> document : .* pattern .* ;
>>>
>>> pattern : (otr PCDATA?
>>> otd PCDATA?
>>> person=PCDATA
>>> ctd PCDATA?
>>>
>>> otd PCDATA?
>>> comment=PCDATA
>>> ctd PCDATA?
>>> ctr PCDATA?)* ;
>>>
>>>
>>> /* BEGIN: specific tags */
>>> otr : TAG_START_OPEN TR (attribute)* TAG_CLOSE;
>>> ctr : TAG_END_OPEN TR (attribute)* TAG_CLOSE;
>>> otd : TAG_START_OPEN TD (attribute)* TAG_CLOSE;
>>> ctd : TAG_END_OPEN TD (attribute)* TAG_CLOSE;
>>> einput : TAG_START_OPEN INPUT (attribute)* TAG_CLOSE;
>>> oa : TAG_START_OPEN A (attribute)* TAG_CLOSE;
>>> ca : TAG_END_OPEN A (attribute)* TAG_CLOSE;
>>> /* END: specific tags */
>>>
>>> startTag : TAG_START_OPEN GENERIC_ID (attribute)* TAG_CLOSE ;
>>> endTag : TAG_END_OPEN GENERIC_ID TAG_CLOSE ;
>>> emptyElement: TAG_START_OPEN GENERIC_ID (attribute)* TAG_EMPTY_CLOSE ;
>>> attribute : GENERIC_ID ATTR_EQ ATTR_VALUE ;
>>>
>>>
>>>
>>> /*
>>> LEXER RULES
>>> */
>>>
>>> TAG_START_OPEN : '<' { tagMode = true; } ;
>>>
>>> TAG_END_OPEN : '</' { tagMode = true; } ;
>>>
>>> TAG_CLOSE : { tagMode }? => '>' { tagMode = false; } ;
>>>
>>> TAG_EMPTY_CLOSE : { tagMode }? => '/>' { tagMode = false; } ;
>>>
>>> ATTR_EQ : { tagMode }? => '=' ;
>>>
>>> ATTR_VALUE : { tagMode }? => ( '"' (~'"')* '"' | '\'' (~'\'')* '\'' ) ;
>>>
>>> PCDATA : { !tagMode }? => (~'<')+ ;
>>>
>>> /* BEGIN: specific tags */
>>> TR : { tagMode }? => 'tr';
>>> TD : { tagMode }? => 'td';
>>> INPUT : { tagMode }? => 'input';
>>> A : { tagMode }? => 'a';
>>> /* END: Specific tags */
>>>
>>> GENERIC_ID : { tagMode }? => ( LETTER | '_' | ':') (NAMECHAR)* ;
>>>
>>> fragment NAMECHAR: LETTER | DIGIT | '.' | '-' | '_' | ':' ;
>>>
>>> fragment DIGIT : '0'..'9' ;
>>>
>>> fragment LETTER : 'a'..'z' | 'A'..'Z' ;
>>>
>>> WS : { tagMode }? => (' '|'\r'|'\t'|'\u000C'|'\n') { $channel=99;} ;
>>>
>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>>
>>
>