[antlr-interest] parsing unstructured text with xml-type tags.

Bharath Sundararaman bharath at starthis.com
Mon Jun 7 08:15:27 PDT 2004


Hi Priyank,

>>Here are my problems...
>>1) PLAIN_TEXT consumes everything

The ANTLR documentation (you can find it on www.antlr.org) has a section
titled "ANTLR masquerading as SED". There you will read about the "greedy"
option, which should help you solve problem #1.
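
To show the behaviour you are after for problem #1, here is a minimal
sketch in plain Python (not ANTLR grammar syntax, and not taken from the
ANTLR documentation). The pattern and the tokenize() helper are my own
hypothetical names. It assumes tags carry no attributes. The point is only
that PLAIN_TEXT is restricted to characters other than "<", so it has to
stop where a tag begins:

```python
import re

# Hypothetical token pattern (an illustration, not ANTLR syntax).
# Tag alternatives are tried first; PLAIN_TEXT may not contain '<',
# so it cannot run past the start of a tag.
TOKEN_RE = re.compile(
    r"(?P<ENDTAG></[A-Za-z][\w.-]*\s*>)"
    r"|(?P<STARTTAG><[A-Za-z][\w.-]*\s*>)"
    r"|(?P<PLAIN_TEXT>[^<]+)"
)


def tokenize(text):
    """Return a list of (token_type, matched_text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            # e.g. a stray '<' that does not start a well-formed tag
            raise ValueError(f"unexpected input at offset {pos}: {text[pos]!r}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


sample = ("This is a test template created by "
          "<mytag>some text here</mytag>. Thanks for your time.")
for kind, value in tokenize(sample):
    print(kind, repr(value))
```

Note that your sample template has text after the closing tag, so the
stream ends with one more PLAIN_TEXT after ENDTAG.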

>> 2) since this is unstructured text, how do I know where to stop (EOF)

You mentioned that PLAIN_TEXT consumes everything until it finds a "<"
(the start of a tag). So, if your text doesn't end with a tag, PLAIN_TEXT
should be able to eat the rest of your input. If your text ends with a
"</mytag>", your ENDTAG rule will match it and the lexer will stop there.
Look at this page to understand how EOF works in ANTLR:
http://www.antlr.org/doc/metalang.html (the third paragraph on the first
page talks about EOF).
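
A rough Python sketch of the two endings described above (again my own
illustration with a hypothetical split_on_tags() helper, not how ANTLR
implements EOF). Whether the input ends in a tag or in plain text, the
scanner simply runs out of characters and then reports an explicit EOF
marker:

```python
def split_on_tags(text):
    """Yield (kind, value) pairs, finishing with an explicit EOF marker."""
    pos = 0
    while pos < len(text):
        if text[pos] == "<":
            end = text.find(">", pos)
            if end == -1:
                raise ValueError("unterminated tag")
            value = text[pos:end + 1]
            kind = "ENDTAG" if value.startswith("</") else "STARTTAG"
            yield kind, value
            pos = end + 1
        else:
            nxt = text.find("<", pos)
            if nxt == -1:
                nxt = len(text)  # no more tags: text runs to end of input
            yield "PLAIN_TEXT", text[pos:nxt]
            pos = nxt
    # Nothing left to read: this is where the token stream stops.
    yield "EOF", ""


print(list(split_on_tags("created by <mytag>me</mytag>")))
print(list(split_on_tags("created by <mytag>me</mytag>. Thanks.")))
```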

>>3) I also need to support HTML tags in the PLAIN_TEXT (I have to consume
them in PLAIN_TEXT)

Basically, this is a subset of your original problem. Suppose PLAIN_TEXT is
of the form:

This is <b>plain</b> text.

This is similar to your original template, except that the tags embedded in
it are HTML tags instead of XML tags.
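
One possible way to handle this, sketched in Python under my own
assumptions: keep a set of tag names that belong to the template (here the
hypothetical TEMPLATE_TAGS containing only "mytag") and fold every other
tag back into the surrounding PLAIN_TEXT. The names below are made up for
illustration, and tags with attributes are not handled:

```python
import re

# Assumption: only these tag names are template tags; anything else
# (e.g. <b>, <i>) is treated as ordinary text.
TEMPLATE_TAGS = {"mytag"}

# Matches <name> or </name> without attributes (a simplification).
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w.-]*)\s*>")


def tokenize(text):
    """Return (kind, value) pairs; non-template tags stay in PLAIN_TEXT."""
    tokens = []
    buf = []  # pieces of the PLAIN_TEXT token being accumulated

    def flush():
        value = "".join(buf)
        buf.clear()
        if value:
            tokens.append(("PLAIN_TEXT", value))

    pos = 0
    for m in TAG_RE.finditer(text):
        buf.append(text[pos:m.start()])  # text before this tag
        if m.group(2) in TEMPLATE_TAGS:
            flush()
            kind = "ENDTAG" if m.group(1) else "STARTTAG"
            tokens.append((kind, m.group()))
        else:
            buf.append(m.group())  # HTML tag: keep it as plain text
        pos = m.end()
    buf.append(text[pos:])  # trailing text after the last tag
    flush()
    return tokens


print(tokenize("This is <b>plain</b> text."))
print(tokenize("By <mytag>some <i>text</i></mytag>."))
```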

__________

I'm an ANTLR newbie too. I had similar problems, and reading these sections
helped me solve them and understand how EOF works. I hope my pointers help
you find your answers.

Bharath
~ Give me the tool and I shall move the earth ~





-----Original Message-----
From: Priyank RASTOGI [mailto:priyank at osellus.com] 
Sent: Monday, June 07, 2004 5:05 AM
To: antlr-interest at yahoogroups.com
Subject: [antlr-interest] parsing unstructured text with xml-type tags.


Hi,

Sorry if this is very trivial but I could not find a solution in any 
of the examples.

I am writing a parser for a template that contains unstructured text 
with embedded XML-type tags. So one example of such a template is

This is a test template created by <mytag>some text here</mytag>. 
Thanks for your time.

In the lexer, I have defined rules like 
PLAIN_TEXT: consume whatever you see till < tag
STARTTAG: copied from xml parser in examples
ENDTAG: copied from xml parser in examples

So the stream of tokens I am expecting is
PLAIN_TEXT STARTTAG PLAIN_TEXT ENDTAG

Here are my problems...
1) PLAIN_TEXT consumes everything
2) since this is unstructured text, how do I know where to stop (EOF)
3) I also need to support HTML tags in the PLAIN_TEXT (I have to 
consume them in PLAIN_TEXT)

I am stuck on how to go about it.

Any pointers would be greatly appreciated.

Thanks
Priyank




 
 