[antlr-interest] parsing unstructured text with xml-type tags.
Bharath Sundararaman
bharath at starthis.com
Mon Jun 7 08:15:27 PDT 2004
Hi Priyank,
>>Here are my problems...
>>1) PLAIN_TEXT consumes everything
In the ANTLR documentation (you can find it on www.antlr.org), there is a
section titled "ANTLR masquerading as SED". You will read about the "greedy"
option and that will help you solve problem #1.
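
To make the greedy issue concrete, here is a small analogy using Python regular expressions rather than ANTLR syntax. It is only a sketch: the variable names are made up for illustration, and the sample text is the template from the original message.

```python
import re

text = "This is a test template created by <mytag>some text here</mytag>."

# Greedy "match anything": swallows the tags as well (this is problem 1).
greedy = re.match(r".+", text).group()

# Bounded: match anything except "<", so the match stops before the first tag.
bounded = re.match(r"[^<]+", text).group()

# Non-greedy: take as few characters as possible, up to "<" or end of input.
lazy = re.match(r".+?(?=<|$)", text).group()

print(repr(greedy))
print(repr(bounded))
print(repr(lazy))
```

The bounded and non-greedy patterns both stop at the first "<", which is the behaviour PLAIN_TEXT needs.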
>>2) since this is unstructured text, how do I know where to stop (EOF)
You mentioned that PLAIN_TEXT consumes everything until it finds a "<" (the
start of a tag). So, if your text doesn't end with a tag, your PLAIN_TEXT
rule should be able to consume the rest of the input. If your text ends with
"</mytag>", your ENDTAG rule will match it and the lexer will stop there. See
this page to understand how EOF works in ANTLR:
http://www.antlr.org/doc/metalang.html (the third paragraph on the first page
talks about EOF).
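
As a rough illustration of the two end-of-input cases (text ending in a tag, and text ending in plain text), here is a hand-written Python scanner. It is not ANTLR output: the function name split_text_and_tags and the explicit "EOF" token are assumptions made for this sketch.

```python
def split_text_and_tags(text):
    """Split text into PLAIN_TEXT / STARTTAG / ENDTAG tokens, then EOF."""
    tokens = []
    pos = 0
    while pos < len(text):
        lt = text.find("<", pos)
        if lt == -1:
            # No more tags: the rest of the input is plain text.
            tokens.append(("PLAIN_TEXT", text[pos:]))
            break
        if lt > pos:
            tokens.append(("PLAIN_TEXT", text[pos:lt]))
        gt = text.find(">", lt)
        if gt == -1:
            # Unterminated tag: treat the remainder as plain text.
            tokens.append(("PLAIN_TEXT", text[lt:]))
            break
        tag = text[lt:gt + 1]
        kind = "ENDTAG" if tag.startswith("</") else "STARTTAG"
        tokens.append((kind, tag))
        pos = gt + 1
    # Running out of characters is what ends the token stream.
    tokens.append(("EOF", ""))
    return tokens
```

Either way the loop ends when the input runs out, so no special end marker is needed in the text itself.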
>>3) I also need to support HTML tags in the PLAIN_TEXT (I have to consume
>>them in PLAIN_TEXT)
Basically, this is a subset of your original problem. If PLAIN_TEXT is of
the form:
This is <b>plain</b> text.
then it is similar to your original template, except that it has HTML tags
embedded in it instead of XML tags.
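
One possible way to handle this, sketched in Python rather than ANTLR, is to keep a list of tag names that count as HTML and fold those tags back into the surrounding plain text. The HTML_TAGS set, the TAG_RE pattern and the tokenize function are all hypothetical names chosen for this sketch, and the pattern only handles tags without attributes.

```python
import re

# Hypothetical list of tag names to treat as ordinary text.
HTML_TAGS = {"b", "i", "u", "em", "strong", "br", "p"}

# Matches <name> or </name>; attributes are not handled in this sketch.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w-]*)\s*>")

def tokenize(text):
    tokens = []
    buf = []  # pieces of the PLAIN_TEXT token currently being built
    pos = 0

    def flush():
        chunk = "".join(buf)
        buf.clear()
        if chunk:
            tokens.append(("PLAIN_TEXT", chunk))

    for m in TAG_RE.finditer(text):
        buf.append(text[pos:m.start()])
        pos = m.end()
        if m.group(2).lower() in HTML_TAGS:
            # HTML tag: keep it as part of the plain text.
            buf.append(m.group())
        else:
            # Any other tag ends the current PLAIN_TEXT token.
            flush()
            kind = "ENDTAG" if m.group(1) else "STARTTAG"
            tokens.append((kind, m.group()))
    buf.append(text[pos:])
    flush()
    return tokens
```

With this approach "This is <b>plain</b> text." comes out as a single PLAIN_TEXT token, while <mytag> and </mytag> are still reported as STARTTAG and ENDTAG.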
__________
I'm an ANTLR newbie too. I had similar problems, and reading these
sections helped me solve them and understand how EOF works. I hope my
pointers help you find your answers.
Bharath
~ Give me the tool and I shall move the earth ~
-----Original Message-----
From: Priyank RASTOGI [mailto:priyank at osellus.com]
Sent: Monday, June 07, 2004 5:05 AM
To: antlr-interest at yahoogroups.com
Subject: [antlr-interest] parsing unstructured text with xml-type tags.
Hi,
Sorry if this is very trivial but I could not find a solution in any
of the examples.
I am writing a parser for a template that contains unstructured text
with embedded XML-type tags. So one example of such a template is
This is a test template created by <mytag>some text here</mytag>.
Thanks for your time.
In the lexer, I have defined rules like
PLAIN_TEXT: consume whatever you see till < tag
STARTTAG: copied from xml parser in examples
ENDTAG: copied from xml parser in examples
So the stream of tokens I am expecting is
PLAIN_TEXT STARTTAG PLAIN_TEXT ENDTAG
Here are my problems...
1) PLAIN_TEXT consumes everything
2) since this is unstructured text, how do I know where to stop (EOF)
3) I also need to support HTML tags in the PLAIN_TEXT (I have to
consume them in PLAIN_TEXT)
I am stuck on how to go about it.
Any pointers would be greatly appreciated.
Thanks
Priyank