[antlr-interest] XML island grammar

Matthieu Riou matthieu at offthelip.org
Mon Oct 8 11:33:27 PDT 2007


On 10/8/07, Susan Jolly <easjolly at ix.netcom.com> wrote:
>
> If you don't have a lot of different XML elements, you could let the lexer
> look for "<xyz" rather than "<". Alternatively, could you have "<"
> characters that aren't part of XML tags be escaped with &lt;?


Unfortunately the grammar must accept any XML snippet, and I've dealt with
enough XML to hate the &lt; escaping :)

> Another possibility is to have your main lexer grab an entire XML section
> plus tags and then actually lex that section with another lexer.  You'd use
> something like the following to grab the section:
> XML: '<' ( options {greedy=false;} : . )* '/>';


I can give that a try, although if I have something like:

<foo> <bar> baz </bar> </foo>

wouldn't that match only up to the closing bar element (hence ignoring the
closing foo)?
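
To make the concern concrete, here is a minimal sketch in plain Java. It is
not ANTLR-generated code, and the class and method names (IslandScan,
nonGreedy, balanced) are made up for illustration. It contrasts a non-greedy
scan, which stops at the first terminator it sees regardless of nesting, with
a scan that counts open and close tags. It assumes well-formed tags and ignores
comments, CDATA sections, and '>' inside attribute values.

```java
final class IslandScan {
    // Non-greedy scan: from the first '<' up to the FIRST occurrence of
    // the terminator, with no knowledge of element nesting.
    static String nonGreedy(String input, String terminator) {
        int start = input.indexOf('<');
        if (start < 0) return null;
        int end = input.indexOf(terminator, start + 1);
        if (end < 0) return null; // terminator never appears
        return input.substring(start, end + terminator.length());
    }

    // Depth-tracking scan: from the first '<' through the tag that closes
    // the outermost element. Assumption: every '<' starts a tag.
    static String balanced(String input) {
        int start = input.indexOf('<');
        if (start < 0) return null;
        int depth = 0;
        int i = start;
        while (i < input.length()) {
            if (input.charAt(i) != '<') { i++; continue; }
            int close = input.indexOf('>', i);
            if (close < 0) return null; // unterminated tag
            boolean isEndTag = input.startsWith("</", i);
            boolean isSelfClosing = input.charAt(close - 1) == '/';
            if (isEndTag) depth--;
            else if (!isSelfClosing) depth++;
            if (depth < 0) return null; // end tag without a start tag
            i = close + 1;
            if (depth == 0) return input.substring(start, i);
        }
        return null; // ran out of input before the outer element closed
    }

    public static void main(String[] args) {
        // The nested self-closing tag ends the non-greedy match early.
        System.out.println(nonGreedy("<foo> <bar/> baz </foo>", "/>"));
        // The depth counter keeps going until the outer element closes.
        System.out.println(balanced("x = <foo> <bar> baz </bar> </foo> + 1"));
    }
}
```

With the counting version the closing foo is included, at the cost of the
lexer having to keep state, which a plain non-greedy loop cannot do.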

Thanks!
Matthieu

> The key here is that with ANTLR v3 you can override the emit method in the
> lexer.  See "Emitting More Than One Token per Lexer Rule" on p. 94 of
> Section 4.3 in the ANTLR book. In other words, you don't have to let the
> first lexer emit the whole enchilada as a single token.
>
> The emit method can do anything it wants, including invoking another lexer
> to "re-tokenize".  This is actually simpler than the way v2 handled multiple
> lexers using what it called a "shared input stream" and requiring that the
> main lexer be able to detect just the start of the island as a token.
>
> HTH
>
>
>
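
For reference, the idea of emitting more than one token per rule can be
sketched in plain Java without the ANTLR runtime. Everything below
(QueueingLexerSketch, Tok, retokenize) is hypothetical and only mimics the
pattern: emit() appends to a queue instead of storing a single token, and
nextToken() drains that queue before matching more input. In a real v3 lexer
the corresponding overrides of emit(Token) and nextToken() would presumably
live in the @lexer::members section. The retokenize() method stands in for
the second lexer and assumes a tag contains only a name.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class QueueingLexerSketch {
    // Hypothetical token: a type name plus the matched text.
    static final class Tok {
        final String type;
        final String text;
        Tok(String type, String text) { this.type = type; this.text = text; }
    }

    private final Deque<Tok> pending = new ArrayDeque<>();
    private final String input;
    private int pos = 0;

    QueueingLexerSketch(String input) { this.input = input; }

    // Queue the token instead of overwriting a single "current token".
    void emit(Tok t) { pending.add(t); }

    // Drain queued tokens first; match more input only when the queue is empty.
    Tok nextToken() {
        while (pending.isEmpty()) {
            if (pos >= input.length()) return new Tok("EOF", "");
            matchOne();
        }
        return pending.remove();
    }

    // One "rule invocation": may emit zero, one, or several tokens.
    private void matchOne() {
        char c = input.charAt(pos);
        if (c == '<') {
            int end = input.indexOf('>', pos);
            if (end < 0) { // unterminated tag: hand back the rest as an error
                emit(new Tok("ERROR", input.substring(pos)));
                pos = input.length();
                return;
            }
            String island = input.substring(pos, end + 1);
            pos = end + 1;
            retokenize(island);
        } else if (Character.isWhitespace(c)) {
            pos++; // skipped, nothing emitted
        } else {
            int start = pos;
            while (pos < input.length() && input.charAt(pos) != '<'
                    && !Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            emit(new Tok("WORD", input.substring(start, pos)));
        }
    }

    // Stand-in for a second lexer run over the island text "<...>".
    private void retokenize(String island) {
        emit(new Tok("LT", "<"));
        emit(new Tok("NAME", island.substring(1, island.length() - 1)));
        emit(new Tok("GT", ">"));
    }

    // Convenience helper: all tokens up to EOF as "TYPE:text" strings.
    static List<String> tokenize(String input) {
        QueueingLexerSketch lexer = new QueueingLexerSketch(input);
        List<String> out = new ArrayList<>();
        for (Tok t = lexer.nextToken(); !t.type.equals("EOF"); t = lexer.nextToken()) {
            out.add(t.type + ":" + t.text);
        }
        return out;
    }
}
```

Calling QueueingLexerSketch.tokenize("a <foo> b") yields five tokens, three of
which come from the single island match, so the parser never sees the island
as one opaque token.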