[antlr-interest] XML island grammar

David Holroyd dave at badgers-in-foil.co.uk
Mon Oct 8 14:58:07 PDT 2007


On Mon, Oct 08, 2007 at 11:33:27AM -0700, Matthieu Riou wrote:
> On 10/8/07, Susan Jolly <easjolly at ix.netcom.com> wrote:
> > If you don't have a lot of different XML elements, you could let the lexer
> > look for "<xyz" rather than "<". Alternatively, could you have "<"
> > characters that aren't part of XML tags be escaped with &lt;?
> 
> 
> Unfortunately the grammar must accept any XML snippet and I've had enough of
> XML to hate the &lt; escaping :)
> 
> > Another possibility is to have your main lexer grab an entire XML section
> > plus tags and then actually lex that section with another lexer.  You'd use
> > something like the following to grab the section:
> > XML: '<' ( options {greedy=false;} : . )* '/>';
> 
> 
> I can give that a try although if I have something like:
> 
> <foo> <bar> baz </bar> </foo>
> 
> wouldn't that match only up to the closing bar element (hence ignoring the
> closing foo)?

Indeed.  So, the next step along this path is to basically push the
entire 'parser' into the lexer.

  XML: XML_START XML_BODY? XML_END | XML_EMPTY;
  ...

etc.  That was the approach used when I originally ported someone else's
AS3 grammar from ANTLRv2 to v3, and it didn't work for me (quite apart
from the fact that real E4X allows embedded expressions from the outer
language, which could contain string literals, which might contain
stuff that *looks* like XML, dragons, etc.).
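
To make the nesting problem concrete, here is a rough Python sketch (not
ANTLR, and not the grammar I ported).  NON_GREEDY is the quoted
greedy=false rule rewritten as a regex; scan_xml is a hypothetical
depth-counting scanner, roughly what an XML_START XML_BODY? XML_END rule
has to achieve.  The TAG pattern is an assumption of mine: it ignores
comments, CDATA, namespaces, '>' inside attribute values and E4X embedded
expressions.

```python
import re

# The quoted rule  '<' ( options {greedy=false;} : . )* '/>'  as a regex.
NON_GREEDY = re.compile(r"<.*?/>", re.DOTALL)

# Hypothetical, simplified tag pattern (assumption, see the note above).
# group(1) is '/' for a closing tag, group(3) is '/' for an empty element.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")


def scan_xml(text, start=0):
    """Return the index just past the balanced element that starts at
    text[start], or None if no balanced element starts there."""
    depth = 0
    pos = start
    while True:
        # The first tag must sit exactly at `start`; later tags may be
        # preceded by character data, so search forward for them.
        m = TAG.match(text, pos) if depth == 0 else TAG.search(text, pos)
        if m is None:
            return None
        closing, empty = m.group(1), m.group(3)
        if closing:
            depth -= 1
        elif not empty:
            depth += 1
        if depth < 0:
            return None  # closing tag with nothing open
        pos = m.end()
        if depth == 0:
            return pos


snippet = "<foo> <bar> baz </bar> </foo>"
nested = "<a><b/></a>"
```

The regex form stops at the first '/>' it sees, so on "<a><b/></a>" it
yields only "<a><b/>", and on the <foo> example it finds nothing at all,
since that input contains no '/>'.  The depth counter consumes both
inputs whole, but only under the simplifications listed above.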

At the time, I decided that this approach can't actually work, due to
the way ANTLR's lexers operate.  e.g. on the first character of the
input '<xyz', the lexer can see that either the XML or LESS_THAN tokens
might consume the input, but as soon as it sees the second letter, the
lexer decides that XML will create the longest token, so that rule wins
even though we actually did want the match to be (LESS_THAN, IDENT).

I might have got that wrong though :)
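
For what it's worth, the longest-token outcome described above can be
mimicked with a toy tokenizer in Python.  This is only a sketch of the
behaviour as I understood it, not how ANTLR's lexer is actually
implemented; the rule names follow this post, but the patterns
(including the flat XML one and the added GREATER_THAN rule) are
assumptions for illustration.

```python
import re

# Hypothetical token rules.  Order only matters for breaking ties.
RULES = [
    ("XML", re.compile(r"<[A-Za-z_]\w*[^<>]*>")),
    ("LESS_THAN", re.compile(r"<")),
    ("GREATER_THAN", re.compile(r">")),
    ("IDENT", re.compile(r"[A-Za-z_]\w*")),
    ("WS", re.compile(r"\s+")),
]


def tokenize(text):
    """Longest-match tokenizer: at each position, the rule that consumes
    the most input wins.  Whitespace tokens are dropped."""
    pos = 0
    tokens = []
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError("no rule matches at offset %d" % pos)
        name, m = best
        if name != "WS":
            tokens.append((name, m.group()))
        pos = m.end()
    return tokens
```

With these rules, tokenize("a < b") gives IDENT, LESS_THAN, IDENT as
hoped, but tokenize("a <b> c") gives IDENT, XML, IDENT: the XML rule
swallows "<b>" because it is the longer match, even if the parser
wanted (LESS_THAN, IDENT, GREATER_THAN) at that point.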


ta,
dave

-- 
http://david.holroyd.me.uk/

