[antlr-interest] Doing translation in the lexer (the right tool for the job?)
antlr at jazillian.com
Wed Jun 20 06:32:01 PDT 2007
Have you looked into Tidy (http://tidy.sourceforge.net/)
and JTidy (http://jtidy.sourceforge.net/)?
I'd start with one of these and keep going.
Doesn't sound like ANTLR is the right tool for this.
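Tag-soup cleanup is exactly Tidy's job: it balances mismatched,
unclosed and misnested tags for you. If you can get from wikitext to
rough HTML by any means, JTidy can handle the repair pass. A minimal
sketch (the Tidy setters below are real JTidy API; the input string is
just an illustration):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.w3c.tidy.Tidy;

public class CleanupDemo {
    public static void main(String[] args) throws Exception {
        // The kind of malformed input described below: improperly
        // interleaved and unclosed tags.
        String dirty = "<em><strong>foo</em> bar";

        Tidy tidy = new Tidy();
        tidy.setXHTML(true);          // emit well-formed XHTML
        tidy.setQuiet(true);          // suppress the version banner
        tidy.setShowWarnings(false);  // malformed input is expected

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        tidy.parse(new ByteArrayInputStream(dirty.getBytes("UTF-8")), out);
        System.out.println(out.toString("UTF-8"));
    }
}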
Wincent Colaiuta wrote:
> One of the many translation tasks I'd like to use ANTLR for is
> translating wikitext to HTML. Two things make this quite difficult:
> 1. The translator has to be extremely tolerant of malformed input
> (users may not understand the wikitext syntax or may deliberately
> subvert it) and make reasonable "cleanup" attempts for bad input (such
> as closing tags that were left open by the user).
> 2. Wikitext structure is not unlike HTML structure, which means it is
> highly recursive and is very sensitive to nesting and ordering issues.
> This means that the translator must be able to handle things like:
> - improperly interleaved tags: <em><strong></em></strong>
> - missing closing tags: <em>foo
> - missing opening tags: foo</em>
> - mismatched tags (really just a combination of the above): <em>foo</strong>
> And it must enforce scoping rules that can be quite complicated at
> times; for example:
> - an <em> span may appear inside a <strong> span;
> - and a <strong> span may appear inside an <em> span;
> - but an <em> span may not appear inside an <em> span, so both
> <em><em>foo</em></em> and <em><strong><em>foo</em></strong></em> are
> invalid;
> - a <blockquote> may contain other block-level elements such as <p>
> and even another <blockquote>;
> - but <p> may not contain <blockquote>;
> - etc, etc, etc...
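> To make that concrete, the direct-containment part of the rules can
> be pictured as a table (illustrative Java, not code from my
> translator):
>
> import java.util.*;
>
> class ScopingRules {
>     // Each element maps to the children it may directly contain.
>     static final Map<String, Set<String>> ALLOWED =
>         new HashMap<String, Set<String>>();
>     static {
>         ALLOWED.put("blockquote", set("p", "blockquote"));
>         ALLOWED.put("p",          set("em", "strong")); // no block elements
>         ALLOWED.put("em",         set("strong"));
>         ALLOWED.put("strong",     set("em"));
>     }
>     private static Set<String> set(String... tags) {
>         return new HashSet<String>(Arrays.asList(tags));
>     }
> }
>
> But a flat table like this can't capture the "no <em> anywhere inside
> an <em>" rule, because that constraint reaches through intervening
> elements; that's what pushes the design towards a stack.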
> So if it weren't for the fact that the translator has to gracefully
> handle bad input the task would be significantly easier. In other
> words, if ANTLR could assume that all input were perfectly formed and
> just throw errors on encountering bad input then it wouldn't be too
> hard. But even assuming syntactically perfect input there is still the
> issue of the complicated scoping rules.
> My first naïve attempt at doing this had me trying to express scope
> restrictions using only standard parser rules, but I very quickly got
> into trouble with recursion due to non-LL(*) decisions and
> ambiguity warnings.
> So I then tried using predicates combined with a stack stored in a
> global scope. Basically, on entering a new rule, information would be
> pushed onto the stack describing what tags were explicitly allowed and
> what tags were explicitly disallowed. Semantic predicates could then
> be used in subsequent rules to check the stack to see whether a
> particular alternative was allowed to match.
> For example, on entering a <blockquote>, the list of allowed tags is
> basically any block-level tag (<p>, <blockquote> etc) so those are
> pushed as a list onto the stack. On entering a <p>, the list of
> allowed tags is basically any inline element (<em>, <strong> etc). The
> predicate's job is then basically to walk the stack and confirm that
> (a) a given tag is explicitly allowed, and (b) it is not explicitly
> disallowed in any previous level of the stack.
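> In code terms the check looks something like this (simplified, with
> invented names; in the grammar it would be consulted from a semantic
> predicate such as {scopes.permits("em")}?):
>
> import java.util.*;
>
> class TagScopes {
>     // One frame per open element: the tags explicitly allowed and
>     // the tags explicitly disallowed at that level.
>     static class Frame {
>         final Set<String> allowed = new HashSet<String>();
>         final Set<String> disallowed = new HashSet<String>();
>     }
>
>     private final Stack<Frame> frames = new Stack<Frame>();
>
>     void enter(Frame f) { frames.push(f); }
>     void leave()        { frames.pop(); }
>
>     // (a) the tag is explicitly allowed at the current level, and
>     // (b) no enclosing level explicitly disallows it.
>     boolean permits(String tag) {
>         if (frames.isEmpty()) return true;
>         if (!frames.peek().allowed.contains(tag)) return false;
>         for (Frame f : frames) {
>             if (f.disallowed.contains(tag)) return false;
>         }
>         return true;
>     }
> }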
> I soon found that the predicates were being hoisted/lowered into
> unexpected places, which prevented that idea from working. Perhaps
> someone with more intimate knowledge of how ANTLR does this would be
> able to make the solution work, but I didn't see much hope of getting
> it to work so I abandoned it. I also wasn't even sure that my stack of
> allowed/disallowed tags was a good design, even in the abstract sense.
> So finally, the thing I'm doing is jettisoning the parser entirely and
> doing absolutely everything in the lexer. Ok, not exactly: I've made
> an extremely simple filtering lexer which has no predicates and no
> actions; it just spits out tokens and has a catch-all rule at the bottom
> so that all input characters are guaranteed to come out as tokens on
> the other end.
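> In other words, the lexer's public face is nothing more than a token
> pump (a sketch; "WikiLexer" is the generated lexer, ANY : . ; is the
> catch-all rule, and handle() stands in for the translation logic
> described next):
>
> import org.antlr.runtime.ANTLRStringStream;
> import org.antlr.runtime.Token;
>
> public class Translator {
>     public String translate(String wikitext) {
>         WikiLexer lexer = new WikiLexer(new ANTLRStringStream(wikitext));
>         StringBuilder html = new StringBuilder();
>         for (Token t = lexer.nextToken();
>              t.getType() != Token.EOF;
>              t = lexer.nextToken()) {
>             handle(t, html);
>         }
>         return html.toString();
>     }
>
>     private void handle(Token t, StringBuilder out) {
>         // dispatch on t.getType(), appending translated output
>     }
> }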
> Then in my support code I ask the lexer for tokens one at a time and
> perform translation "by hand". Once again I maintain a stack, but this
> time a slightly simpler one. Instead of maintaining a list of allowed
> and disallowed tags at each level I instead maintain a stack which
> indicates my current scope, kind of like a CSS selector: i.e. if I've
> previously seen <blockquote>, <p> and <em> then my stack contains
> "blockquote", "p" and "em"; if I see another "<em>" I know I can't
> match it because that's already on the stack, and on seeing "</em>" I
> pop it off the stack. In this way I can basically ignore unexpected
> tags (well, actually I output them literally as "<em>" etc) and
> insert missing tags when I think I should have seen them.
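> Concretely, the open/close handling amounts to something like this
> (simplified, with invented names; escaping the rejected tags is just
> one way to "output them literally"):
>
> import java.util.*;
>
> class ScopeStack {
>     private final LinkedList<String> open = new LinkedList<String>();
>
>     // Opening tag: reject a tag that is already open anywhere above
>     // us (the "no <em> inside <em>" rule) by emitting it as text.
>     String openTag(String tag) {
>         if (open.contains(tag)) {
>             return "&lt;" + tag + "&gt;";
>         }
>         open.addFirst(tag);
>         return "<" + tag + ">";
>     }
>
>     // Closing tag: pop back to the matching open tag, closing
>     // anything left open in between; a close that was never opened
>     // is emitted as text.
>     String closeTag(String tag) {
>         if (!open.contains(tag)) {
>             return "&lt;/" + tag + "&gt;";
>         }
>         StringBuilder out = new StringBuilder();
>         String top;
>         do {
>             top = open.removeFirst();
>             out.append("</").append(top).append(">");
>         } while (!top.equals(tag));
>         return out.toString();
>     }
>
>     // End of input: close whatever the user left open.
>     String closeAll() {
>         StringBuilder out = new StringBuilder();
>         while (!open.isEmpty()) {
>             out.append("</").append(open.removeFirst()).append(">");
>         }
>         return out.toString();
>     }
> }
>
> Fed "<em><strong>foo</em>", this emits "<em><strong>foo</strong></em>"
> once closeAll() runs, which is exactly the cleanup behaviour I want.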
> All of this manual management is a bit of effort, but it does work
> and that's very important: the code is simple enough to look at and
> see that it will work. If I tried to do all of this in ANTLR then
> there would be an element of black magic; I'd never feel sure that I
> knew what ANTLR was really doing behind the scenes. But I am left with
> the lingering doubt that I might not be doing things the best way. Do
> you think that this lexer-only-plus-supporting-code approach is
> using the right tool for the job? Or am I missing something obvious?