[antlr-interest] Doing translation in the lexer (the right tool for the job?)

Wed Jun 20 06:32:01 PDT 2007

Have you looked into Tidy http://tidy.sourceforge.net/
and JTidy http://jtidy.sourceforge.net/   ?

I'd start with one of these and keep going.
Doesn't sound like ANTLR is the right tool for this.

Andy

Wincent Colaiuta wrote:
> One of the many translation tasks I'd like to use ANTLR for is 
> translating wikitext to HTML. Two things make this quite difficult:
>
> 1. The translator has to be extremely tolerant of malformed input 
> (users may not understand the wikitext syntax or may deliberately 
> subvert it) and make reasonable "cleanup" attempts for bad input (such 
> as closing tags that were left open by the user).
>
> 2. Wikitext structure is not unlike HTML structure, which means it is 
> highly recursive and is very sensitive to nesting and ordering issues.
>
> This means that the translator must be able to handle things like:
>
> - improperly interleaved tags: <em><strong></em></strong>
>
> - missing closing tags: <em>foo
>
> - missing opening tags: foo</em>
>
> - mismatched tags (really just a combination of the above): 
> <em>foo</strong>
>
> And it must enforce scoping rules that can be quite complicated at 
> times; for example:
>
> - an <em> span may appear inside a <strong> span;
>
> - and a <strong> span may appear inside an <em> span;
>
> - but an <em> span may not appear inside an <em> span, so both 
> <em><em>foo</em></em> and <em><strong><em>foo</em></strong></em> are 
> invalid
>
> - a <blockquote> may contain other block level elements such as <p> 
> and even another <blockquote>;
>
> - but <p> may not contain <blockquote>
>
> - etc, etc, etc...
>
> So if it weren't for the fact that the translator has to gracefully 
> handle bad input the task would be significantly easier. In other 
> words, if ANTLR could assume that all input were perfectly formed and 
> just throw errors on encountering bad input then it wouldn't be too 
> hard. But even assuming syntactically perfect input there is still the 
> issue of the complicated scoping rules.
>
> My first naïve attempt at doing this had me trying to express scope 
> restrictions using only standard parser rules, but I very quickly got 
> into trouble with recursion due to non-LL(*) decisions and 
> non-determinisms.
>
> So I then tried using predicates combined with a stack stored in a 
> global scope. Basically, on entering a new rule, information would be 
> pushed onto the stack describing what tags were explicitly allowed and 
> what tags explicitly disallowed. Semantic predicates could then be 
> used in subsequent rules to check the stack to see if a particular 
> alternative to match.
>
> For example, on entering a <blockquote>, the list of allowed tags is 
> basically any block-level tag (<p>, <blockquote> etc) so those are 
> pushed as a list onto the stack. On entering a <p>, the list of 
> allowed tags is basically any inline element (<em>, <strong> etc). The 
> predicate's job is then basically to walk the stack and confirm that 
> (a) a given tag is explicitly allowed, and (b) it is not explicitly 
> disallowed in any previous level of the stack.
>
> I soon found that the predicates were being hoisted/lowered into 
> unexpected places which prevented that idea from working. Perhaps 
> someone with more intimate knowledge of how ANTLR does this would be 
> able to make the solution work, but I didn't see much hope of getting 
> it to work so I abandoned it. I also wasn't even sure that my stack of 
> allowed/disallowed tags was a good design, even in the abstract sense.
>
> So finally, the thing I'm doing is jettisoning the parser entirely and 
> doing absolutely everything in the lexer. Ok, not exactly: I've made 
> an extremely simple filtering lexer which has no predicates, no 
> actions, just spits out tokens and has a catch-all rule at the bottom 
> so that all input characters are guaranteed to come out as tokens on 
> the other end.
>
> Then in my support code I ask the lexer for tokens one at a time and 
> perform translation "by hand". Once again I maintain a stack, but this 
> time a slightly simpler one. Instead of maintaining a list of allowed 
> and disallowed tags at each level I instead maintain a stack which 
> indicates my current scope, kind of like a CSS selector: ie. if I've 
> previously seen <blockquote>, <p> and <em> then my stack contains 
> "blockquote", "p" and "em"; if I see another "<em>" I know I can't 
> match it because that's already on the stack, and on seeing "</em>" I 
> pop it off the stack. In this way I can basically ignore unexpected 
> tags (well, actually I output them literally as "&lt;em&gt;" etc) and 
> insert missing tags when I think I should have seen them.
>
> All of this manually management is a bit of effort, but it does work 
> and that's very important: the code is simple enough to look at and 
> see that it will work. If I tried to do all of this in ANTLR then 
> there would be an element of black magic; I'd never feel sure that I 
> knew what ANTLR was really doing behind the scenes. But I am left with 
> the lingering doubt that I might not be doing things the best way. Do 
> you think that this lexer-only-plus-supporting-code approach is the 
> using the right tool for the job? Or am I missing something obvious?
>
> Cheers,
> Wincent
>
>