[antlr-interest] Doing translation in the lexer (the right tool for the job?)

Wincent Colaiuta win at wincent.com
Tue Jun 19 23:57:15 PDT 2007


One of the many translation tasks I'd like to use ANTLR for is  
translating wikitext to HTML. Two things make this quite difficult:

1. The translator has to be extremely tolerant of malformed input  
(users may not understand the wikitext syntax or may deliberately  
subvert it) and make reasonable "cleanup" attempts for bad input  
(such as closing tags that were left open by the user).

2. Wikitext structure is not unlike HTML structure, which means it is  
highly recursive and is very sensitive to nesting and ordering issues.

This means that the translator must be able to handle things like:

- improperly interleaved tags: <em><strong></em></strong>

- missing closing tags: <em>foo

- missing opening tags: foo</em>

- mismatched tags (really just a combination of the above): <em>foo</strong>

And it must enforce scoping rules that can be quite complicated at  
times; for example (one way of encoding these is sketched just after  
this list):

- an <em> span may appear inside a <strong> span;

- and a <strong> span may appear inside an <em> span;

- but an <em> span may not appear inside an <em> span, so both  
<em><em>foo</em></em> and <em><strong><em>foo</em></strong></em> are  
invalid

- a <blockquote> may contain other block level elements such as <p>  
and even another <blockquote>;

- but <p> may not contain <blockquote>

- etc, etc, etc...
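
At first glance these rules look like a simple parent-to-child  
whitelist. As a hedged sketch (Java; only the relationships spelled  
out above are filled in, and the class and method names are invented  
for illustration):

    import java.util.*;

    // Hypothetical encoding of the scoping rules as a parent -> allowed
    // children whitelist; a real table would cover every element.
    final class ScopingRules {
        static final Map<String, Set<String>> ALLOWED =
            new HashMap<String, Set<String>>();
        static {
            ALLOWED.put("blockquote", set("p", "blockquote")); // blocks OK
            ALLOWED.put("p", set("em", "strong")); // inline only, no blocks
            ALLOWED.put("em", set("strong"));      // but never another <em>
            ALLOWED.put("strong", set("em"));
        }
        private static Set<String> set(String... names) {
            return new HashSet<String>(Arrays.asList(names));
        }
    }

But a flat table like this can't capture the transitive <em> rule:  
<em><strong><em>foo</em></strong></em> is invalid even though each  
individual parent/child pairing looks fine, which is why some kind of  
stack enters the picture below.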

So if it weren't for the fact that the translator has to gracefully  
handle bad input, the task would be significantly easier. In other  
words, if ANTLR could assume that all input were perfectly formed and  
just throw errors on encountering bad input, then it wouldn't be too  
hard. But even assuming syntactically perfect input there is still  
the issue of the complicated scoping rules.

My first naïve attempt at doing this had me trying to express scope  
restrictions using only standard parser rules, but I very quickly got  
into trouble with recursion due to non-LL(*) decisions and  
non-determinisms.

So I then tried using predicates combined with a stack stored in a  
global scope. Basically, on entering a new rule, information would be  
pushed onto the stack describing which tags were explicitly allowed  
and which were explicitly disallowed. Semantic predicates could then  
be used in subsequent rules to check the stack and decide whether a  
particular alternative should be allowed to match.

For example, on entering a <blockquote>, the list of allowed tags is  
basically any block-level tag (<p>, <blockquote> etc.) so those are  
pushed as a list onto the stack. On entering a <p>, the list of  
allowed tags is basically any inline element (<em>, <strong> etc.).  
The predicate's job is then to walk the stack and confirm that (a) a  
given tag is explicitly allowed, and (b) it is not explicitly  
disallowed at any previous level of the stack.
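
In sketch form, the idea looked something like this (ANTLR v3 syntax  
with the Java target; the rule and scope names are invented, and the  
remaining rules and lexer tokens are assumed to follow the same  
pattern):

    grammar Wikitext;

    // One frame is pushed per structural rule; ANTLR manages the stack.
    scope TagRules {
      java.util.Set allowed;     // tags explicitly allowed at this level
      java.util.Set disallowed;  // tags banned here and at any nested level
    }

    @members {
      java.util.Set tags(String[] names) {
        return new java.util.HashSet(java.util.Arrays.asList(names));
      }
      // (a) the tag must be explicitly allowed at the current level, and
      // (b) it must not be disallowed at any enclosing level.
      // TagRules_scope/TagRules_stack are the names the Java target
      // generates for a global scope.
      boolean tagAllowed(String tag) {
        if (TagRules_stack.isEmpty()) return true;
        TagRules_scope top = (TagRules_scope)TagRules_stack.peek();
        if (!top.allowed.contains(tag)) return false;
        for (int i = 0; i < TagRules_stack.size(); i++) {
          TagRules_scope frame = (TagRules_scope)TagRules_stack.elementAt(i);
          if (frame.disallowed.contains(tag)) return false;
        }
        return true;
      }
    }

    em
    scope TagRules;
    @init {
      $TagRules::allowed    = tags(new String[] {"strong"});
      $TagRules::disallowed = tags(new String[] {"em"}); // no <em> at any depth
    }
      : EM_OPEN inline* EM_CLOSE ;

    // strong, p, blockquote etc. each push an analogous frame.
    inline
      : {tagAllowed("em")}?     em
      | {tagAllowed("strong")}? strong
      | TEXT
      ;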

I soon found that the predicates were being hoisted/lowered into  
unexpected places which prevented that idea from working. Perhaps  
someone with more intimate knowledge of how ANTLR does this would be  
able to make the solution work, but I didn't see much hope of getting  
it to work, so I abandoned it. I also wasn't sure that my stack of  
allowed/disallowed tags was a good design in the abstract.

So finally, the thing I'm doing is jettisoning the parser entirely  
and doing absolutely everything in the lexer. Ok, not exactly: I've  
made an extremely simple filtering lexer which has no predicates, no  
actions, just spits out tokens and has a catch-all rule at the bottom  
so that all input characters are guaranteed to come out as tokens on  
the other end.
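
To give the flavour of it, the whole lexer boils down to something  
like this (ANTLR v3; the token names are invented and the real rule  
set is longer):

    lexer grammar Wikitext;

    EM_OPEN      : '<em>' ;
    EM_CLOSE     : '</em>' ;
    STRONG_OPEN  : '<strong>' ;
    STRONG_CLOSE : '</strong>' ;
    // ... one rule per recognized piece of markup ...

    // Catch-all: any single character not matched by a rule above comes
    // out as its own token, so no input is ever silently dropped.
    ANY          : . ;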

Then in my support code I ask the lexer for tokens one at a time and  
perform the translation "by hand". Once again I maintain a stack, but  
this time a slightly simpler one: instead of tracking lists of  
allowed and disallowed tags at each level, the stack just records my  
current scope, kind of like a CSS selector. That is, if I've  
previously seen <blockquote>, <p> and <em> then my stack contains  
"blockquote", "p" and "em"; if I see another "<em>" I know I can't  
match it because that's already on the stack, and on seeing "</em>"  
I pop it off the stack. In this way I can basically ignore unexpected  
tags (well, actually I output them literally as "&lt;em&gt;" etc.)  
and insert missing tags when I think I should have seen them.
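
In outline the support code looks something like this (Java; the  
WikitextLexer class and its token constants are placeholders for  
whatever the filtering lexer actually defines, only two tag pairs are  
shown, and the cleanup policy is just one plausible choice):

    import java.util.Stack;
    import org.antlr.runtime.ANTLRStringStream;
    import org.antlr.runtime.Token;

    public class Translator {
        private final Stack<String> scope = new Stack<String>();
        private final StringBuilder out = new StringBuilder();

        public String translate(String input) {
            WikitextLexer lexer = new WikitextLexer(new ANTLRStringStream(input));
            for (Token t = lexer.nextToken(); t.getType() != Token.EOF;
                    t = lexer.nextToken()) {
                switch (t.getType()) {
                    case WikitextLexer.EM_OPEN:      open("em");      break;
                    case WikitextLexer.EM_CLOSE:     close("em");     break;
                    case WikitextLexer.STRONG_OPEN:  open("strong");  break;
                    case WikitextLexer.STRONG_CLOSE: close("strong"); break;
                    default: out.append(escape(t.getText())); // catch-all tokens
                }
            }
            while (!scope.isEmpty())                 // close tags left open
                out.append("</" + scope.pop() + ">");
            return out.toString();
        }

        private void open(String tag) {
            if (scope.contains(tag)) {               // already in scope: not
                out.append(escape("<" + tag + ">")); // allowed, emit literally
            } else {
                scope.push(tag);
                out.append("<" + tag + ">");
            }
        }

        private void close(String tag) {
            if (!scope.contains(tag)) {              // never opened: literal
                out.append(escape("</" + tag + ">"));
                return;
            }
            while (!scope.peek().equals(tag))        // insert missing closers
                out.append("</" + scope.pop() + ">");
            out.append("</" + scope.pop() + ">");
        }

        private String escape(String s) {
            return s.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
        }
    }

On the bad inputs above, a driver like this turns <em>foo into  
<em>foo</em>, and foo</em> into foo&lt;/em&gt;, for example.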

All of this manual management is a bit of effort, but it does work  
and that's very important: the code is simple enough to look at and  
see that it will work. If I tried to do all of this in ANTLR then  
there would be an element of black magic; I'd never feel sure that I  
knew what ANTLR was really doing behind the scenes. But I am left  
with the lingering doubt that I might not be doing things the best  
way. Do you think that this lexer-only-plus-supporting-code approach  
is using the right tool for the job? Or am I missing something  
obvious?

Cheers,
Wincent


