[antlr-interest] Doing translation in the lexer (the right tool for the job?)
Wincent Colaiuta
win at wincent.com
Tue Jun 19 23:57:15 PDT 2007
One of the many translation tasks I'd like to use ANTLR for is
translating wikitext to HTML. Two things make this quite difficult:
1. The translator has to be extremely tolerant of malformed input
(users may not understand the wikitext syntax or may deliberately
subvert it) and make reasonable "cleanup" attempts for bad input
(such as closing tags that were left open by the user).
2. Wikitext structure is not unlike HTML structure, which means it is
highly recursive and is very sensitive to nesting and ordering issues.
This means that the translator must be able to handle things like:
- improperly interleaved tags: <em><strong></em></strong>
- missing closing tags: <em>foo
- missing opening tags: foo</em>
- mismatched tags (really just a combination of the above): <em>foo</strong>
And it must enforce scoping rules that can be quite complicated at
times; for example:
- an <em> span may appear inside a <strong> span;
- and a <strong> span may appear inside an <em> span;
- but an <em> span may not appear inside an <em> span, so both
<em><em>foo</em></em> and <em><strong><em>foo</em></strong></em> are
invalid
- a <blockquote> may contain other block level elements such as <p>
and even another <blockquote>;
- but <p> may not contain <blockquote>
- etc, etc, etc...
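To make those scoping rules concrete, here's a rough Python-ish sketch (not my actual code; the rule table is invented and far from a complete HTML content model):

```python
# Hypothetical nesting rules: which child tags each element may contain.
ALLOWED_CHILDREN = {
    "blockquote": {"p", "blockquote"},  # blockquote may contain block elements
    "p": {"em", "strong"},              # p may contain inline elements only
    "em": {"strong"},
    "strong": {"em"},
}

# Tags that may not nest inside themselves at ANY depth,
# so <em><strong><em>... is invalid too.
NO_SELF_NEST = {"em", "strong"}

def may_open(tag, stack):
    """Check a candidate opening tag against the stack of open elements."""
    if stack and tag not in ALLOWED_CHILDREN.get(stack[-1], set()):
        return False
    if tag in NO_SELF_NEST and tag in stack:
        return False
    return True
```

So may_open("blockquote", ["blockquote"]) is fine, but may_open("em", ["p", "em", "strong"]) is rejected because <em> is already open further up.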
So if it weren't for the fact that the translator has to gracefully
handle bad input the task would be significantly easier. In other
words, if ANTLR could assume that all input were perfectly formed and
just throw errors on encountering bad input then it wouldn't be too
hard. But even assuming syntactically perfect input there is still
the issue of the complicated scoping rules.
My first naïve attempt at doing this had me trying to express scope
restrictions using only standard parser rules, but I very quickly got
into trouble with recursion due to non-LL(*) decisions and
non-determinisms.
So I then tried using predicates combined with a stack stored in a
global scope. Basically, on entering a new rule, information would be
pushed onto the stack describing which tags were explicitly allowed
and which were explicitly disallowed. Semantic predicates could then
be used in subsequent rules to check the stack and decide whether a
particular alternative was allowed to match.
For example, on entering a <blockquote>, the list of allowed tags is
basically any block-level tag (<p>, <blockquote> etc) so those are
pushed as a list onto the stack. On entering a <p>, the list of
allowed tags is basically any inline element (<em>, <strong> etc).
The predicate's job is then basically to walk the stack and confirm
that (a) a given tag is explicitly allowed, and (b) it is not
explicitly disallowed in any previous level of the stack.
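In Python-ish pseudocode (standing in for the grammar actions; names here are invented), the stack and the predicate's walk looked roughly like this:

```python
# Each frame records what the rule we entered explicitly allows or
# explicitly disallows. (A sketch of the abandoned design, not real code.)
scope_stack = []

def enter(allowed, disallowed=()):
    """Push a frame on entering a rule, e.g. enter(["p", "blockquote"])
    for <blockquote>, or enter(["strong"], ["em"]) for <em>."""
    scope_stack.append({"allowed": set(allowed),
                        "disallowed": set(disallowed)})

def leave():
    scope_stack.pop()

def tag_permitted(tag):
    """The predicate: (a) explicitly allowed at the current level, and
    (b) not explicitly disallowed at any enclosing level."""
    if not scope_stack or tag not in scope_stack[-1]["allowed"]:
        return False
    return all(tag not in frame["disallowed"] for frame in scope_stack)
```

The point of (b) is that entering <em> can disallow "em" for every level beneath it, however deeply nested.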
I soon found that the predicates were being hoisted/lowered into
unexpected places which prevented that idea from working. Perhaps
someone with more intimate knowledge of how ANTLR does this would be
able to make the solution work, but I didn't see much hope of getting
it to work so I abandoned it. I also wasn't even sure that my stack
of allowed/disallowed tags was a good design, even in the abstract
sense.
So finally, the thing I'm doing is jettisoning the parser entirely
and doing absolutely everything in the lexer. Ok, not exactly: I've
made an extremely simple filtering lexer which has no predicates, no
actions, just spits out tokens and has a catch-all rule at the bottom
so that all input characters are guaranteed to come out as tokens on
the other end.
Then in my support code I ask the lexer for tokens one at a time and
perform translation "by hand". Once again I maintain a stack, but
this time a slightly simpler one. Instead of maintaining a list of
allowed and disallowed tags at each level I instead maintain a stack
which indicates my current scope, kind of like a CSS selector: ie. if
I've previously seen <blockquote>, <p> and <em> then my stack
contains "blockquote", "p" and "em"; if I see another "<em>" I know I
can't match it because that's already on the stack, and on seeing
"</em>" I pop it off the stack. In this way I can basically ignore
unexpected tags (well, actually I output them literally as
"<em>" etc) and insert missing tags when I think I should have
seen them.
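A minimal sketch of that hand-rolled loop (Python-ish pseudocode again; a list of (type, text) pairs stands in for the ANTLR token stream, and the token shapes are invented):

```python
# Translate-by-hand: walk the tokens, tracking open tags on a stack.
# Unexpected tags are emitted literally (escaped); tags still open at
# end of input are closed automatically.
def translate(tokens):
    stack, out = [], []
    for kind, text in tokens:
        if kind == "OPEN":
            if text in stack:                  # e.g. <em> inside <em>
                out.append("&lt;%s&gt;" % text)
            else:
                stack.append(text)
                out.append("<%s>" % text)
        elif kind == "CLOSE":
            if text in stack:
                while stack[-1] != text:       # close interleaved tags first
                    out.append("</%s>" % stack.pop())
                stack.pop()
                out.append("</%s>" % text)
            else:                              # closing tag never opened
                out.append("&lt;/%s&gt;" % text)
        else:                                  # plain text
            out.append(text)
    while stack:                               # close tags left open
        out.append("</%s>" % stack.pop())
    return "".join(out)
```

So "<em>foo" comes out as "<em>foo</em>", and "foo</em>" comes out with the stray closing tag escaped rather than matched.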
All of this manual management is a bit of effort, but it does work
and that's very important: the code is simple enough to look at and
see that it will work. If I tried to do all of this in ANTLR then
there would be an element of black magic; I'd never feel sure that I
knew what ANTLR was really doing behind the scenes. But I am left
with the lingering doubt that I might not be doing things the best
way. Do you think that this lexer-only-plus-supporting-code approach
is using the right tool for the job? Or am I missing something
obvious?
Cheers,
Wincent