[antlr-interest] Wiki Markup Like Grammar Example

Kirby Bohling kirby.bohling at gmail.com
Mon Mar 1 16:03:03 PST 2010


Eric,

   I have a grammar for Wikimedia that recognizes ~90% of the pages on
Wikipedia.  It probably needs quite a bit of tender loving care.  I
can't release it, because I did it on the company dime, and I haven't
yet won the argument to release it.  Sorry.

   The key question is: Do you expect your users to have perfect
markup, or do you need to gracefully handle *anything* given to you?

   If users are expected to have perfect markup, and you can tweak the
markup to be easy to parse, the problem is fairly straightforward.
If you can make the grammar public, I'd happily look at it on or off
list.

   I wanted to parse Wikimedia markup to extract text with higher
semantic quality for some NLP research.  I spent a bunch of time on
this; there are a number of ANTLR grammars for it, but none of the
ones I saw were in a working state.

The approach I took was:

Lex every character that can ever be treated as "special" as an
individual token type.  If a character can have a special meaning
based upon its position in a line, treat those cases independently
(so '=' at the beginning of a line vs. anywhere else means something
different; same for '*': some were STAR tokens, others were
LIST_ITEM_START tokens).
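
For example (a minimal sketch in ANTLR 3 syntax, assuming a Java
target; the rule names are mine, not from my actual grammar):

    // Gated predicates: '*' at column 0 starts a list item,
    // anywhere else it's a plain STAR.  Same idea for '='.
    LIST_ITEM_START : {getCharPositionInLine() == 0}?=> '*' ;
    HEADING_START   : {getCharPositionInLine() == 0}?=> '=' ;

    STAR   : '*' ;
    EQUALS : '=' ;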

I treated the "bold" and "italics" cases as toggles (that's the way
Wikimedia's PHP code handles it).  You later walk the token stream to
identify whether each toggle is an "on" or an "off" case.  The PHP
code is also careful to account for an inline <b> or </b>.  If all
you are doing is translating to HTML, that makes sense.
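
So the lexer stays dumb and just emits toggle tokens; pairing them up
happens in a later pass.  Roughly (token names are mine):

    // The lexer doesn't decide on/off; a later walk over the
    // token stream pairs these up, the way MediaWiki's PHP does.
    BOLD_TOGGLE   : '\'\'\'' ;   // '''
    ITALIC_TOGGLE : '\'\'' ;     // ''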

All non-newline whitespace I lexed into whitespace tokens.  From
there, the trickiest part was realizing that multi-character markup
couldn't be handled as single tokens when it could be ambiguous.  So
distinguishing {| vs. {{ vs. {{{ couldn't be done in the lexer; it
had to be done in the parser.  Otherwise it was impossible to match
both {{{{{Foo}}} Bar}} and {{{{{Foo}} Bar}}}: the lexer is greedy,
and is incapable of matching the former, only the latter.
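
Concretely, lex each brace as its own token and let the parser decide
how a run of braces splits.  A sketch (the body rules are
placeholders, and in practice this needs backtracking or syntactic
predicates since the openers share a prefix):

    LBRACE : '{' ;
    RBRACE : '}' ;
    PIPE   : '|' ;

    // Try the longest opener first; arg_body/template_body/table_body
    // stand in for the real content rules.
    template_arg : LBRACE LBRACE LBRACE arg_body RBRACE RBRACE RBRACE ;
    template     : LBRACE LBRACE template_body RBRACE RBRACE ;
    table        : LBRACE PIPE table_body PIPE RBRACE ;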

Finally, I had a token type for "INLINE_TEXT" that was a constructed
(imaginary) token type.  All of the characters that could never mean
anything special were lexed into "TEXT" tokens.  In my "I give up,
your markup is busted" rule, which sat right near the topmost rule,
I'd push any markup I couldn't parse with the sane rules underneath
an "INLINE_TEXT" node.  So anything underneath an "INLINE_TEXT" I
knew to just push out after escaping for the output.
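
In ANTLR 3 terms that's an imaginary token plus a tree rewrite,
something like this (assuming output=AST; the alternatives listed
are illustrative, one per real token type):

    tokens { INLINE_TEXT; }   // imaginary: never emitted by the lexer

    // Sweep stray tokens under one INLINE_TEXT node; the emitter
    // just escapes and prints whatever is underneath it.
    bad_markup_catchall
        : ( ts+=TEXT | ts+=STAR | ts+=EQUALS /* ...etc... */ )+
          -> ^(INLINE_TEXT $ts+)
        ;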

I had a couple of things that were very difficult, including the
<nowiki> tag.  That I just lexed from <nowiki> to </nowiki>, with the
plan that I would write custom grammar/handling for those tokens
separately.  I saw several attempts that disabled token generation
and had the grammar maintain state; from what I saw they were just
way too complex, trying to cram way too much logic into the lexer.
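
Swallowing the whole thing as one opaque token is a one-liner with a
non-greedy loop:

    // Everything between the tags becomes a single token; a
    // separate pass deals with its contents later.
    NOWIKI
        : '<nowiki>' ( options { greedy = false; } : . )* '</nowiki>'
        ;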

Finally, plan on making a ton of things optional.  Normal people
can't write things that parsers are happy with, so virtually every
"close" tag had to be optional, even if that meant just reading until
EOF and accepting it.  A lot of <ref> tags would run until the end of
the document because they weren't closed correctly, or a bullet item
on the last line wouldn't end with a newline, it ended at EOF.  Lots
of rules needed lots of flexibility.  If you have a bunch of sample
text, I'd plan on brute-force testing your grammar against it.  I
downloaded the Wikimedia dumps, wrote an XML parser, fed the text to
my parser, and wrote out a dump file for every page I couldn't parse.
I slowly fixed each case one by one, and eventually got pretty solid
coverage.
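
The "optional closer" pattern mostly reduces to a trailing '?'
(token and rule names here are illustrative):

    ref_tag   : REF_OPEN inline* REF_CLOSE? ;      // unclosed <ref> runs on
    list_item : LIST_ITEM_START inline* NEWLINE? ; // last line may lack '\n'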

The other thing I found very useful was to imagine how I would write
the code by hand, and then think about how I could get ANTLR to
generate code that looked like that.  One problem I had, but never
got a chance to try my solution on, was mismatched open/closes.  In
Wikimedia, if you have the following:

{{ foo [[ bar }} ]]

it considers [[ bar }} ]] to be a link, and you have a dangling {{
at the beginning of the text.  For that case, I wanted to use a scope
with an isTerminated(tokenStream) function, so every loop would
continue as long as isTerminated returned false.  Each time I saw a
new open, I'd push onto the scope stack the close token that matches
the open just seen.  I had a ton of pathological cases in my attempts
to work around this; far too many of them ended up in the
"bad_markup_catchall" rule, which I needed if I wanted it to work on
production Wikitext.  I think the scope approach will work, because I
can see that ANTLR outputs rules that are very similar to the kind of
code in Wikimedia's PHP.  It's just much faster, and far easier to
maintain.  As a bonus, it's not even in PHP (*shudders*)....
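
I never got to test it, but the shape would be something like this
(Java target; I use a plain stack in @members rather than ANTLR's
dynamic scopes to show the idea, and LINK_OPEN/LINK_CLOSE/inline are
illustrative names):

    @members {
      // Stack of token types that would close the spans we're inside.
      java.util.Deque<Integer> closers = new java.util.ArrayDeque<Integer>();

      boolean isTerminated() {
        // True if the next token closes this span or any enclosing one.
        return closers.contains(input.LA(1));
      }
    }

    link
        : LINK_OPEN { closers.push(LINK_CLOSE); }
          ( {!isTerminated()}?=> inline )*
          { closers.pop(); }
          LINK_CLOSE?   // tolerate the dangling-open case
        ;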

Kirby



On Sun, Feb 28, 2010 at 6:44 PM, Eric Webb <eric at collectivegenius.net> wrote:
> I am working on a note taking language (think wiki markup, but somewhat
> more specialized), and I am trying to use Antlr 3.2 to parse and create
> an AST of the content/structure.
>
> Does anyone have any sample wiki (or similar) grammar that works in
> Antlr 3.2?  I found this:
>
> http://dirkriehle.com/publications/2008/wiki-creole/
>
> which is for Creole 1.0 (which is meant to be a standard Wiki markup
> language).  Unfortunately, it doesn't work for me.
>
> Thanks in advance.
>
> cheers,
> eric
>
>
> ** More Information for the Interested **
>
>
> I have a working grammar that can parse document sections and
> paragraphs, for example:
>
>    *section 1
>    ---
>
>    this is a paragraph in the section
>
>    section 1.1
>    ---
>
>    nested sections!
>
>    ---
>    ---
>    *
>
>
> It works great (after a bit of trial and error and reading The
> Definitive ANTLR Reference and giving gUnit a spin).  Now, I am trying
> to start supporting other standard wiki type things in my grammar, like:
>
> Lists:
>
>  * this is a list item
>  * this is another list item
>
> Formatting:
>
> this is **bold**!
> this is //italic//.
>
> The problem is, my grammar for the sections/paragraphs above is
> "word" based (ie: I create WORD tokens with the lexer), while the Creole
> example tokenizes the stream into character tokens to be able to
> identify the list, formatting, and other delimiters.
>
> I am hoping for a working example that shows some best practices for a
> language like wiki markup before I go about re-structuring my grammar to
> support these other things (which I think I need to do).
>
> Or, if I am totally misguided, let me know that too.
>
> cheers,
> eric

