[antlr-interest] Is ANTLR suitable for wiki grammar parsing?
Wincent Colaiuta
win at wincent.com
Wed Jun 6 22:58:47 PDT 2007
On 7/6/2007, at 1:19, Jim Idle wrote:
> So, the approach is that you explicitly tokenize those characters
> used in markup, and those that mean the markup is not valid. For
> instance, a space means that wasn't a bold character. Then you
> predicate on a valid markup syntax, and just consume everything
> that is not a valid markup sequence. You could also do a similar
> thing with backtracking, but it is better to be explicit. I think
> that starting here, you should be able to add each wiki markup
> element (do it one at a time and test it in ANTLRWorks in debug
> mode) and follow the same formulaic approach.
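Jim's approach — tokenize the markup characters explicitly, then use a lookahead check (the "predicate") to decide whether they actually begin valid markup, consuming them as plain text otherwise — might be sketched in plain Python. The markup rule here (`*bold*`, invalid if a space follows the opening star) is a simplified stand-in, not real wiki syntax:

```python
def translate(text):
    """Translate a toy wiki markup to HTML.

    A '*' opens bold only if it is immediately followed by a
    non-space character and a closing '*' exists later (the
    "predicate"); otherwise it is consumed as literal text.
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == '*':
            close = text.find('*', i + 1)
            # Predicate: closing '*' exists and next char is not a space.
            if close != -1 and not text[i + 1:close][:1].isspace():
                out.append('<b>' + text[i + 1:close] + '</b>')
                i = close + 1
                continue
        out.append(ch)  # not valid markup: fall through as plain text
        i += 1
    return ''.join(out)
```

The point of the predicate is that `translate("2 * 3 = 6")` leaves the input untouched, because the space after `*` means it was never a bold marker, while `translate("a *b* c")` yields `a <b>b</b> c`.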
Thanks for all your input on this, Jim... it is really good to have
someone with so much ANTLR experience pointing the way forward.
Yes, that basic idea (tokenizing sequences that have special meaning
in the markup, and consuming non-valid markup) was what I was trying
to do; thanks for the advice on using predicates to make this explicit.
> Then you will probably end up with something more maintainable than
> regular expressions, which, cool as they are, are not really meant
> for 'parsing'.
Yes, totally agreed. It can certainly be done, and has been done many
times, but I would rather avoid this kind of thing (almost 5,000
lines of hand-crafted PHP):
<http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Parser.php?view=markup>
> Using this grammar it would be easy to add actions to spit out HTML
> or whatever it is you want, as you go, as you probably don't need a
> tree to deal with this.
Yes, I agree. In this case the wikitext has a very specific
application (translation to HTML); there's probably no need for any
kind of intermediate format, so it should be a straight translation
from one form to another...
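Since the output is straight HTML with no intermediate tree, the actions can emit fragments as the input is consumed. A minimal sketch of that streaming style, over a hypothetical `(kind, text)` token stream (the token names here are illustrative, not ANTLR-generated ones):

```python
import html

def emit_html(tokens):
    """Yield HTML fragments as wiki tokens are consumed, building
    no intermediate tree.  Tokens are (kind, text) pairs; BOLD_OPEN
    and BOLD_CLOSE are hypothetical token names for this sketch."""
    for kind, text in tokens:
        if kind == 'BOLD_OPEN':
            yield '<b>'
        elif kind == 'BOLD_CLOSE':
            yield '</b>'
        else:
            # Plain text: escape it on the way through.
            yield html.escape(text)

# Usage: join the fragments as they stream out.
tokens = [('TEXT', 'a '), ('BOLD_OPEN', '*'), ('TEXT', 'b'),
          ('BOLD_CLOSE', '*'), ('TEXT', ' & c')]
result = ''.join(emit_html(tokens))
```

In an actual ANTLR grammar the equivalent would be embedded actions on each rule that print (or append) the HTML directly, skipping tree construction entirely.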
Cheers,
Wincent