[antlr-interest] Is ANTLR suitable for wiki grammar parsing?

Wincent Colaiuta win at wincent.com
Wed Jun 6 22:58:47 PDT 2007


On 7/6/2007, at 1:19, Jim Idle wrote:

> So, the approach is that you explicitly tokenize those characters  
> used in markup, and those that mean the markup is not valid. For  
> instance spaces mean that wasn't a bold character. Then you  
> predicate on a valid markup syntax, and just consume everything  
> that is not a valid markup sequence. You could also do a similar  
> thing with backtracking, but it is better to be explicit. I think  
> that starting here, you should be able to add each wiki markup (do  
> it one at a time and test it in ANTLRWorks in debug mode) and  
> follow the same formulaic approach.

Thanks for all your input on this, Jim... it is really good to have  
someone with so much ANTLR experience pointing the way forward.

Yes, that basic idea (tokenizing sequences that have special meaning  
in the markup, and consuming non-valid markup) was what I was trying  
to do; thanks for the advice on using predicates to make this explicit.
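To make sure I've understood the idea, here is a rough, untested sketch of what I think Jim is describing, in ANTLR 3 syntax (the grammar name, rule names, and the choice of '' as the bold marker are just my own guesses for illustration):

```antlr
grammar Wiki;

// Lexer: give each markup-significant character its own token, and
// tokenize the characters (like space) that can invalidate markup,
// so the parser can decide whether a sequence is really markup.
APOS    : '\'' ;
SPACE   : ' ' ;
NEWLINE : '\r'? '\n' ;
WORD    : (~('\'' | ' ' | '\r' | '\n'))+ ;

// Parser: predicate on a valid markup sequence first; anything that
// fails the predicate falls through and is consumed as plain text.
article : block* EOF ;

block   : bold
        | plain
        ;

// ''...'' is bold only when the opening quotes are followed by
// something other than a space; otherwise the quotes are just text.
bold    : (APOS APOS ~SPACE)=> APOS APOS plain APOS APOS ;

plain   : (WORD | SPACE | APOS)+ ;
```

If that's roughly right, each additional wiki construct would get its own predicated alternative in `block`, tested one at a time in ANTLRWorks as you suggest.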

> Then you will probably end up with something more maintainable than  
> regular expressions, which, cool as they are, are not for 'parsing'  
> with really.

Yes, totally agreed. It can certainly be done, and has been done many  
times, but I would rather avoid this kind of thing (almost 5,000  
lines of hand-crafted PHP):

<http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Parser.php?view=markup>

> Using this grammar it would be easy to add actions to spit out HTML  
> or whatever it is you want, as you go, as you probably don't need a  
> tree to deal with this.

Yes, I agree. In this case the wikitext has a very specific  
application (translation to HTML), so there's probably no need for  
any kind of intermediate format; it should just be a straight  
translation from one form to another...
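Concretely, I imagine that means embedding print actions directly in the rules, something like this (again untested; `out` is a hypothetical writer object I'm assuming would be supplied to the generated parser, and the rule shapes follow my sketch above rather than anything from the thread):

```antlr
// Emit HTML as we match, with no intermediate tree:
bold  : (APOS APOS ~SPACE)=> APOS APOS { out.print("<b>"); }
        plain
        APOS APOS { out.print("</b>"); }
      ;

plain : ( w=WORD  { out.print($w.text); }
        | SPACE   { out.print(" "); }
        )+
      ;
```

Each new markup construct would just get its own pair of open/close actions in the same style.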

Cheers,
Wincent
