[antlr-interest] Is ANTLR suitable for wiki grammar parsing?

Wed Jun 6 16:19:42 PDT 2007

OK, I am taking a slightly different approach here, and I think the same approach would work for all the wiki-type tags that I know of (though while I like wikis, I am not overly enamored with the markup language ;-). For bold, for instance, the following will work nicely and will handle the case you have below, where a stray '*' has no closing partner. Using the grammar:

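grammar wiki;

// Entry point: any mix of marked-up and plain text, up to end of input.
body	: text* EOF
	;

// Try the bold markup first; the syntactic predicate only commits to
// 'marked' when the whole BOLD DROSS+ BOLD sequence is really there.
text 	: (marked)=>marked
	| DROSS
	| WS
	| BOLD		// a stray '*' that did not open valid markup
	;

marked
	: BOLD DROSS+ BOLD
	;

WS 	: ' ' | '\t' | '\n' | '\r' 	;
BOLD	: '*' 				;
DROSS	: . 					;	// any other single character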

So the approach is that you explicitly tokenize the characters used in markup, along with those that make a markup sequence invalid. For instance, a space inside the candidate span means the '*' was not a bold marker. Then you use a syntactic predicate to check for a valid markup sequence, and just consume everything that is not part of one as plain text. You could do a similar thing with backtracking, but it is better to be explicit. I think that starting from here you should be able to add each wiki markup (do it one at a time and test it in ANTLRWorks in debug mode) following the same formulaic approach.

Then you will probably end up with something more maintainable than regular expressions, which, cool as they are, are not really meant for 'parsing'.

With this grammar it would be easy to add actions that emit HTML (or whatever it is you want) as you go, since you probably don't need a tree to deal with this.
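To make the idea concrete without generating a parser, here is a rough hand-written Python sketch of the same scheme. It is not ANTLR output and not how ANTLR implements predicates; the function names (tokenize, match_marked, to_html) are made up for illustration, and the <b> output with no HTML escaping is just an assumed stand-in for whatever actions you would really attach. It tokenizes into WS, BOLD and DROSS, uses a lookahead check in place of the (marked)=> predicate, and emits output as it goes:

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token type, matched text)


def tokenize(s: str) -> List[Token]:
    """Mirror the three lexer rules: WS, BOLD, and DROSS (anything else)."""
    tokens: List[Token] = []
    for ch in s:
        if ch in " \t\n\r":
            tokens.append(("WS", ch))
        elif ch == "*":
            tokens.append(("BOLD", ch))
        else:
            tokens.append(("DROSS", ch))
    return tokens


def match_marked(tokens: List[Token], i: int) -> int:
    """Stand-in for the (marked)=> predicate.

    Returns the index just past the closing BOLD if the sequence
    BOLD DROSS+ BOLD starts at position i, otherwise -1.
    """
    if i >= len(tokens) or tokens[i][0] != "BOLD":
        return -1
    j = i + 1
    while j < len(tokens) and tokens[j][0] == "DROSS":
        j += 1
    # Need at least one DROSS token and a closing BOLD right after the run.
    if j > i + 1 and j < len(tokens) and tokens[j][0] == "BOLD":
        return j + 1
    return -1


def to_html(s: str) -> str:
    """Emit <b>...</b> for valid bold spans; pass everything else through."""
    tokens = tokenize(s)
    out: List[str] = []
    i = 0
    while i < len(tokens):
        end = match_marked(tokens, i)
        if end != -1:
            inner = "".join(text for _, text in tokens[i + 1:end - 1])
            out.append("<b>" + inner + "</b>")
            i = end
        else:
            # Not valid markup here: consume one token as plain text.
            out.append(tokens[i][1])
            i += 1
    return "".join(out)


print(to_html("*bold* abc*de"))  # <b>bold</b> abc*de
```

As with the grammar, a space between the two '*' characters stops the span from being treated as bold, so "*two words*" passes through unchanged in this sketch.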

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Wincent Colaiuta
> Sent: Wednesday, June 06, 2007 3:15 PM
> To: Randall R Schulz
> Cc: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Is ANTLR suitable for wiki grammar
> parsing?
> 
> El 6/6/2007, a las 20:26, Randall R Schulz escribió:
> 
> > On Wednesday 06 June 2007 11:16, Martin d'Anjou wrote:
> >>> However, I cannot match something like:
> >>>
> >>> *bold* abc*de
> >>>
> >>> As it fails because there is no following '*' after de.
> >>>
> >>> And I think that this is essentially my problem.  I do want
> >>> something like
> >>>
> >>> *bold* abc*de
> >>>
> >>> To be accepted, and i'd like for the *bold* to be matched in the
> >>> bolded parser rule, but since the rest of the line doesn't match, to
> >>> simply count abc*de as a regular phrase.
> >>>
> >>> Is this possible?
> >>
> >> I am very interested in knowing if this is possible as well. I have
> >> many problems where input is very unstructured, and I am not
> >> convinced ANTLR is the right solution for those problems.
> >
> > My original feeling about the OP's problem is just this. Context-free
> > grammars are all about structure. Rigid structure, precisely defined.
> > I don't see a parser generator as the tool of choice for loosely
> > structured or imprecisely defined inputs.
> >
> > The problem is that the number of rules you'd need and the difficulty
> > in preventing unwanted interactions between those rules make this a
> > problem that verges on the insoluble with what a CFG parser generator
> > gives you.
> >
> > IMO, of course.
> 
> So what's the alternative? MediaWiki, for example, uses a very
> complicated set of hand-coded regular expressions. It works pretty
> well, but it does have its bugs and it's difficult to maintain. I'm
> hoping that the answer is not "hand-coded regular expressions"...
> 
> Cheers,
> Wincent
>