[antlr-interest] Is ANTLR suitable for wiki grammar parsing?

Wed May 23 14:24:45 PDT 2007

As this is seemingly completely context insensitive, could you not just do this with a filtering lexer? Or perhaps a TokenRewriteStream?
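
To make the filtering idea concrete, here is a rough hand-written sketch in plain Java. It is not ANTLR-generated code and does not use the ANTLR runtime; it only mimics what a filtering lexer does: at each position, try the special sequences, and if none matches, pass the character through. The class name WikiFilter, the method names, and the HTML-style <b>/<u> output are all assumptions made for illustration. It does not handle nesting, and it will also treat the underscores in something like a_b_c as markup.

```java
public class WikiFilter {
    // Index of the closing delimiter on the same line, or -1 if there is
    // none or if the delimited text would be empty (e.g. "**").
    static int findClose(String s, int open, char delim) {
        for (int i = open + 1; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\n') return -1;
            if (c == delim) return (i > open + 1) ? i : -1;
        }
        return -1;
    }

    // Wraps *text* in <b>...</b> and _text_ in <u>...</u>; every other
    // character, including unmatched '*' and '_', is copied unchanged.
    static String render(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '*' || c == '_') {
                int close = findClose(input, i, c);
                if (close > 0) {
                    String tag = (c == '*') ? "b" : "u";
                    out.append('<').append(tag).append('>')
                       .append(input, i + 1, close)
                       .append("</").append(tag).append('>');
                    i = close + 1;   // resume after the closing delimiter
                    continue;
                }
            }
            out.append(c);           // "filter" fallback: plain character
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: This is some <b>irregular</b>* input_
        System.out.println(render("This is some *irregular** input_"));
    }
}
```

With the irregular input from the original question, only the first *irregular* is treated as bold; the leftover '*' and '_' come out as ordinary characters.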

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org
> [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Marc-André Laverdière
> Sent: Wednesday, May 23, 2007 12:16 PM
> To: Collin VanDyck
> Cc: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Is ANTLR suitable for wiki grammar
> parsing?
> 
> Hello,
> 
> It is worth mentioning that my mail client (Mozilla Thunderbird)
> deals with it very well. Maybe having a look at their source could be
> useful (don't ask me where precisely though!).
> 
> I see that you don't define any whitespace in your grammar. Maybe
> dealing with the input line by line could make things simpler?
> 
> What about enabling backtracking? Why not define a non-greedy (.)+
> alternative in place of anychars? I think it would match whenever the
> other rules don't.
> I'm not 100% sure, but my impression is that the generated parser
> behaves a bit differently with it inline than when it's in a separate
> rule.
> 
> Tell me what that gives:
> 
>   phrase
>       : bolded
>       | underlined
>       | ( options {greedy=false;} : .)+
>       ;
> 
> MA
> 
> Collin VanDyck wrote:
> > Hello
> >
> > I'm trying to evaluate ANTLR to determine whether or not it would be a
> > good fit for a wiki that we're currently developing.
> >
> > Essentially, the question boils down to how elegantly it would handle a
> > wide variety of somewhat unstructured input.  In other words, users are
> > going to be entering rather freeform content (e.g. copying and
> > pasting from Word or some other character source), and I want ANTLR to
> > be able to accept all of the input but match special sequences.
> >
> > An example of this would be:
> >
> > "This is some *bold* wiki content that might also be _underlined_ in
> > places"
> >
> > The grammar would simply output each character that doesn't fall
> > into a special rule, and would recognize *bold* and _underlined_
> > specially.
> >
> > I've written a small ANTLR grammar which is able to parse this, but
> > fails pretty quickly when you do things like:
> >
> > "This is some *irregular** input_"
> >
> > In the latter case, I'd really just like for the first *irregular* to be
> > parsed as a bolded word, and since the other characters don't have
> > closing symbols, to be able to just treat them as fairly regular
> > characters like 'a', 'b', 'c', etc.
> >
> > Is it possible and reasonable to use ANTLR for this purpose?  Can I
> > create a grammar which will accept ANYTHING, and simply be able to parse
> > out the bits and pieces that are interesting?
> >
> > I'm pasting in the grammar I created.  I apologize in advance for the
> > incorrectness of it.
> >
> > -Collin
> >
> > ------------------
> >
> > grammar WikiGrammar;
> >
> > wiki
> >     : phrase+
> >     ;
> >
> > phrase
> >     : bolded
> >     | underlined
> >     | anychars
> >     ;
> >
> > bolded
> >     : ASTERISK phrase ASTERISK
> >     ;
> >
> > underlined
> >     : UNDERSCORE phrase UNDERSCORE
> >     ;
> >
> > anychars
> >     : (CHAR)+
> >     ;
> >
> > UNDERSCORE
> >     : '_'
> >     ;
> >
> > ASTERISK
> >     : '*'
> >     ;
> >
> > CHAR
> >     : .
> >     ;
> >
> >
> >
> >
> 
> --
> Marc-André LAVERDIÈRE, B. Eng., M. A. Sc. (in progress)
> Computer Security Laboratory - Laboratoire de sécurité informatique
> CIISE, Université Concordia University, Montréal, Québec, Canada
> www.ciise.concordia.ca
> 
> /"\
> \ /    ASCII Ribbon Campaign
>   X      against HTML e-mail
> / \
> 
> "Perseverance must finish its work so that you may be mature and
> complete, not lacking anything." -James 1:4