[antlr-interest] Is ANTLR suitable for wiki grammar parsing?

Wed May 23 12:16:29 PDT 2007

Hello,

It is worth mentioning that my mail client (Mozilla Thunderbird) 
handles this kind of markup very well. Having a look at its source 
could be useful (don't ask me where precisely, though!).

I see that you don't define any whitespace in your grammar. Maybe 
dealing with the input line by line could make things simpler?

What about enabling backtracking? Why not define a non-greedy (.)+ rule 
for anychars? I think the latter would match when the other rules don't. 
I'm not 100% sure, but my impression is that the generated parser 
behaves a bit differently when the alternative is inlined than when 
it's in a separate rule.

Tell me what that gives:

  phrase
      : bolded
      | underlined
      | ( options {greedy=false;} : .)+
      ;
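
To pin down the behaviour you describe independently of ANTLR, here is a 
minimal hand-written sketch in Java. It is not an ANTLR grammar and not 
generated code; the class name WikiSketch, the method render and the 
HTML-like <b>/<u> output are assumptions made purely for illustration. A 
marker only opens a span if a matching closing marker exists later in the 
input; otherwise it is emitted as an ordinary character.

```java
class WikiSketch {
    /**
     * Renders *bold* and _underlined_ spans as <b>/<u> tags (hypothetical
     * output format). A marker with no later closing partner is copied
     * through unchanged, like any other character.
     */
    static String render(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '*' || c == '_') {
                // Look for the nearest closing marker of the same kind.
                int close = input.indexOf(c, i + 1);
                // Require at least one character between the markers.
                if (close > i + 1) {
                    String tag = (c == '*') ? "b" : "u";
                    out.append('<').append(tag).append('>')
                       // Recurse so spans can nest, as in the grammar.
                       .append(render(input.substring(i + 1, close)))
                       .append("</").append(tag).append('>');
                    i = close + 1;
                    continue;
                }
            }
            // Plain character, or a marker that never closes.
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

With this sketch, the irregular example from your mail keeps its first 
bolded word and passes the leftover '*' and '_' through as plain text. 
Something like this could also serve as a reference to compare the 
generated parser's output against.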

MA

Collin VanDyck wrote:
> Hello
> 
> I'm trying to evaluate ANTLR to determine whether or not it would be a 
> good fit for a wiki that we're currently developing.
> 
> Essentially, the question boils down to how elegantly it would handle a 
> wide variety of somewhat unstructured input.  In other words, users are 
> going to be entering in rather freeform content (i.e. copying and 
> pasting from Word or some other character source), and I want ANTLR to 
> be able to accept all of the input but match special sequences.
> 
> An example of this would be:
> 
> "This is some *bold* wiki content that might also be _underlined_ in 
> places"
> 
> The special rules would simply output each character that doesn't fall 
> into a special rule, and then to recognize *bold* and _underlined_ 
> specially.
> 
> I've written a small ANTLR grammar which is able to parse this, but 
> fails pretty quickly when you do things like:
> 
> "This is some *irregular** input_"
> 
> In the latter case, I'd really just like for the first *irregular* to be 
> parsed as a bolded word, and since the other characters don't have 
> closing symbols, to be able to just treat them as fairly regular 
> characters like 'a', 'b', 'c', etc.
> 
> Is it possible and reasonable to use ANTLR for this purpose?  Can I 
> create a grammar which will accept ANYTHING, and simply be able to parse 
> out the bits and pieces that are interesting?
> 
> I'm pasting in the grammar I created.  I apologize in advance for the 
> incorrectness of it.
> 
> -Collin
> 
> ------------------
> 
> grammar WikiGrammar;
> 
> wiki
>     : phrase+
>     ;
> 
> phrase
>     : bolded
>     | underlined
>     | anychars
>     ;
>     
> bolded
>     : ASTERISK phrase ASTERISK
>     ;
>     
> underlined
>     : UNDERSCORE phrase UNDERSCORE
>     ;
>     
> anychars
>     : (CHAR)+
>     ;
> 
> UNDERSCORE
>     : '_'
>     ;   
> 
> ASTERISK
>     : '*'
>     ;
>     
> CHAR
>     : .
>     ;
> 
> 
> 
> 

-- 
Marc-André LAVERDIÈRE, B. Eng., M. A. Sc. (in progress)
Computer Security Laboratory - Laboratoire de sécurité informatique
CIISE, Université Concordia University, Montréal, Québec, Canada
www.ciise.concordia.ca

/"\
\ /    ASCII Ribbon Campaign
  X      against HTML e-mail
/ \

"Perseverance must finish its work so that you may be mature and 
complete, not lacking anything." -James 1:4