[antlr-interest] Is ANTLR suitable for wiki grammar parsing?
Marc-André Laverdière
ma_laver at ciise.concordia.ca
Wed May 23 12:16:29 PDT 2007
Hello,
It is worth mentioning that my mail client (Mozilla Thunderbird)
handles this kind of markup very well. Having a look at its source could
be useful (don't ask me where precisely, though!).
I see that you don't define any whitespace in your grammar. Maybe
dealing with the input line by line could make things simpler?
What about enabling backtracking? Why not define a non-greedy (.)+ rule
for anychars? I think the latter would match when the other rules don't.
I'm not 100% sure, but my impression is that the generated parser
behaves a bit differently when the subrule is written inline than when
it's in a separate rule.
Tell me what that gives:
phrase
    : bolded
    | underlined
    | ( options {greedy=false;} : . )+
    ;
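
To make the intent of that catch-all alternative concrete, here is a
minimal hand-written sketch in Python (not ANTLR, and not what ANTLR
would generate). It assumes the behaviour Collin describes: try the
bolded and underlined alternatives first, and if a marker has no closing
partner, back up and treat it as an ordinary character. The names
parse_wiki, parse_phrases and MARKERS are made up for this illustration.

```python
# Hypothetical sketch of "try the special rules, else fall back to any char".
# Node shapes: ("text", str), ("bold", [nodes]), ("underline", [nodes]).
MARKERS = {"*": "bold", "_": "underline"}


def parse_phrases(text, pos, closer=None):
    """Parse phrases starting at pos until `closer` or end of input.

    Returns (nodes, next_pos), or None when `closer` was required
    but never found, which tells the caller to backtrack.
    """
    nodes = []
    plain = []  # pending run of ordinary characters

    def flush():
        if plain:
            nodes.append(("text", "".join(plain)))
            plain.clear()

    while pos < len(text):
        ch = text[pos]
        if ch == closer:
            flush()
            return nodes, pos + 1  # consume the closing marker
        if ch in MARKERS:
            # Speculatively parse a marked-up phrase after the marker.
            attempt = parse_phrases(text, pos + 1, closer=ch)
            if attempt is not None and attempt[0]:  # closed and non-empty
                inner, pos = attempt
                flush()
                nodes.append((MARKERS[ch], inner))
                continue
        # Fallback alternative: the character is plain text.
        plain.append(ch)
        pos += 1

    if closer is not None:
        return None  # ran out of input before the closing marker
    flush()
    return nodes, pos


def parse_wiki(text):
    """Accept any input; unmatched markers come back as plain text."""
    nodes, _ = parse_phrases(text, 0)
    return nodes


if __name__ == "__main__":
    print(parse_wiki("This is some *irregular** input_"))
```

On the irregular example this yields a text node, a bold node containing
"irregular", and a final text node "* input_". The repeated speculative
parsing can get slow on input with many unmatched markers, so take it as
an illustration of the idea rather than something to ship.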
MA
Collin VanDyck wrote:
> Hello
>
> I'm trying to evaluate ANTLR to determine whether or not it would be a
> good fit for a wiki that we're currently developing.
>
> Essentially, the question boils down to how elegantly it would handle a
> wide variety of somewhat unstructured input. In other words, users are
> going to be entering in rather freeform content (i.e. copying and
> pasting from Word or some other character source), and I want ANTLR to
> be able to accept all of the input but match special sequences.
>
> An example of this would be:
>
> "This is some *bold* wiki content that might also be _underlined_ in
> places"
>
> The rules would simply output each character that doesn't fall
> into a special rule, and recognize *bold* and _underlined_
> specially.
>
> I've written a small ANTLR grammar which is able to parse this, but
> fails pretty quickly when you do things like:
>
> "This is some *irregular** input_"
>
> In the latter case, I'd really just like for the first *irregular* to be
> parsed as a bolded word, and since the other characters don't have
> closing symbols, to be able to just treat them as fairly regular
> characters like 'a', 'b', 'c', etc.
>
> Is it possible and reasonable to use ANTLR for this purpose? Can I
> create a grammar which will accept ANYTHING, and simply be able to parse
> out the bits and pieces that are interesting?
>
> I'm pasting in the grammar I created. I apologize in advance for the
> incorrectness of it.
>
> -Collin
>
> ------------------
>
> grammar WikiGrammar;
>
> wiki
> : phrase+
> ;
>
> phrase
> : bolded
> | underlined
> | anychars
> ;
>
> bolded
> : ASTERISK phrase ASTERISK
> ;
>
> underlined
> : UNDERSCORE phrase UNDERSCORE
> ;
>
> anychars
> : (CHAR)+
> ;
>
> UNDERSCORE
> : '_'
> ;
>
> ASTERISK
> : '*'
> ;
>
> CHAR
> : .
> ;
>
>
>
>
--
Marc-André LAVERDIÈRE, B. Eng., M. A. Sc. (in progress)
Computer Security Laboratory - Laboratoire de sécurité informatique
CIISE, Université Concordia University, Montréal, Québec, Canada
www.ciise.concordia.ca
/"\
\ /  ASCII Ribbon Campaign
 X   against HTML e-mail
/ \
"Perseverance must finish its work so that you may be mature and
complete, not lacking anything." -James 1:4