[antlr-interest] Is ANTLR suitable for wiki grammar parsing?

Jim Idle jimi at temporal-wave.com
Wed Jun 6 19:06:54 PDT 2007



> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Wincent Colaiuta
> Sent: Wednesday, June 06, 2007 3:15 PM
 
> So what's the alternative? MediaWiki, for example, uses a very
> complicated set of hand-coded regular expressions. It works pretty
> well, but it does have its bugs and it's difficult to maintain. I'm
> hoping that the answer is not "hand-coded regular expressions"...


So, to make this perhaps more explicit, here is a grammar for the first
few constructs of Wikipedia/MediaWiki. The only thing really left to do
is deal with those constructs that allow other marked-up text within
them, for instance if you can put bold text in headers. Other than
that, I think it is just a matter of building on this. You could also
avoid some overhead by not simply predicating the whole of (marked).

Jim



grammar wiki;

body: text* EOF
	;
	
text: (marked)=>marked
	| .
	;
	
marked
	:             IBM IBM space_text+ IBM IBM               // Italic
	|         IBM IBM IBM space_text+ IBM IBM IBM           // BOLD
	| IBM IBM IBM IBM IBM space_text+ IBM IBM IBM IBM IBM   // BOLD and Italic

	|               EQ EQ space_text+ EQ EQ                 // Heading
	|            EQ EQ EQ space_text+ EQ EQ EQ              // Level 2
	|         EQ EQ EQ EQ space_text+ EQ EQ EQ EQ           // Level 3
	|      EQ EQ EQ EQ EQ space_text+ EQ EQ EQ EQ EQ        // Level 4

	| LBRACKET
		(
			  LBRACKET space_text+ (BAR space_text+)? RBRACKET RBRACKET	// Internal link
			| HTTP DROSS+ WS+ DROSS space_text* RBRACKET			// External link with description
		)

	| HTTP ((DROSS)=> DROSS)+                               // Bare URL
	| HLINE                                                 // Horizontal rule
	| HYPHEN HYPHEN TILDE TILDE TILDE ((TILDE)=> TILDE)?    // Signature: --~~~ or --~~~~
	| BULLET space_text+ EOL (BULLET space_text+ EOL)+      // Bulleted list, two or more items
	;

space_text
	: DROSS
	| WS
	;
			
WS 			:	' ' | '\t' 		;
EOL			:  	'\r'? '\n' 		;
BULLET		:	'*' 			;
EQ			: 	'='				;
LBRACKET	:	'['				;
RBRACKET	:	']'				;
IBM			:	'\''			;
BAR			:	'|'				;
HTTP		:	('h' | 'H')('t' | 'T')('t' | 'T')('p' | 'P')'://'	;
HLINE		: '----'			;
HYPHEN		: '-'				;
TILDE		: '~'				;
DROSS		: . 				;
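
The remaining step mentioned above (marked-up text nested inside another
construct, such as bold inside a heading) might be approached along the
lines sketched below. This is an untested fragment, not part of the
grammar as posted: the rule names heading_text and inline are
hypothetical, and it assumes the first heading alternative of marked is
changed to use heading_text+ in place of space_text+.

```
// Hypothetical replacement for the first heading alternative of marked:
//	| EQ EQ heading_text+ EQ EQ

heading_text
	: inline		// bold or italic nested inside the heading
	| space_text	// ordinary heading text
	;

inline
	: IBM IBM IBM space_text+ IBM IBM IBM	// bold
	| IBM IBM space_text+ IBM IBM		// italic
	;
```

Since text still tries (marked)=> first, a heading whose nested markup
is never closed should fail the predicate and fall through to the '.'
alternative, as it does for any other unmatched construct.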


