[antlr-interest] Is ANTLR suitable for wiki grammar parsing?

Wincent Colaiuta win at wincent.com
Thu Jun 7 02:56:35 PDT 2007


On 7/6/2007, at 4:06, Jim Idle wrote:

> grammar wiki;
>
> body: text* EOF
> 	;
> 	
> text: (marked)=>marked
> 	| .
> 	;
> 	
> marked	
> 	:       IBM IBM space_text+ IBM IBM                     // Italic
> 	|     IBM IBM IBM space_text+ IBM IBM IBM               // BOLD
> 	| IBM IBM IBM IBM IBM space_text+ IBM IBM IBM IBM IBM   // BOLD and Italic
> 	
> 	|          EQ EQ space_text+ EQ EQ                      // Heading
> 	|        EQ EQ EQ space_text+ EQ EQ EQ                  // Level 2
> 	|      EQ EQ EQ EQ space_text+ EQ EQ EQ EQ              // Level 3
> 	|    EQ EQ EQ EQ EQ space_text+ EQ EQ EQ EQ EQ          // Level 4
> 	
> 	| LBRACKET
> 		(
> 			  LBRACKET space_text+ (BAR space_text+)? RBRACKET RBRACKET  // Internal link
> 			| HTTP DROSS+ WS+ DROSS space_text* RBRACKET                 // External link with description
> 		)
> 	
> 	| HTTP ((DROSS)=> DROSS)+
> 	| HLINE
> 	| HYPHEN HYPHEN TILDE TILDE TILDE ((TILDE)=> TILDE)?
> 	| BULLET space_text+ EOL (BULLET space_text+ EOL)+
> 	;
>
> space_text
> 	: DROSS
> 	| WS
> 	;
> 			
> WS 			:	' ' | '\t' 		;
> EOL			:  	'\r'? '\n' 		;
> BULLET		:	'*' 			;
> EQ			: 	'='				;
> LBRACKET	:	'['				;
> RBRACKET	:	']'				;
> IBM			:	'\''			;
> BAR			:	'|'				;
> HTTP		:	('h' | 'H')('t' | 'T')('t' | 'T')('p' | 'P') '://'	;
> HLINE		: '----'			;
> HYPHEN		: '-'				;
> TILDE		: '~'				;
> DROSS		: . 				;

Any ideas how to make this grammar handle inputs like this one?

   of all the protocols, http is my favorite

Here the generated lexer will complain when it sees "http" followed  
by a space:

   mismatched character ' ' expecting ':'

I tried a number of ways to cope with this. For example, changing the  
lookahead length of the HTTP rule:

   HTTP
   options { k=7; }
     : ('h' | 'H')('t' | 'T')('t' | 'T')('p' | 'P') '://' ;

Naturally that doesn't help: k sets the upper limit on lookahead;
it doesn't set a minimum amount of lookahead that must be consumed
before making a prediction.

I also tried turning backtracking on for that rule (no effect):

   HTTP
   options { backtrack=true; }
     : ('h' | 'H')('t' | 'T')('t' | 'T')('p' | 'P') '://' ;

I also tried using a syntactic predicate (no effect):

   HTTP: (HTTP_FRAG)=> HTTP_FRAG;
   fragment HTTP_FRAG: ('h' | 'H')('t' | 'T')('t' | 'T')('p' | 'P') '://' ;

I also tried the longer, less readable form written without using a  
fragment:

   HTTP: (('h' | 'H')('t' | 'T')('t' | 'T')('p' | 'P') '://')=>
         ('h' | 'H')('t' | 'T')('t' | 'T')('p' | 'P') '://';

I gather that a syntactic predicate doesn't help here because it's  
only used for prioritizing alternatives, and here there are no  
alternatives from which to choose.
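
For contrast, the parser side of the quoted grammar is exactly where a
syntactic predicate does have work to do, because there the decision
really is between competing alternatives:

   text: (marked)=>marked   // try the structured markup first...
       | .                  // ...otherwise accept any single token
       ;

In the lexer's HTTP rule there is no such fork, so the predicate has
nothing to disambiguate.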

I understand that gated *semantic* predicates, on the other hand, can  
be used because they are not so much about prioritizing alternatives  
as about turning alternatives on and off. But semantic predicates  
seem quite difficult to use here and I'm not sure if it's a good idea  
or not; I feel like I am definitely abusing the feature rather than  
using it the way it was intended:

   HTTP
   : {
       (input.LA(1) == 'h' || input.LA(1) == 'H') &&
       (input.LA(2) == 't' || input.LA(2) == 'T') &&
       (input.LA(3) == 't' || input.LA(3) == 'T') &&
       (input.LA(4) == 'p' || input.LA(4) == 'P') &&
       input.LA(5) == ':' &&
       input.LA(6) == '/' &&
       input.LA(7) == '/'
     }?=> ('h' | 'H')('t' | 'T')('t' | 'T')('p' | 'P') '://' ;
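
The chain of LA() checks is easier to sanity-check outside the grammar.
Here's a plain-Python restatement of the same seven-character test
(http_ahead is a made-up helper name, not anything from the ANTLR API):

```python
def http_ahead(text, i=0):
    """True if "http://" (letters case-insensitive) starts at text[i],
    mirroring the LA(1)..LA(7) checks in the gated predicate above."""
    window = text[i:i + 7]
    if len(window) < 7:
        return False
    return (window[0] in "hH" and
            window[1] in "tT" and
            window[2] in "tT" and
            window[3] in "pP" and
            window[4:7] == "://")

print(http_ahead("http://example.com"))   # True: a real URL
print(http_ahead("http is my favorite"))  # False: bare "http"
```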

Finally there is the option of using "filter=true"; that also works,
but it cripples the parser: actions in the parser don't run, @after
blocks don't run, only @init blocks run (there may be other issues
too, but those are the ones I've empirically verified so far). It
seems that Ter never intended that option to be used with a
lexer-plus-parser combination, only with a standalone lexer.

So what's the best solution here? Handling strings like "the http  
protocol" is pretty important for a wiki.

I'm hoping there's something very simple I can do to get the effect
of "filter=true" in the lexer without breaking the parser (I tried
using separate files for the lexer and parser, but I couldn't get it
to work). The readme says: "filter=true for lexers turns on k=1 and
backtracking for every token alternative.  Put the rules in priority
order." I tried doing exactly that (passing filter=true and k=1 via
an options block to each and every lexer rule), but the behaviour
didn't change.
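
One shape I've been wondering about, though I haven't tested it
against this grammar, is making the "://" tail optional and re-typing
the token when it's missing, so the lexer never commits past "http":

   HTTP
       : ('h' | 'H')('t' | 'T')('t' | 'T')('p' | 'P')
         ( '://'
         | { $type = DROSS; }  // no "://" follows: demote to plain text
         )
       ;

Whether a multi-character token demoted to DROSS is acceptable would
depend on how DROSS tokens are consumed downstream.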

Cheers,
Wincent


