[antlr-interest] Re: yet another syntactic predicate problem

Fri Jun 6 11:23:40 PDT 2003

If I have understood correctly, your problem is not in the parser but in 
the lexer.

At first I couldn't understand your PCDATA rule because of its 
indentation, which is quite different from the one I use. 
After a bit of "beautifying", the rule looks like this:

PCDATA
  : ( { LA(2)!='.' }? ('a'|'b'|'c'|'d'|'e') )
  | ~('a'|'b'|'c'|'d'|'e'|'<'|'>')
     ( options { generateAmbigWarnings=false; }
       : '\r' '\n' { newline(); }
       | '\r'      { newline(); }
       | '\n'      { newline(); }
       | ~('<'|'\n'|'\r'|'"'|'>')
     )*
  ;

This rule has two alternatives. The first is taken when LA(1) is in 
{a,b,c,d,e} and LA(2)!='.'.
The second is taken when LA(1) is NOT in {a,b,c,d,e,<,>}. That is the 
alternative that fires on " B. title", because the leading space is in 
that set.

This means that the lexer recognizes PCDATA, through its second 
alternative, when it is given the input " B. title". The loop in that 
alternative then consumes every character up to the next '<', '"' or 
'>', so the "B." is swallowed into the PCDATA token and TOPICID never 
gets a chance to match.
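
To make the reasoning above concrete, here is a rough Python simulation 
of how the PCDATA rule chooses its alternative and how much input it 
then consumes. It is only a sketch, not ANTLR output: the function names 
are made up for illustration, and it assumes the lexer runs with 
caseSensitive=false (which would explain why "A." matches 'a'), 
something the posted grammar does not show.

```python
TOPIC_LETTERS = set("abcde")
LOOP_STOPPERS = set('<">')  # the (...)* loop accepts every other character


def predict_alternative(text):
    """Return 1, 2 or None: which PCDATA alternative is predicted at text[0]."""
    s = text.lower()  # assumption: caseSensitive=false, so LA() sees lowercase
    la1, la2 = s[0:1], s[1:2]
    if not la1:
        return None
    if la1 in TOPIC_LETTERS:
        # Alternative 1 is guarded by the semantic predicate LA(2) != '.'
        return 1 if la2 != "." else None
    if la1 not in ("<", ">"):
        return 2
    return None


def match_pcdata(text):
    """Return (alternative, matched_text), or None if PCDATA cannot start here."""
    alt = predict_alternative(text)
    if alt is None:
        return None
    if alt == 1:
        return (1, text[:1])  # alternative 1 is a single letter
    i = 1  # alternative 2: one character, then the loop
    while i < len(text) and text[i] not in LOOP_STOPPERS:
        i += 1
    return (2, text[:i])


print(match_pcdata("B. title"))   # PCDATA refuses; TOPICID can match "B."
print(match_pcdata(" B. title"))  # alternative 2 takes the whole string
```

With the leading space, alternative 2 takes the entire string, including 
the "B." that was meant to become a TOPICID.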

I found the problem but not the solution, sorry. Still, I will venture 
some advice:

1. You could try using two different lexers, one for tags and the 
other for text, sharing the input state.
2. Don't use the optional operator '?' in combination with '+'. Use '*' 
instead: the implementation is a bit different, and sometimes that 
difference makes things work.
3. Indent your grammars a bit more, please! :)
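
As an illustration of the first suggestion, here is a toy Python model 
of two scanners that share one input position. It does not use ANTLR's 
API (in ANTLR 2 the corresponding mechanism would presumably be the 
multiple-lexer support with a shared input state), and I have not tried 
it against your grammar. All names are hypothetical, and the text 
scanner's rule for recognizing a topic id after leading whitespace is an 
assumption about what you want.

```python
import re

# Assumption: a topic id is optional whitespace, a letter a-e, then '.'
TOPIC_RE = re.compile(r"\s*[a-eA-E]\.")


class SharedInput:
    """The single cursor that both scanners read from and advance."""

    def __init__(self, text):
        self.text = text
        self.pos = 0


def scan_tag(state):
    """Tag scanner: consume '<...>' (raises ValueError if '>' is missing)."""
    end = state.text.index(">", state.pos) + 1
    tok = ("TAG", state.text[state.pos:end])
    state.pos = end
    return tok


def scan_text(state):
    """Text scanner: try a topic id first, otherwise plain character data."""
    m = TOPIC_RE.match(state.text, state.pos)
    if m:
        state.pos = m.end()
        return ("TOPICID", m.group().strip())
    end = state.pos
    while end < len(state.text) and state.text[end] != "<":
        end += 1
    tok = ("PCDATA", state.text[state.pos:end])
    state.pos = end
    return tok


def tokens(text):
    """Pick a scanner from the next character; both advance the same cursor."""
    state = SharedInput(text)
    out = []
    while state.pos < len(text):
        scanner = scan_tag if text[state.pos] == "<" else scan_text
        out.append(scanner(state))
    return out


print(tokens("<p>   B. title</p>"))
```

Because the text scanner only has to deal with text, it can look past 
the leading spaces before deciding between a topic id and plain data.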

Cheers,

Enrique

--- In antlr-interest at yahoogroups.com, "pcristip" <pcristip at y...> wrote:
> Hi,
> 
> I'd really appreciate some help on the following problem:
> 
> I have a grammar that is supposed to parse the body of a html file, 
> and everything is ok except the fact that the data between the tags 
> should be of two kinds:
> 1. a topic which is normal text except that it should start with a 
> letter (in fact only a through e letters) followed by a '.' char 
> (e.g." A.")
> 2. normal cdata which is the case if the text is not a topic
> 
> I managed to make this work except for the case when the text starts 
> with spaces.
> 
> Now I have something like:
> (in the parser)
> topic		:	TOPICID^ topicbody
> 			;
> 
> topicbody	:
> 				(	options { greedy=true; }
> 					:
> 					text | font
> 				)*
> 			;
> text		:	PCDATA ;
> 
> 
> (in the lexer)
> 
> TOPICID
> 			:
> 				('a' | 'b' | 'c' | 'd' | 'e') '.'
> 			;
> 
> 
> PCDATA
> 			:
> 				({ LA(2)!='.' }? ('a'|'b'|'c'|'d'|'e'))	| ~('a'|'b'|'c'|'d'|'e'|'<'|'>')
> 				(
> 					options {
> 						generateAmbigWarnings=false;
> 					}
> 				:	'\r' '\n'	{newline();}
> 				|	'\r'		{newline();}
> 				|	'\n'		{newline();}
> 				|	~('<'|'\n'|'\r'|'"'|'>')
> 				)*
> 			;
> 
> 
> which works ok for texts like "A. some text here" and "A normal text" 
> but if there are spaces in front like "   B. title" then the text is 
> matched as data not as a topic.
> 
> I tried to solve this by modifying the topic rule but no luck. And 
> thought that the best solution would be to use syntactic predicates 
> because the lookahead is not fixed in this case (the number of spaces 
> can be arbitrary before you can tell which rule to match).
> So I got to this construct:
> 
> topic_or_answer : ((spaces)? TOPICID) => topic
>                 |  text
>                 ;
> 
> spaces          : WS ;
> 
> (and in lexer)
> protected
> WS			:	(
> 					options {
> 						generateAmbigWarnings=false;
> 					}
> 				:	' '
> 				|	'\t'
> 				|	'\n'	{ newline(); }
> 				|	"\r\n"	{ newline(); }
> 				|	'\r'	{ newline(); }
> 				)+
> 			;
> 
> 
> which doesn't work (otherwise you wouldn't read these lines :) ).
> 
> Can someone give me a hint ? What did I do wrong ?
> 
> Thanks,
> Chris
