[antlr-interest] Re: yet another syntactic predicate problem
antlrlist
antlrlist at yahoo.com
Fri Jun 6 11:23:40 PDT 2003
If I have understood correctly, your problem is not in the parser,
but in the lexer.
At first I couldn't understand your PCDATA rule because of your
indentation style, which is quite different from the one I use.
After a bit of "beautifying", the rule looks like this:
PCDATA
    :   ( { LA(2)!='.' }? ('a'|'b'|'c'|'d'|'e') )
    |   ~('a'|'b'|'c'|'d'|'e'|'<'|'>')
        (   options { generateAmbigWarnings=false; }
        :   '\r' '\n'   { newline(); }
        |   '\r'        { newline(); }
        |   '\n'        { newline(); }
        |   ~('<'|'\n'|'\r'|'"'|'>')
        )*
    ;
This rule has two alternatives. The first is taken when LA(1) is in
{a,b,c,d,e} and LA(2)!='.'.
The second is taken when LA(1) is NOT in {a,b,c,d,e,<,>}, so it is
the alternative that fires on " B. title": the leading space belongs
to that set.
In other words, PCDATA is recognized, through its second alternative,
when the input " B. title" is given to the lexer.
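The decision described above can be imitated outside ANTLR. What follows is only a rough Python sketch, not generated lexer code: the function name pcdata_alternative is made up for illustration, and it assumes the lexer compares characters case-insensitively (the grammar lists 'a'..'e' while the sample inputs use "A." and "B."). It looks at the first two characters only and reports which PCDATA alternative would be chosen.

```python
TOPIC_LETTERS = set("abcde")   # letters that may start a TOPICID
TAG_CHARS = {"<", ">"}         # excluded from both PCDATA alternatives


def pcdata_alternative(text):
    """Return 1 or 2 for the PCDATA alternative chosen on `text`, else None.

    Assumption: matching is case-insensitive, so LA(1) is lowercased.
    """
    if not text:
        return None
    la1 = text[0].lower()
    la2 = text[1] if len(text) > 1 else ""   # "" stands in for end of input
    if la1 in TOPIC_LETTERS:
        # First alternative, guarded by the predicate { LA(2)!='.' }?
        return 1 if la2 != "." else None
    if la1 not in TAG_CHARS:
        # Second alternative: anything that is not a..e, '<' or '>'
        return 2
    return None


for sample in ("A. some text here", "A normal text", " B. title"):
    print(repr(sample), "->", pcdata_alternative(sample))
```

Under these assumptions "A. some text here" is rejected by PCDATA (leaving it to TOPICID), "A normal text" takes the first alternative, and " B. title" takes the second, because the decision is made on the leading space before the "B." is ever seen.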
I found the problem, but not the solution. Sorry. Still, I'll venture
some advice:
1. You can try using two different lexers, one for tags and another
for text, sharing the input state.
2. Don't combine the optional operator '?' with '+'. Use '*' instead:
it accepts the same input, but the implementation is a bit different,
and sometimes that difference makes things work.
3. Indent your grammars a bit more, please! :)
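As a small check on point 2, here is a hedged sketch using Python regular expressions as a stand-in for the grammar (it is not ANTLR code, and the pattern names are invented for the example). It only shows that "optional one-or-more" and "zero-or-more" accept the same inputs; it says nothing about how the code ANTLR generates for the two forms differs.

```python
import re

# "(spaces)? TOPICID" with spaces defined as one-or-more whitespace
OPTIONAL_PLUS = re.compile(r"(?:[ \t]+)?[a-e]\.")
# The same thing written with zero-or-more whitespace
STAR = re.compile(r"[ \t]*[a-e]\.")

samples = ["a.", " b.", "\t  c.", "x.", "  ", ""]
for s in samples:
    with_optional = bool(OPTIONAL_PLUS.fullmatch(s))
    with_star = bool(STAR.fullmatch(s))
    print(repr(s), with_optional, with_star)
```

Both patterns agree on every sample, which is why replacing '?' around a '+' loop with a single '*' loop does not change what the rule accepts.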
Cheers,
Enrique
--- In antlr-interest at yahoogroups.com, "pcristip" <pcristip at y...> wrote:
> Hi,
>
> I'd really appreciate some help on the following problem:
>
> I have a grammar that is supposed to parse the body of a html file,
> and everything is ok except the fact that the data between the tags
> should be of two kinds:
> 1. a topic which is normal text except that it should start with a
> letter (in fact only a through e letters) followed by a '.' char
> (e.g." A.")
> 2. normal cdata which is the case if the text is not a topic
>
> I managed to make this work except for the case when the text starts
> with spaces.
>
> Now I have something like:
> (in the parser)
> topic : TOPICID^ topicbody
> ;
>
> topicbody :
> ( options { greedy=true; }
> :
> text | font
> )*
> ;
> text : PCDATA ;
>
>
> (in the lexer)
>
> TOPICID
> :
> ('a' | 'b' | 'c' | 'd' | 'e') '.'
> ;
>
>
> PCDATA
> :
> ({ LA(2)!='.' }?
> ('a'|'b'|'c'|'d'|'e')) | ~('a'|'b'|'c'|'d'|'e'|'<'|'>')
> (
> options {
>
> generateAmbigWarnings=false;
> }
> : '\r' '\n'
> {newline();}
> | '\r'
> {newline();}
> | '\n'
> {newline();}
> | ~('<'|'\n'|'\r'|'"'|'>')
> )*
> ;
>
>
> which works ok for texts like "A. some text here" and "A normal text"
> but if there are spaces in front like " B. title" then the text is
> matched as data not as a topic.
>
> I tried to solve this by modifying the topic rule but no luck. And I
> thought that the best solution would be to use syntactic predicates
> because the lookahead is not fixed in this case (the number of spaces
> can be arbitrary before you can tell which rule to match).
> So I got to this construct:
>
> topic_or_answer : ((spaces)? TOPICID) => topic
> | text
> ;
>
> spaces : WS ;
>
> (and in lexer)
> protected
> WS : (
> options {
>
> generateAmbigWarnings=false;
> }
> : ' '
> | '\t'
> | '\n' { newline(); }
> | "\r\n" { newline(); }
> | '\r' { newline(); }
> )+
> ;
>
>
> which doesn't work (otherwise you wouldn't read these lines :) ).
>
> Can someone give me a hint ? What did I do wrong ?
>
> Thanks,
> Chris