[antlr-interest] Lookahead question

jose.sanleandro at ventura24.es jose.sanleandro at ventura24.es
Wed Dec 8 09:45:28 PST 2004



Hi all,

I've been trying to learn ANTLR well enough to add it to my toolbox. I have 
to say that it's been a challenging task and, although I've been successful 
in some cases, I know I'm still missing some important points.

Recently I needed to write a "parser" able to "understand" the output format 
of "rlog" (used by "cvs log"). The format is simple and most of its parts are 
fixed, so I thought such an easy grammar would make good practice.
In the end, I had to resort to some "strange" approaches to make it work the 
way I want: understanding and debugging the generated code, and explicitly 
fixing the lookahead to 1.
The reason is that some of the grammar rules allow arbitrary text, which 
occasionally triggered conflicts with literals. I thought the lookahead option 
was the only means I had to defeat such conflicts.
Finally, I decided to use a lookahead of one character and explicitly resolve 
the conflicts myself. That ended up in a grammar which hardly looks like one :(. 
Take a look at a fragment:

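STARTS_WITH_B:
    'b'
    {
      // 'b' has already been matched; check whether "ranch:" follows.
      if (   (LA(1) == 'r')
          && (LA(2) == 'a')
          && (LA(3) == 'n')
          && (LA(4) == 'c')
          && (LA(5) == 'h')
          && (LA(6) == ':'))
      {
        // Reserved word "branch:": delegate to RESERVED_BRANCH and retype.
        mRESERVED_BRANCH(false);
        $setType(LITERAL_BRANCH);
      }
      else
      {
        // Anything else starting with 'b' is an ordinary STRING.
        mSTRING(false);
        $setType(STRING);
      }
    }
    ;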

Basically, there's a non-protected rule for each letter that starts a reserved 
word of the grammar, to guide the lexer in ambiguous situations.
I tried to use syntactic predicates, but after spending some time I couldn't 
get ANTLR to generate the code I wanted, in the order I wanted.
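
To make clearer what the inline checks above amount to, here is a minimal 
sketch in plain Java (outside ANTLR) of the same decision written as a 
table-driven helper. Everything in it is hypothetical and only for 
illustration: the class KeywordLookahead, the methods la, looksLike and 
classify, and the token type values are assumptions, not ANTLR API and not 
the generated lexer code.

```java
// Hypothetical sketch: none of these names belong to ANTLR.
final class KeywordLookahead {
    // Assumed token type ids, standing in for the grammar's STRING and LITERAL_BRANCH.
    static final int STRING = 0;
    static final int LITERAL_BRANCH = 1;

    // Reserved words and their token types; only "branch:" comes from the fragment above.
    private static final String[] RESERVED = { "branch:" };
    private static final int[] RESERVED_TYPES = { LITERAL_BRANCH };

    /** Rough equivalent of LA(i): the i-th character ahead of pos (1-based), or -1 past the end. */
    static int la(String input, int pos, int i) {
        int index = pos + i - 1;
        return index < input.length() ? input.charAt(index) : -1;
    }

    /** True if the characters starting at pos spell exactly the given word. */
    static boolean looksLike(String input, int pos, String word) {
        for (int i = 1; i <= word.length(); i++) {
            if (la(input, pos, i) != word.charAt(i - 1)) {
                return false;
            }
        }
        return true;
    }

    /** Returns the token type for the text starting at pos: a reserved word if one matches, else STRING. */
    static int classify(String input, int pos) {
        for (int k = 0; k < RESERVED.length; k++) {
            if (looksLike(input, pos, RESERVED[k])) {
                return RESERVED_TYPES[k];
            }
        }
        return STRING;
    }
}
```

With this shape, adding another reserved word means adding a table entry 
rather than another chain of LA(x) comparisons, though it is still a manual 
check that the lexer generator knows nothing about.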

I've used the lexer just to split words and distinguish them by assigning 
different token types. To me, its role is similar to a specialized SAX parser 
which creates ANTLR objects (tokens) and optionally runs custom logic defined 
in the grammar itself. If it fails, the input is not "valid".
The parser, on the other hand, expects the correct tokens in the correct 
order, following certain rules. It optionally creates DOM-like structures. If 
it fails, the input is not "well-formed".
Finally, the tree parser just processes that object hierarchy (defined by the 
parser), and provides the kind of features that XPath or XSL stylesheets 
offer. Is the analogy valid?

Moreover, what is the main drawback of explicitly resolving the lexer's 
ambiguous situations using inline LA(x) checks?

Thank you.
Jose.





 