[antlr-interest] collecting tokens without invoking parser rules...

Alan Lehotsky ALehotsky at ABINITIO.COM
Mon Jan 17 13:40:02 PST 2011

Using Antlr 3.2 with language=C as a target

For parsing Teradata's stored-procedure language (SPL), we have the issue 
of context-sensitive token hiding.

I'm trying to use rules for SQL statements embedded in SPL that just 
swallow the tokens, so we have rules like:

        swallow_to_semi :   ~ (  SEMI  ) * ;

                update_stmt :  UPDATE swallow_to_semi;

We take the stream of tokens from this UPDATE rule and pass them off to an 
existing SQL parser.

But, because SPL has an assignment statement rule that looks like

                assignment_stmt :  SET  dotted_name '='  expression SEMI;

and Teradata SQL uses 'SET' within its own grammar, when I encounter a 
source statement like

               update mytable  set x = y, a = b where a = 'none' ;

I get an error that makes it clear to me that the Antlr parser is 'seeing' 
the 'set' and trying to invoke the assignment_stmt rule, because the 
complaint is about expecting a "SEMI" at the source position where the 
comma is.

I don't think that redirecting EVERYTHING in the lexer after the UPDATE to 
an alternate channel will work in all cases, because there are other 
context sensitivities in play - for example:

SELECT has to read everything to a SEMI when it appears in a statement 
context, but when there is a select clause in a FOR statement, it must 
read up to a USING, FOR, DO or SEMI token.
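
As a sketch of the behaviour described above, here is a minimal standalone 
model in plain C. It is not ANTLR runtime code: the token kinds, the 
EOF-terminated array standing in for the token stream, and the names 
is_stop and collect_until are all hypothetical, chosen only to illustrate 
collecting tokens up to a stop set that depends on context, while leaving 
the stop token itself unconsumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds; a real grammar would use its generated constants. */
enum tok { T_EOF, T_SEMI, T_SELECT, T_USING, T_FOR, T_DO,
           T_UPDATE, T_SET, T_OTHER };

/* Return 1 if kind appears in stops[0..nstops), else 0. */
static int is_stop(enum tok kind, const enum tok *stops, size_t nstops)
{
    for (size_t i = 0; i < nstops; i++)
        if (stops[i] == kind)
            return 1;
    return 0;
}

/*
 * Copy tokens from in[*pos], in[*pos + 1], ... into out until the next
 * token is one of the stop tokens. The stop token is NOT consumed: on
 * success *pos is left pointing at it, so the caller can still match it.
 * The input array must be terminated by T_EOF.
 * Returns the number of tokens collected, or -1 if T_EOF is reached
 * before any stop token or if out (capacity cap) would overflow.
 */
static int collect_until(const enum tok *in, size_t *pos,
                         const enum tok *stops, size_t nstops,
                         enum tok *out, size_t cap)
{
    size_t n = 0;

    assert(in != NULL && pos != NULL && out != NULL);

    while (in[*pos] != T_EOF) {
        if (is_stop(in[*pos], stops, nstops))
            return (int)n;          /* leave the stop token in place */
        if (n == cap)
            return -1;              /* caller's buffer is too small */
        out[n++] = in[(*pos)++];    /* record the token, then advance */
    }
    return -1;                      /* ran off the end without a stop token */
}
```

In statement context the caller would pass a stop set containing only 
T_SEMI, so a T_SET in the middle of an UPDATE is collected like any other 
token; for a select clause inside a FOR statement the caller would pass 
T_USING, T_FOR, T_DO and T_SEMI instead.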

So, what I tried so far was code that looks like 

  static ANTLR3_BOOLEAN semicolonMatch ( pplsqlParser ctx, pANTLR3_VECTOR & tokens)
  {
    pANTLR3_PARSER parser = ctx->pParser;
    pANTLR3_TOKEN_STREAM ts = parser->getTokenStream(parser);
    ANTLR3_INT32 tok;
    if( ! tokens)      // If we didn't have a token list, start one now
      tokens = ctx->vectors->newVector( ctx->vectors);

    if (LA(0) == SEMI) return false; // e.g. "COMMIT ;"

    while( ( tok=LA( 1) ) != EOF)
    {
      switch( tok)
      {
        case SEMI:       return true; 
        case EOF:        return false;
        default:         // any other token: record it, then consume it
          tokens->add( tokens, LT( 1), NULL);
          ts->istream->consume( ts->istream);
          break;
      }
    }
    return false;
  }

And a modified swallow_to_semi rule that looks like

     swallow_to_semi :  tokenlist+=( {semicolonMatch(ctx, $tokenlist) }? ) -> $tokenlist+ ;

but that doesn't work correctly because it seems to preemptively swallow 
the SEMI and a statement like



This feels like something that should be relatively easy to do, but I 
don't seem to be able to figure out exactly how to make it happen and I 
haven't hit upon the right search terms to find an appropriate example in 
the Antlr-interest archives or the Wiki.

