[antlr-interest] case-insensitive parsing

Thu Apr 23 08:14:13 PDT 2009

Bob Sole wrote:
> I'm trying to write a parser for PL/SQL package header files but I'm 
> banging my head against the wall with a basic problem to do with 
> case-insensitive parsing. I'm using Jim Idle's NoCaseFileStream to 
> convert tokens into upper case, but I'm finding that the parser gets 
> confused when it comes across language keywords that are embedded 
> within comments. Here's some example input which has the OR keyword 
> embedded within the package comment. The "create or replace package" 
> statement is deliberately messed up - the parser handles this 
> correctly, but it stumbles against the first 'or' on line 2:
>
> /**
> blah blah or blah
> */
> create Or rePlace PACKAGE
> test IS
>
> Here's the grammar:
>
> grammar Test;
>
> input: statement+ ;
>
> statement: pkgComment | pkgStmt ;
>
> pkgComment: '/**' description '*/' ;
>
> pkgStmt: 'CREATE' ('OR' 'REPLACE') 'PACKAGE' ID ('AS' | 'IS')
>                {System.out.println("found package: "+$ID.text); }
>         ;
Don't use quoted literals such as 'OR' in parser rules; move them all to 
named lexer tokens and use OR, REPLACE, etc. You will see your problem a 
lot more clearly that way.
>
> description: (ID {System.out.println("description: ID="+$ID.text);})+ ;
>
> ID: Letter (Letter | Digit)* ;
>
> NUMBER: Digit Digit* ('.' Digit*)? ;
>
> fragment
> Digit: '0'..'9' ;
>
> fragment
> Letter: 'a'..'z' | 'A'..'Z' | '_' ;
>
You don't need the lower-case range in this rule, as the comparison 
will only ever see UPPER case ;-)
> NL: ('\r'? '\n') { skip();} ;
> WS: (' '| '\t') {skip();} ;
>
> EVERYTHING_ELSE: . ;
>
>
> I get the following output which shows that the pkgStmt parsing is ok, 
> but the pkgComment isn't working:
>
> line 11:2 mismatched character '-' expecting '*'
Your pkgComment parser rule is only catering for ID and not the other 
fluff such as EVERYTHING_ELSE that might be in there. The comment rule 
should really be a complete lexer rule. Remember that the lexer runs 
first and turns everything into tokens, then the parser runs. You cannot 
get the parser to influence what the lexer returns. This error is from 
your lexer, saying that it does not know what to do with the '-' 
character. You must have this in your input somewhere, in fact at line 
11 as the third character on the line ;-)

> description: ID=blah
> description: ID=blah
> line 2:10 mismatched input 'or' expecting '*/'
> found package: test
>
> I'm slowly working my way through the book, and I've looked through 
> the wiki FAQs and postings here but haven't found anything that'll 
> help me (that I can understand, at least!) - any suggestions would be 
> most appreciated!
I think that your best bet is to look through the examples thoroughly 
before trying your own grammar.  This is really just happening to you 
because you are not familiar with it all just yet. Jumping in too early 
might discourage you. There is usually an "Ahhhh!" moment and you just 
haven't quite arrived there yet :-)

What is happening here is that your parser is dropping out of the 
pkgComment rule because the word 'or' in your comment is lexed as the 
'OR' keyword token rather than as an ID. Your package header rule is 
actually working.

But:

1) Write your lexer first (or at least as much as you can think of), and 
imagine that the lexer must cater for all sequences of characters 
because it runs first. When it does not, you get errors like your 
'mismatched character' one.
2) Create a token in your lexer for everything that you need in the 
parser. For now, do not use quoted literals ('xxxx') in the parser, as 
the token stream will be difficult for you to visualize until you have 
the gestalt.
3) Make sure that your parser allows any valid token sequence. For 
instance, a lot more than ID can come through in your pkgComment rule, 
and unless there are things in the comment that you need to see (in 
which case you possibly want an island grammar), then this should 
probably be a lexer rule.
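
Putting the three points together, a reworked grammar might look roughly 
like the sketch below. It is only a sketch for ANTLR 3 under a few 
assumptions: the input is still read through the upper-casing 
NoCaseFileStream, the comment text is not needed by the parser (so it is 
sent to the hidden channel), and the COMMENT rule name and the keyword 
token names are illustrative choices, not something taken from the 
original grammar.

```
grammar Test;

// Parser rules refer only to named tokens, never to quoted literals.
input   : pkgStmt+ ;

pkgStmt : CREATE OR REPLACE PACKAGE ID (AS | IS)
          { System.out.println("found package: " + $ID.text); }
        ;

// Keyword tokens. They are listed before ID so that they win when the
// same text could match either rule. Upper case only, on the assumption
// that the case-insensitive stream upper-cases the lookahead.
CREATE  : 'CREATE' ;
OR      : 'OR' ;
REPLACE : 'REPLACE' ;
PACKAGE : 'PACKAGE' ;
AS      : 'AS' ;
IS      : 'IS' ;

// The whole comment is one lexer token, so an 'or' inside it is never
// seen as a keyword. HIDDEN keeps it out of the parser's token stream.
COMMENT : '/*' ( options { greedy = false; } : . )* '*/'
          { $channel = HIDDEN; }
        ;

ID      : Letter (Letter | Digit)* ;

NUMBER  : Digit Digit* ('.' Digit*)? ;

fragment
Digit   : '0'..'9' ;

fragment
Letter  : 'A'..'Z' | '_' ;

NL      : ('\r'? '\n') { skip(); } ;
WS      : (' ' | '\t') { skip(); } ;

// Catch-all so the lexer has a rule for every remaining character.
EVERYTHING_ELSE : . ;
```

If you do need the description text from the '/**' comments, this 
COMMENT rule is the place where an island grammar (or an action that 
inspects the token text) would come in instead of hiding the token.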

Jim