[antlr-interest] case-insensitive parsing
Jim Idle
jimi at temporal-wave.com
Thu Apr 23 08:14:13 PDT 2009
Bob Sole wrote:
> I'm trying to write a parser for PL/SQL package header files but I'm
> banging my head against the wall with a basic problem to do with
> case-insensitive parsing. I'm using Jim Idle's NoCaseFileStream to
> convert tokens into upper case, but I'm finding that the parser gets
> confused when it comes across language keywords that are embedded
> within comments. Here's some example input which has the OR keyword
> embedded within the package comment. The "create or replace package"
> statement is deliberately messed up - the parser handles this
> correctly, but it stumbles against the first 'or' on line 2:
>
> /**
> blah blah or blah
> */
> create Or rePlace PACKAGE
> test IS
>
> Here's the grammar:
>
> grammar Test;
>
> input: statement+ ;
>
> statement: pkgComment | pkgStmt ;
>
> pkgComment: '/**' description '*/' ;
>
> pkgStmt: 'CREATE' ('OR' 'REPLACE') 'PACKAGE' ID ('AS' | 'IS')
> {System.out.println("found package: "+$ID.text); }
> ;
Don't use quoted literals such as 'OR' in parser rules; define them all as
lexer tokens and refer to OR, REPLACE, etc. You will see your problem a lot
more clearly that way.
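
As a rough, untested sketch of that advice (ANTLR v3 syntax, with token names chosen here for illustration), the package rule could be written against named lexer tokens like this. The keyword rules go before ID so that the lexer prefers them when both could match the same text:

```antlr
// Parser rule refers to named tokens instead of quoted literals.
pkgStmt
    : CREATE (OR REPLACE) PACKAGE ID (AS | IS)
      { System.out.println("found package: " + $ID.text); }
    ;

// Keyword tokens, upper case only, because the no-case stream
// presents every character to the lexer in upper case.
CREATE  : 'CREATE' ;
OR      : 'OR' ;
REPLACE : 'REPLACE' ;
PACKAGE : 'PACKAGE' ;
AS      : 'AS' ;
IS      : 'IS' ;

// Must come after the keywords, otherwise ID would claim them.
ID : Letter (Letter | Digit)* ;
```

Written this way, it is easier to see that the word 'or' inside a comment also reaches the parser as an OR token rather than an ID, so a rule that accepts only ID stops there.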
>
> description: (ID {System.out.println("description: ID="+$ID.text);})+ ;
>
> ID: Letter (Letter | Digit)* ;
>
> NUMBER: Digit Digit* ('.' Digit*)? ;
>
> fragment
> Digit: '0'..'9' ;
>
> fragment
> Letter: 'a'..'z' | 'A'..'Z' | '_' ;
>
You don't need the lower case elements of this rule as the comparison
will only ever see UPPER case ;-)
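
A minimal sketch of that simplification, assuming the input always goes through the case-folding stream (without it, lower-case input would no longer match):

```antlr
// Assumes the character stream reports every character in upper case
// when the lexer looks ahead, so the 'a'..'z' range is never needed.
fragment
Letter: 'A'..'Z' | '_' ;
```

The token text itself keeps its original case, as the "description: ID=blah" lines in the quoted output show.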
> NL: ('\r'? '\n') { skip();} ;
> WS: (' '| '\t') {skip();} ;
>
> EVERYTHING_ELSE: . ;
>
>
> I get the following output which shows that the pkgStmt parsing is ok,
> but the pkgComment isn't working:
>
> line 11:2 mismatched character '-' expecting '*'
Your pkgComment parser rule is only catering for ID and not the other
fluff such as EVERYTHING_ELSE that might be in there. The comment rule
should really be a complete lexer rule. Remember that the lexer runs
first and turns everything into tokens, then the parser runs. You cannot
get the parser to influence what the lexer returns. This error is from
your lexer, saying that it does not know what to do with the '-'
character. You must have this in your input somewhere, in fact at line
11 as the third character on the line ;-)
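
One way to write the comment as a lexer rule in ANTLR v3 is sketched below. The rule name COMMENT is hypothetical, and sending the token to the hidden channel assumes the parser does not need the comment text. It would replace the pkgComment and description parser rules, leaving statement with only pkgStmt:

```antlr
// Matches '/*' or '/**' up to the first '*/', whatever characters
// lie in between (letters, keywords, punctuation, newlines).
COMMENT
    : '/*' ( options { greedy = false; } : . )* '*/'
      { $channel = HIDDEN; }   // keep the token but hide it from the parser
    ;
```

Because the whole comment becomes a single token, a keyword such as 'or' inside it never reaches the parser.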
> description: ID=blah
> description: ID=blah
> line 2:10 mismatched input 'or' expecting '*/'
> found package: test
>
> I'm slowly working my way through the book, and I've looked through
> the wiki FAQs and postings here but haven't found anything that'll
> help me (that I can understand, at least!) - any suggestions would be
> most appreciated!
I think that your best bet is to look through the examples thoroughly
before trying your own grammar. This is really just happening to you
because you are not familiar with it all just yet. Jumping in too early
might discourage you. There is usually an "Ahhhh!" moment and you just
haven't quite arrived there yet :-)
What is happening here is that your parser is dropping out of the
pkgComment rule because there is the word 'or' in your comment. Your
package header rule is actually working.
But:
1) Write your lexer first (or at least as much as you can think of), and
imagine that the lexer must cater for all sequences of characters
because it runs first. When it does not, you get your 'mismatched
character' stuff.
2) Create a token in your lexer for everything that you need in the
parser. For now, do not use 'xxxx' in the parser as it will be difficult
for you to visualize until you have the gestalt.
3) Make sure that your parser allows any valid token sequence. For
instance, a lot more than ID can come through in your pkgComment rule,
and unless there are things in the comment that you need to see (in
which case you possibly want an island grammar), then this should
probably be a lexer rule.
Jim