[antlr-interest] unexpected behavior in splitter grammar

Thu Jun 24 10:10:53 PDT 2010

Greetings!

On Thu, 2010-06-24 at 16:23 +0200, Scherer Markus wrote:
> I am currently working on a grammar that converts SQL*PLUS scripts into JDBC-compatible statements.
> Basically I am separating the different statements into
> 
> * normal SQL statements (to be JDBC compatible, the trailing semicolon must be removed)
> * PL/SQL statements (or more precisely: statements that require a trailing “END;”)
> * comments (I keep them for metadata or something)
> 
> Furthermore, the SQL*PLUS specific “/” is recognized.
> 
> I simply tried to imitate the behavior of multiline comments, since for now I am not interested in the inner structure of the statements. However, the appended grammar yields many errors like the following while parsing:
> 
> line 4:27 mismatched character 'e' expecting set null
> line 4:29 no viable alternative at character 't'
> line 4:30 mismatched character ';' expecting set null
> line 5:7 mismatched character 'R' expecting set null
> 
> Besides this, the grammar does what I want it to do, but I don’t really trust it.
> I appended a test file that gets recognized the way I want.
> 

I also got similar errors using your grammar and test file, but at
different lines and columns than you report. Perhaps your reported
errors come from some other input file? Also, I do not use ANTLRWorks,
so that may account for the difference (you appear to be using the
ANTLRWorks interpreter? If so, don't).

anyway.....

ANTLR lexers commit themselves to a set of possible tokens as soon as a
valid prefix for that set of tokens is encountered. The lexer will not
back up if the prefix turns out not to be viable.

In your case you have, for example, the keywords INDEX and INSERT, and
your input contains the word INTO. So when the lexer sees the I and N of
INTO, it assumes that either a D or an S should appear next. A T appears
instead, so an error is emitted.

A simple fix is to add a lexer rule that will match any sequence of
letters. The lexer will then be able to recognize that token in cases
similar to the above. So add this lexer rule (at the very end of the
lexer):

WORD : ('a'..'z'|'A'..'Z')+ ;

The lexer will now announce that the input INTO is a WORD token and life
is good...
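
To make the ordering concrete, here is a minimal sketch of how the lexer
rules might be laid out, in ANTLR 3 syntax. Only INDEX, INSERT and WORD
come from this thread; the grammar name SplitterSketch and the WS rule
are assumptions added so the fragment stands alone, and your real
grammar will of course have more rules than this.

```
// Hypothetical stand-alone lexer, for illustration only.
lexer grammar SplitterSketch;

// Keyword rules come first. Without WORD below, the lexer only knows
// tokens starting with "IND" or "INS", so "INT..." is an error.
INDEX  : 'INDEX' ;
INSERT : 'INSERT' ;

// Assumed whitespace rule (not from the original grammar); it sends
// blanks and newlines to the hidden channel.
WS     : (' '|'\t'|'\r'|'\n')+ { $channel = HIDDEN; } ;

// Catch-all for any other run of letters. It is placed last so the
// keyword rules above take precedence when both could match.
WORD   : ('a'..'z'|'A'..'Z')+ ;
```

With this layout INDEX and INSERT should still come out as keyword
tokens, since they are listed before WORD, while INTO falls through to
WORD instead of producing an error.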

Hope this helps...
   -jbb