[antlr-interest] Oracle SQL Parser

Thu Jun 14 23:24:30 PDT 2012

Hi Rick,

You're welcome.
I took the liberty to CC the list for you.

Cheers,

Bart.

On Fri, Jun 15, 2012 at 12:18 AM, Brown, Rick <RBROWN at allstate.com> wrote:

> **
> Hi Bart,
>
> Thanks again for the reply, and thanks for taking the time to create a
> test grammar using the "comment" definition.  As you ably demonstrate, the
> grammar is fine, so I did some more digging to find the problem.  It turns
> out to be related to my PowerShell script.  I was using the Get-Content
> cmdlet to obtain the text from a sql file.  However, this cmdlet does not
> preserve newlines; it in fact creates an array of strings, one for each
> line of text.  When this is passed to the parser, the parser ends up seeing
> just one long string, with no newlines included, so the end of the comment
> is never found until it reaches end-of-file.
>
> The fix was to change this line:
>    $sql = Get-Content $sqlFileName
>
> to this line:
>    $sql = [system.io.file]::readalltext($sqlFileName)
>
> and everything now works fine.
>
> Thank you very much for your help!
>
> ~ Rick
>
> p.s. I would post this response to the group, but I am not sure how to
> properly send the email in a way that will associate it as being a response
> to a particular message.  Do I just use the same Subject line above and
> send it, or do I need to do something else?
>
>  ------------------------------
> *From:* Bart Kiers [mailto:bkiers at gmail.com]
> *Sent:* Thursday, June 14, 2012 4:54 AM
> *To:* Brown, Rick
> *Cc:* antlr-interest at antlr.org interest
>
> *Subject:* Re: [antlr-interest] Oracle SQL Parser
>
> Hi Rick,
>
> You can test the fact that the comments work by running the following
> grammar through ANTLRWorks' debugger:
>
> grammar T;
>
> parse
>  : select EOF
>  ;
>
> select
>  : Select column=Id From table=Id {System.out.println("column=" +
> $column.text + ", table=" + $table.text);}
>  ;
>
> Select : 'select';
> From   : 'from';
> Id     : ('a'..'z' | 'A'..'Z')+;
>
> Comment
>  : '--' ~('\n'|'\r')* '\r'? '\n' { $channel=HIDDEN; }
>  | '//' ~('\n'|'\r')* '\r'? '\n' { $channel=HIDDEN; }
>  | '/*' ( options {greedy=false;} : . )* '*/' { $channel=HIDDEN; }
>  ;
>
> Space
>  : (' ' | '\t' | '\r' | '\n')+ {skip();}
>  ;
>
> Simply press CTRL+D in ANTLRWorks and use the following input to debug
> (also make sure ANTLRWorks starts with the 'parse' rule!):
>
> select
> // comment 1
> dt
> -- comment 2
> from
> /* comment 3 */
> dual
>
> You will then see the following output (you need to select the "output"
> tab to see it):
>
> column=dt, table=dual
>
> Regards,
>
> Bart.
>
>
> On Wed, Jun 13, 2012 at 9:31 PM, Brown, Rick <RBROWN at allstate.com> wrote:
>
>> **
>> Bart,
>>
>> Thanks very much for the reply!  To provide some more context, I am
>> trying to parse Oracle SQL files using a combination of PowerShell and a
>> parser from a project by MacroScope.  The basic approach I am using is
>> described in this article:
>> http://sev17.com/2010/04/parsing-sql-for-table-names/.
>>
>> The MacroScope parser is based on Anltr3, and the project includes the
>> grammar file as well as the generated parser and lexer files.  I
>> contacted MacroScope, and they confirmed that they did not include support
>> for comments at all in their implementation (it is rather old).  So, I set
>> about trying to find a replacement Oracle SQL grammar file that included
>> comments, which led me to the file I mentioned in my original post.
>>
>> I am generating the lexer and parser files using the above grammar, and
>> then adding them to the build of the MacroScope parser.  I then use the
>> resulting MacroScope classes in a PowerShell script to extract the names
>> of the tables and columns in a given sql file.
>>
>> Here are a few examples of what happens when I use this method for
>> parsing a sql file:  The first file looks like this:
>>
>> select sysdate dt
>> /* This is a comment */
>>   from dual
>>
>> Running my PowerShell/MacroScope script, I get the column name as 'dt'
>> and the table name as 'dual', which is what I would expect.
>>
>> If I run the same script using this sql file:
>>
>> select sysdate dt
>> // This is a comment
>>   from dual
>>
>> I get the column name as 'dt', but I do not get any table name.
>>
>> If I run the script using this sql file:
>>
>> -- This is a comment
>> select sysdate dt from dual
>>
>> The script generates an error indicating that there is no data to parse.
>>
>> My assumption is that everything after the start of the comment is being
>> included as part of the comment, all the way through the end of the file,
>> rather than stopping at the end of the line.  Again, I am not an expert, so
>> this is only my assumption based on the behavior I am seeing.  It could
>> be that the problem is something else entirely, but it seems odd that the
>> multi-line style of comments (/* */) is processed correctly, but the
>> single-line style (// or --) is not.
>>
>> Thanks again for your help!
>>
>> ~ Rick
>>
>>  ------------------------------
>> *From:* Bart Kiers [mailto:bkiers at gmail.com]
>> *Sent:* Wednesday, June 13, 2012 3:30 AM
>> *To:* Brown, Rick
>> *Cc:* antlr-interest at antlr.org
>> *Subject:* Re: [antlr-interest] Oracle SQL Parser
>>
>>   On Wed, Jun 13, 2012 at 12:20 AM, Brown, Rick <RBROWN at allstate.com>wrote:
>>
>>> ... but the problem seems to be that the
>>> comment is not being ended at the end of a line; instead, it seems to
>>> consider everything that follows, the entire rest of the file, as part
>>> of the comment. ...
>>>
>>
>> I don't see anything wrong with the rule. The Comment rule has 3
>> alternatives:
>>
>> Comment
>>   :  '--' ~('\n'|'\r')* '\r'? '\n' { $channel=HIDDEN;
>> }                         // alternative 1
>>   |  '//' ~('\n'|'\r')* '\r'? '\n' { $channel=HIDDEN;
>> }                         // alternative 2
>>   |  '/*' ( options {greedy=false;} : . )* '*/' { $channel=HIDDEN; }  //
>> alternative 3
>>   ;
>>
>> Alternative 1 and 2 must end with a line break, and alternative 3 must
>> end with '*/'.
>>
>> Could you tell how you came to the conclusion that there's a problem with
>> the rule? Do you have input that gets tokenized incorrectly? If so, could
>> you send it to the list as well?
>>
>> Regards,
>>
>> Bart.
>>
>
>