[antlr-interest] Lexer doesn't agree with me (gives other tokens than I need)

Sun Apr 12 07:20:44 PDT 2009

OK, I think I see where you are having problems.

Leaving aside the AST output for a moment, you should have a look at the way SQL handles what are termed value expressions, which are the basic containers for specific types of expression (character, boolean, numeric) as well as for primaries like column names and literals.

The way it's defined in the SQL spec (and this rather simplifies things) is that you have a character_value_expression that handles the string concatenation operator and then resolves (via a chain of subrules) to either a parenthesized value_expression, a nonparenthesized_primary (column names, literals, a subquery, etc.) or a character_value_function like RIGHT.

Functions are defined something like:

RIGHT LPAREN character_value_expression  ( COMMA numeric_value_expression )? RPAREN

That way you recursively parse the (arbitrarily deeply nested) expressions embedded within the function and will correctly handle things like RIGHT(LEFT(col_name, 1), (2-1)).  If you set it up this way there is no ambiguity at all with respect to the From Clause outer join syntax.
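As a rough ANTLR 3 sketch of that chain (not the spec's actual BNF): CONCAT is an assumed token for the concatenation operator, the rule names are borrowed loosely from the SQL spec, the parenthesized alternative is narrowed to character expressions for brevity, and nonparenthesized_primary and numeric_value_expression are assumed to be defined elsewhere in the grammar.

```antlr
// Fragment only: CONCAT, nonparenthesized_primary and numeric_value_expression
// are assumptions and must be supplied by the surrounding grammar.
character_value_expression
    :   character_primary ( CONCAT character_primary )*    // string concatenation
    ;

character_primary
    :   LPAREN character_value_expression RPAREN           // parenthesized expression
    |   character_value_function                           // LEFT(...), RIGHT(...)
    |   nonparenthesized_primary                           // column names, literals, ...
    ;

character_value_function
    :   ( LEFT | RIGHT ) LPAREN character_value_expression
        ( COMMA numeric_value_expression )? RPAREN          // recurses into the argument
    ;
```

Because character_value_function calls back into character_value_expression, nested calls such as the RIGHT(LEFT(...)) example fall out of the recursion, and LEFT and RIGHT remain ordinary keyword tokens throughout.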

You don't need sophistication in the lexer.  You need the sophistication in the parser.  
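One way this could look for the function rule quoted further down, as an untested sketch in ANTLR 3 syntax that assumes output=AST: functionName is a hypothetical helper rule (not part of the original grammar) that accepts the LEFT and RIGHT keyword tokens in function position and retypes them in a rewrite, so the tree only ever holds a NonQuotedIdentifier under FUNCTION.

```antlr
// Fragment only: FUNCTION, FUNCTIONARGUMENTS, functionArgument and the tokens
// are taken from the grammar quoted in this thread; functionName is hypothetical.
function
    :   functionName LPAREN ( functionArgument ( COMMA functionArgument )* )? RPAREN
        -> ^( FUNCTION functionName ^( FUNCTIONARGUMENTS functionArgument+ )? )
    ;

functionName
    :   NonQuotedIdentifier
    |   LEFT  -> NonQuotedIdentifier[$LEFT]     // new node of type NonQuotedIdentifier, text copied from LEFT
    |   RIGHT -> NonQuotedIdentifier[$RIGHT]    // same for RIGHT
    ;
```

The lexer still emits LEFT and RIGHT everywhere, so joinOn is unaffected; only the node built for the function name changes type.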

Hope that makes sense.

If you google around, the BNF grammars for SQL 99, 2003 and 2008 are out there (at least in draft). 

Alex

Alexander Brown
Partner | Analytics 8 | Tel +61 2 9299 4430 | Mob +61 424 043 485| abrown at analytics8.com | www.analytics8.com

________________________________

From: Bill Mayfield [mailto:antlrinterest at gmail.com]
Sent: Sun 4/12/2009 22:36
To: Alexander Brown
Subject: Re: Lexer doesn't agree with me (gives other tokens than I need)

Hi Alexander,

Here is the definition of my "function" rule

function
    :    // LEFT and RIGHT keywords are also function names
        ( NonQuotedIdentifier | LEFT | RIGHT ) LPAREN ( functionArgument ( COMMA functionArgument )* )? RPAREN
    ;

I had to add the LEFT & RIGHT tokens because otherwise the parser doesn't recognize that LEFT(functionArgument, functionArgument) is also a function...

Prior to this I had defined another rule called "joinOn" that needs the LEFT & RIGHT token types, see:

joinOn
    :    ( INNER    | ( LEFT | RIGHT | FULL {} ) ( OUTER )? )? JOIN
    ;

My AST subtree for function would ideally be: ^(FUNCTION NonQuotedIdentifier ^( FUNCTIONARGUMENTS functionArgument+ )? ) but now I have to make it ^(FUNCTION NonQuotedIdentifier? LEFT? RIGHT? ^( FUNCTIONARGUMENTS functionArgument+ )? ) which messes up the AST in my opinion...

I hope this makes my situation and problem more clear? Any help would be appreciated :o)

Regards,
Bill

On Sun, Apr 12, 2009 at 12:51 PM, Alexander Brown <abrown at analytics8.com> wrote:

	Hi,

	It's sort of an odd question in the sense that LEFT or RIGHT (either as outer join type specifiers or as character value functions in TSQL) are legitimate keywords rather than identifiers (like column and table names or schema qualifiers, etc.).  There's no ambiguity at the parser level between those two scenarios though, so there isn't any need to force the lexer to generate an identifier in one scenario and a keyword in another.

	I can only imagine that you want to identify the keywords as identifiers for one of two reasons: either the DB doesn't constrain users from using keywords as identifiers (CREATE TABLE TABLE, for example), or what you want in your AST is a generic character function node for all character functions with a specific signature (function_name LPAREN character_value_expression COMMA numeric_value_expression RPAREN, for example).  Even in the latter scenario I don't think you really want to identify the function 'RIGHT' or 'LEFT' as an identifier.

	All this being said, you could probably rewrite the AST to do what you want (haven't tried it though).  If you provide some more detail about what you are trying to achieve at the AST level, perhaps I could suggest a way to achieve it.

	Alex


________________________________

	Hi, 

	I'm creating a parser for a SQL dialect (sue me :oP) and I'm facing a problem regarding the lexer generating the wrong kind of token in a certain context. 

	Basically I have defined two tokens called LEFT & RIGHT which are needed to parse SQL joins (left outer join, right outer join, etc...) 

	LEFT : 'left' ;
	RIGHT : 'right' ;

	The problem occurs when I'm matching the SQL *functions* LEFT & RIGHT. 

	LEFT (functionArgs)
	RIGHT (functionArgs)

	I want the function name to be an IDENTIFIER token but no can do due to the lexer... It gives me a LEFT or RIGHT token obviously :'o( 

	What are the general recommendations you experienced ANTLR buffs can give me? The parser is generating an AST so I don't really care how it matches as long as I can keep my AST neat 'n tidy :o/ 

	Thanks! Bill 
