[antlr-interest] Stupid languages, and parsing them

Andy Tripp antlr at jazillian.com
Mon Apr 13 08:29:33 PDT 2009


What I do to handle SQL embedded inside C, C++, and COBOL is to run
a simple preprocessor that replaces each embedded code block with a single,
unique token (e.g. "SQLTOKEN123") and stores the embedded code in
a hashtable (e.g. "SQLTOKEN123" -> "EXEC SQL SELECT * FROM EMPLOYEE END-EXEC").
The lexing and parsing of the embedded code is then completely separate
from that of the outer code. At the end, each unique token is replaced
with something new (in my case, the embedded SQL translated to Java JDBC calls).
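A minimal Java sketch of that preprocessing step, assuming the `EXEC SQL ... END-EXEC` delimiters shown above (the class name, regex, and `SQLTOKENn` numbering scheme are illustrative, not Andy's actual code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the preprocessing step described above: each embedded
// "EXEC SQL ... END-EXEC" block is swapped for a unique placeholder
// token and stashed in a map for later, separate translation.
public class SqlPreprocessor {
    private static final Pattern EMBEDDED_SQL =
        Pattern.compile("EXEC\\s+SQL.*?END-EXEC", Pattern.DOTALL);

    private final Map<String, String> stash = new HashMap<>();
    private int counter = 0;

    // Replace every embedded SQL block with a unique token so the
    // outer-language lexer/parser never sees the SQL at all.
    public String extract(String source) {
        Matcher m = EMBEDDED_SQL.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String token = "SQLTOKEN" + (++counter);
            stash.put(token, m.group());
            m.appendReplacement(out, token);
        }
        m.appendTail(out);
        return out.toString();
    }

    // The stashed blocks, keyed by token, ready to be translated and
    // substituted back in at the end.
    public Map<String, String> stash() { return stash; }
}
```

The reverse step is a straightforward search-and-replace of each token with its translated form.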

Andy


Sam Barnett-Cormack wrote:
> Hi all,
> 
> In my ongoing project, I need to parse a really crazy structure that 
> wants to change the lexing rules depending on syntactic factors. I hate 
> this.
> 
> Within the thing I'm talking about, whitespace and comments are handled 
> as they are the rest of the time (thankfully). Alphanumeric tokens are 
> all one type, commas are allowed, and '[' and '{' (and their closing 
> counterparts) have special meaning. Then there are things of the form 
> &whatever ('&' followed by an alphabetic character followed by any number 
> of alphanumerics). Those are already distinct types. However, once into 
> this weird 'zone', most keywords aren't keywords anymore and must be 
> treated as alphanumeric tokens.
> 
> Now, this state is entered by 'WITH SYNTAX {' (and exited with '}').
> 
> The problem is that the specification considers the opener to be three 
> tokens, with any amount of whitespace and comments allowed between 
> each. I can easily see that I could use gated predicates to switch 
> between two lexer "modes"; that's one solution. Broadly, I can see two 
> approaches:
> 
> 1) Use member variables to track whether the most recent non-WS, 
> non-comment tokens were WITH, SYNTAX, and '{' (a sort of look-behind, 
> implemented kludgily either by putting an action in *every* rule or by 
> overriding the emit machinery to keep track of the last two tokens on 
> the DEFAULT channel), and use these to switch into crazy-mode, where 
> much is different.
> 
> 2) Make the parser just accept *everything* within the syntax 
> definition, and deal with it in some other way (????) later. It has to 
> be that bad, as the "normal" lexer sees '[[' as one token, while the 
> "weird" version has to see it as two '[' tokens.
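Option 1's look-behind can be sketched independently of any lexer framework. The following Java helper is a hypothetical illustration (the token-type names WITH, SYNTAX, LBRACE, RBRACE are assumptions, not from any real grammar): it remembers the last two tokens seen on the default channel and flips a mode flag when '{' arrives right after WITH SYNTAX, so the lexer's emit hook only needs to call one method.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical look-behind buffer for option 1: remembers the last two
// default-channel token types and toggles "crazy mode" on the sequence
// WITH SYNTAX {, back off again on }. Whitespace and comment tokens are
// assumed to live on a hidden channel and are never passed in, so they
// cannot break up the three-token opener.
class ModeTracker {
    private final Deque<String> lastTwo = new ArrayDeque<>();
    private boolean crazyMode = false;

    // Call this from the lexer's emit hook for every default-channel token.
    void onDefaultChannelToken(String type) {
        if (!crazyMode && "LBRACE".equals(type)
                && lastTwo.size() == 2
                && "WITH".equals(lastTwo.peekFirst())
                && "SYNTAX".equals(lastTwo.peekLast())) {
            crazyMode = true;          // entered 'WITH SYNTAX {'
        } else if (crazyMode && "RBRACE".equals(type)) {
            crazyMode = false;         // matching '}' closes the zone
        }
        lastTwo.addLast(type);
        if (lastTwo.size() > 2) {
            lastTwo.removeFirst();     // keep only the last two tokens
        }
    }

    // Gated predicates in the lexer would consult this flag to decide,
    // e.g., whether '[[' is one token or two '[' tokens.
    boolean inCrazyMode() { return crazyMode; }
}
```

Wiring this into a real lexer would still mean overriding its emit path, but the bookkeeping itself stays in one small class rather than in an action on every rule.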
> 
> Anyone got any thoughts? Any idea which would be less painful? Is there 
> already some way of tracking the most recently emitted tokens on a 
> specific channel?
> 
> Thanks,
> 
> Sam
> 
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
> 


