[antlr-interest] Embedding one language within another

Mon Apr 16 11:20:50 PDT 2007

Hi all,

I'm trying to write a recognizer for a Cheetah-like[1] templating  
language which effectively allows one language to be "embedded"  
within another. Templates are mostly plain text and only a few tokens  
have special meaning (directives starting with "#", placeholders  
starting with "$", and escape sequences starting with "\"). So that  
much is easy to lex/parse. The tricky bit is that many directives  
take Ruby expressions as parameters, and that means I have to parse  
at least a subset of Ruby as well.

I have a working prototype which is itself written in Ruby[2], but it
is both slow and memory-hungry (due to memoization), so I am now
looking to re-implement the parser in a compiled language, specifically
using ANTLR targeting C so that I can incorporate the generated
parser into a Ruby extension.

I'm new to ANTLR and have only been working on this for the last 24  
hours; I've read as much of the new ANTLR book as I can but I'm not  
really sure what the best approach is... My original pre-ANTLR  
implementation uses an integrated lexer/parser (not separate phases)  
and so can easily switch between Ruby and not-Ruby modes. But given
that ANTLR uses two separate phases, I'm not sure how to proceed:
what constitutes a token depends on the tokens that precede it. For
example, in the main body of the template
"foo.bar" has no special meaning at all, but inside a Ruby section it  
is a message send (message "bar" sent to object "foo").
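
To make the mode switch concrete, here is a minimal sketch in Ruby of
the kind of context-dependent tokenizing described above. It is not the
prototype's actual code: the tokenize method, the token names, the
"#echo" directive used in the test input, and the assumption that a
newline ends a directive's Ruby section are all hypothetical, chosen
only for illustration.

```ruby
require 'strscan'

# Hypothetical two-mode tokenizer: :text mode for the template body,
# :ruby mode for directive parameters. Token names are illustrative only.
def tokenize(input)
  scanner = StringScanner.new(input)
  mode    = :text
  tokens  = []
  until scanner.eos?
    if mode == :text
      if scanner.scan(/#\w+/)
        tokens << [:directive, scanner.matched]
        mode = :ruby                        # assume directive parameters are Ruby
      elsif scanner.scan(/[^#]+/)
        tokens << [:text, scanner.matched]  # "foo.bar" is plain text here
      else
        tokens << [:text, scanner.getch]    # a lone "#" with no directive name
      end
    else
      if scanner.scan(/\n/)
        mode = :text                        # assume a newline ends the directive
      elsif scanner.scan(/[ \t]+/)
        # skip whitespace inside the Ruby section
      elsif scanner.scan(/[A-Za-z_]\w*/)
        tokens << [:identifier, scanner.matched]
      elsif scanner.scan(/\./)
        tokens << [:dot, "."]               # message-send operator
      else
        tokens << [:other, scanner.getch]   # anything this sketch doesn't model
      end
    end
  end
  tokens
end
```

The same characters "foo.bar" come out as a single text token in the
template body but as identifier, dot, identifier after a directive,
which is exactly the behaviour I don't know how to get from a separate
lexer phase.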

My lexer rules are starting to look nastily complicated and parser- 
like; in the end there'll be nothing left in the parser! Can I write  
two lexers and switch to the right one depending on what tokens  
arrive on the input stream? Or could I do this with a single lexer if
I very carefully prioritize my rules? (Rule precedence is determined
by order of appearance in the grammar file, right?) Is there some
other way around this issue that I haven't
thought of yet? I've seen some posts in the archives about parsing  
"here documents", which is a similar issue, but the posts in the  
archives are very old and I'm not sure how things stand in ANTLR v3.

Cheers,
Wincent

[1] http://cheetahtemplate.org/
[2] http://walrus.wincent.com/