[antlr-interest] Embedding one language within another

Mon Apr 16 12:03:07 PDT 2007

Take a look at the example of island grammars in the downloadable
examples. I have yet to convert these to C; it isn't difficult, it just
takes time (and probably some bug fixes ;-).

It sounds like you can use this as a template for your work. The key is
that you have to trigger this from the lexer, as ANTLR will not let you
do syntax-directed lexing (have the parser tell the lexer what to do
next).

But say your directive is:

Myscriptlang
Myscriptlang
#doruby ruby-expression-that does-something
Myscriptlang
Etc

Then your lexer should see #doruby and call another lexer/parser
combination. It can then skip the line, or return it as a special token
like DIRECTIVE or something. Here is an example (using the C target) of
doing this, for a language I had to parse that had a pattern matching
operator which was too stupid for words and basically needed a second
grammar to deal with it (which was then a trivial grammar):

In the first parser/lexer grammar I have (in the lexer spec):

OPMATCH : '?'
        {
            /* To make parsing of COS pattern matching simpler and so that
             * the COS parser can also validate the patterns, we call an
             * island grammar that feeds off the current input stream.
             */
            cosPatternLexer   cl     = new cosPatternLexer(input);
            CommonTokenStream patTok = new CommonTokenStream(cl);
            cosPatternParser  cp     = new cosPatternParser(patTok);
            cp.pattern();

            /* Whatever character (1 + recover consume) caused the end of
             * the pattern parse, we can guarantee that it was meant for
             * this parser, so we seek back to it.
             */
            input.seek(input.index()-2);

            ... destroy parser we made on the fly, skip this token in this lexer, etc...

Note that I just reuse the current input stream (input is predefined for
you) in the lexer rule, then create a new parser on the fly (which you
might be able to cache rather than create and destroy each time). The
parser I called will actually return an AST for the pattern, but you can
do whatever you want of course. The grammar that was called is a
completely separate grammar and knows nothing of the caller. It just
stops parsing at some point (in your case, probably EOL of some sort).
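
As a rough illustration of the hand-off for the #doruby case, here is a
minimal sketch in plain Python rather than ANTLR, so that it runs on its
own. Every name in it (CharStream, parse_island, lex_template) is made up
for the sketch and is not part of any ANTLR runtime. It only shows the
shape of the idea: the outer lexer and the inner reader share one input
stream, the inner one stops at EOL, and the outer one carries on from
wherever the stream was left.

```python
class CharStream:
    """Hypothetical stand-in for a lexer input stream (not an ANTLR class)."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def index(self):
        return self.pos

    def seek(self, pos):
        self.pos = pos

    def at_end(self):
        return self.pos >= len(self.text)

    def peek(self):
        return self.text[self.pos]

    def consume(self):
        ch = self.text[self.pos]
        self.pos += 1
        return ch


def parse_island(stream):
    """Stand-in for the inner lexer/parser: read up to, but not past, EOL."""
    start = stream.index()
    while not stream.at_end() and stream.peek() != "\n":
        stream.consume()
    return stream.text[start:stream.index()].strip()


DIRECTIVE = "#doruby"


def lex_template(text):
    """Outer lexer: plain text becomes TEXT, each directive becomes DIRECTIVE."""
    stream = CharStream(text)
    tokens = []
    pending = []  # plain-text characters not yet emitted as a token

    def flush():
        if pending:
            tokens.append(("TEXT", "".join(pending)))
            pending.clear()

    while not stream.at_end():
        if stream.text.startswith(DIRECTIVE, stream.index()):
            flush()
            # Skip the directive keyword, then hand the SAME stream to the
            # island reader; it leaves the stream positioned at the EOL.
            stream.seek(stream.index() + len(DIRECTIVE))
            tokens.append(("DIRECTIVE", parse_island(stream)))
        else:
            pending.append(stream.consume())
    flush()
    return tokens


if __name__ == "__main__":
    for token in lex_template("Myscriptlang\n#doruby 1 + 2\nMyscriptlang\n"):
        print(token)
```

In this sketch the inner reader peeks before consuming, so it never eats
a character that belongs to the outer lexer and no seek-back is needed.
A generated parser with lookahead and error recovery may read past its
own end, which is why the lexer rule above rewinds the shared stream.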

Jim

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Wincent Colaiuta
Sent: Monday, April 16, 2007 11:21 AM
To: antlr-interest at antlr.org
Subject: [antlr-interest] Embedding one language within another

Hi all,

I'm trying to write a recognizer for a Cheetah-like[1] templating  
language which effectively allows one language to be "embedded"  
within another. Templates are mostly plain text and only a few tokens  
have special meaning (directives starting with "#", placeholders  
starting with "$", and escape sequences starting with "\"). So that  
much is easy to lex/parse. The tricky bit is that many directives  
take Ruby expressions as parameters, and that means I have to parse  
at least a subset of Ruby as well.

I have a working prototype which is itself written in Ruby[2] but it  
is both slow and memory hungry (due to memoization) so I am now  
looking to re-implement the parser in a compiled language, specifically
using ANTLR targeting C so that I can incorporate the generated  
parser into a Ruby extension.

I'm new to ANTLR and have only been working on this for the last 24  
hours; I've read as much of the new ANTLR book as I can but I'm not  
really sure what the best approach is... My original pre-ANTLR  
implementation uses an integrated lexer/parser (not separate phases)  
and so can easily switch between Ruby and not-Ruby modes. But given  
that ANTLR uses two separate phases I am not aware of how to proceed:  
what constitutes a token depends on the preceding tokens; for
example, in the main body of the template
"foo.bar" has no special meaning at all, but inside a Ruby section it  
is a message send (message "bar" sent to object "foo").

My lexer rules are starting to look nastily complicated and parser- 
like; in the end there'll be nothing left in the parser! Can I write  
two lexers and switch to the right one depending on what tokens  
arrive on the input stream? Is it likely that I'll be able to do this  
with a single lexer if I very carefully prioritize my rules (rule  
precedence is determined by order of appearance in the grammar file,
right?). Is there some other way around this issue that I haven't  
thought of yet? I've seen some posts in the archives about parsing  
"here documents", which is a similar issue, but the posts in the  
archives are very old and I'm not sure how things stand in ANTLR v3.

Cheers,
Wincent

[1] http://cheetahtemplate.org/
[2] http://walrus.wincent.com/