[antlr-interest] Island grammar implementation

Wed Jan 12 02:33:47 PST 2011

On Wed, Jan 12, 2011 at 10:18 AM, Hiran Chaudhuri <Hiran.Chaudhuri at web.de> wrote:

> Hello, everybody.
> I've used ANTLR now for some simple cases, and it worked out quite well.
> But now I am facing files that can - among other items - contain regular
> expressions. These regular expressions are driving my lexer nuts as I cannot
> explain that the same characters mean different things within a regex as
> opposed to somewhere else in the file. As far as I understood such patterns
> can be addressed with (parser driven) Island Grammars, where the parser
> detects when a regex is expected and switches the lexer/parser combination
> until satisfied.
> However I am lacking a sample how to accomplish this. I looked at
> http://www.antlr.org/wiki/display/ANTLR3/Island+Grammars+Under+Parser+Control,
> but it seems like this code does not compile directly (I'm working on ANTLR
> 3.2), and if I make the necessary amendments, it seems to always run out of
> heap space.
> Is there another example for Island Grammars controlled through the parser
> that I could use?
> Best regards Hiran
>

Here's a small demo that compiles using ANTLR v3.2.

Let's say your language consists of one or more assignment statements. An
assignment is an identifier followed by '=' and then a number, another
identifier or a regex literal, and it ends with a semicolon:

grammar Language;

parse
  :  assignment+ EOF
  ;

assignment
  :  Identifier '=' atom ';'
  ;

atom
  :  Identifier
  |  Number
  |  Regex
  ;

Regex
  :  '/' ('\\' . | ~('/' | '\\'))* '/'
  ;

Identifier
  :  'a'..'z'
  |  'A'..'Z'
  ;

Number
  :  '0'..'9'+
  ;

Space
  :  (' ' | '\t' | '\r' | '\n') {skip();}
  ;
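
To make it a bit more concrete why the Regex lexer rule above hands the whole
literal to the parser as a single token, here is a minimal hand-written sketch
in plain Java (no ANTLR involved) of the same scanning logic: open with '/',
then repeatedly consume either a backslash plus any character or any character
that is neither '/' nor a backslash, and stop at the closing '/'. The class
and method names are made up for illustration, and this is not the code ANTLR
generates.

```java
final class RegexLiteralScanner {

    /**
     * Hand-written equivalent of the lexer rule
     *     '/' ('\\' . | ~('/' | '\\'))* '/'
     * Returns the index just past the closing '/', or -1 if no complete
     * regex literal starts at position 'start'.
     */
    static int scan(String input, int start) {
        if (start < 0 || start >= input.length() || input.charAt(start) != '/') {
            return -1;                       // a literal must open with '/'
        }
        int i = start + 1;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\\') {
                if (i + 1 >= input.length()) {
                    return -1;               // dangling backslash at end of input
                }
                i += 2;                      // '\\' . : backslash plus any character
            } else if (c == '/') {
                return i + 1;                // unescaped '/' closes the literal
            } else {
                i++;                         // ~('/' | '\\') : any other character
            }
        }
        return -1;                           // ran out of input: unterminated literal
    }

    public static void main(String[] args) {
        // The Java literal below is the text:  c = /\\\/\wab/;
        String source = "c = /\\\\\\/\\wab/;";
        int start = source.indexOf('/');
        int end = scan(source, start);
        System.out.println(source.substring(start, end));
    }
}
```

Note that the escaped '/' in the middle does not end the literal, because the
backslash branch consumes it together with the backslash.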

and your (simplified) regex grammar looks like:

grammar Regex;

parse
  :  Delim atom* Delim EOF
  ;

atom
  :  EscapeSequence
  |  CharClass
  |  Other
  ;

Delim
  :  '/'
  ;

EscapeSequence
  :  '\\' ('\\' | '/')
  ;

CharClass
  :  '\\' ('d' | 'w' | 's')
  ;

Other
  :  ~Delim
  ;
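
As a rough illustration of what the four lexer rules of the Regex grammar do
with a literal, the plain-Java sketch below (again without ANTLR) splits a
regex literal into "RuleName:text" strings. The class name, the method name
and the output format are invented for this example. Treating a backslash
that is followed by anything other than '\', '/', 'd', 'w' or 's' as a
single-character Other token is an assumption of the sketch, not something
verified against the generated lexer.

```java
import java.util.ArrayList;
import java.util.List;

final class RegexTokenSketch {

    /** Splits a regex literal such as /\d\w/ into "RuleName:text" entries. */
    static List<String> tokenize(String regex) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < regex.length()) {
            char c = regex.charAt(i);
            char next = (i + 1 < regex.length()) ? regex.charAt(i + 1) : '\0';
            if (c == '/') {
                tokens.add("Delim:/");                                   // Delim : '/'
                i++;
            } else if (c == '\\' && (next == '\\' || next == '/')) {
                tokens.add("EscapeSequence:" + regex.substring(i, i + 2)); // '\\' ('\\' | '/')
                i += 2;
            } else if (c == '\\' && (next == 'd' || next == 'w' || next == 's')) {
                tokens.add("CharClass:" + regex.substring(i, i + 2));    // '\\' ('d' | 'w' | 's')
                i += 2;
            } else {
                tokens.add("Other:" + c);                                // Other : ~Delim
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The Java literal below is the text:  /\\\/\wab/
        for (String token : tokenize("/\\\\\\/\\wab/")) {
            System.out.println(token);
        }
    }
}
```

For the literal /\\\/\wab/ this yields Delim, two EscapeSequence entries,
one CharClass, two Other entries and a closing Delim, which is the sequence
the parse rule (Delim atom* Delim EOF) expects.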

Now, to separately parse the regex literal inside your Language-grammar and
create an AST from it all, do something like this:

grammar Language;

options {
  output=AST;
}

tokens {
  ROOT;
  REGEX;
  ASSIGNMENT;
}

@parser::members {
  // Runs the separate Regex lexer and parser over the text of a Regex token
  // and returns the resulting AST (null if anything goes wrong).
  private CommonTree regexAST(String source) {
    try {
      ANTLRStringStream in = new ANTLRStringStream(source);
      RegexLexer lexer = new RegexLexer(in);
      CommonTokenStream tokens = new CommonTokenStream(lexer);
      RegexParser parser = new RegexParser(tokens);
      return (CommonTree)parser.parse().getTree();
    } catch(Exception e) {
      e.printStackTrace();
    }
    return null;
  }
}

parse
  :  assignment+ EOF -> ^(ROOT assignment+)
  ;

assignment
  :  Identifier '=' atom ';' -> ^(ASSIGNMENT Identifier atom)
  ;

// For a Regex token, the AST built by the Regex grammar replaces the token.
atom
  :  Identifier
  |  Number
  |  r=Regex {CommonTree ast = regexAST($r.text);} -> ^( {ast} )
  ;

Regex
  :  '/' ('\\' . | ~('/' | '\\'))* '/'
  ;

Identifier
  :  'a'..'z'
  |  'A'..'Z'
  ;

Number
  :  '0'..'9'+
  ;

Space
  :  (' ' | '\t' | '\r' | '\n') {skip();}
  ;


grammar Regex;

options {
  output=AST;
}

tokens {
  REGEX;
  ESC_SEQ;
  CHAR_CLASS;
  OTHER;
}

parse
  :  Delim atom* Delim EOF -> ^(REGEX atom*)
  ;

atom
  :  EscapeSequence -> ^(ESC_SEQ EscapeSequence)
  |  CharClass      -> ^(CHAR_CLASS CharClass)
  |  Other          -> ^(OTHER Other)
  ;

Delim
  :  '/'
  ;

EscapeSequence
  :  '\\' ('\\' | '/')
  ;

CharClass
  :  '\\' ('d' | 'w' | 's')
  ;

Other
  :  ~Delim
  ;

Now when parsing the source:

a = 1;
b = a;
c = /\\\/\wab/;

with your LanguageParser, you'll get the AST attached to this e-mail
message.

Below is the test class I used:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
    public static void main(String[] args) throws Exception {
        // Same three statements as above; backslashes are doubled for Java.
        ANTLRStringStream in = new ANTLRStringStream(
                "a = 1;             \n" +
                "b = a;             \n" +
                "c = /\\\\\\/\\wab/;  "
        );
        LanguageLexer lexer = new LanguageLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        LanguageParser parser = new LanguageParser(tokens);
        CommonTree tree = (CommonTree)parser.parse().getTree();
        // Print the AST in DOT format so it can be rendered with Graphviz.
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT(tree);
        System.out.println(st);
    }
}
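
One detail in the test class that is easy to trip over is the doubled
backslashes: every backslash of the input has to be written twice inside a
Java string literal. The small stand-alone check below (class and method
names are made up, and the trailing padding spaces of the original string are
left out) shows that the eight backslashes in the Java source become the four
backslashes of the third statement.

```java
final class EscapeCheck {

    /** The third input line of the test class, without trailing spaces. */
    static String source() {
        return "c = /\\\\\\/\\wab/;";
    }

    /** Counts the backslash characters actually present in the string. */
    static long countBackslashes(String s) {
        return s.chars().filter(ch -> ch == '\\').count();
    }

    public static void main(String[] args) {
        System.out.println(source());                    // prints: c = /\\\/\wab/;
        System.out.println(countBackslashes(source()));  // prints: 4
    }
}
```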

Regards,

Bart.

PS. In case any text gets mangled in my message, I attached a zip file
containing the grammar and test class source files.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ast.gif
Type: image/gif
Size: 11908 bytes
Desc: not available
Url : http://www.antlr.org/pipermail/antlr-interest/attachments/20110112/184d06dd/attachment.gif 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: source.zip
Type: application/zip
Size: 1321 bytes
Desc: not available
Url : http://www.antlr.org/pipermail/antlr-interest/attachments/20110112/184d06dd/attachment.zip