[antlr-interest] Island grammar implementation
Bart Kiers
bkiers at gmail.com
Wed Jan 12 02:33:47 PST 2011
On Wed, Jan 12, 2011 at 10:18 AM, Hiran Chaudhuri <Hiran.Chaudhuri at web.de> wrote:
> Hello, everybody.
> I've used ANTLR now for some simple cases, and it worked out quite well.
> But now I am facing files that can - between other items - contain regular
> expressions. These regular expressions are driving my lexer nuts as I cannot
> explain that the same characters mean different things within a regex as
> opposed to somewhere else in the file. As far as I understood such patterns
> can be addressed with (parser driven) Island Grammars, where the parser
> detects when a regex is expected and switches the lexer/parser combination
> until satisfied.
> However I am lacking a sample how to accomplish this. I looked at
> http://www.antlr.org/wiki/display/ANTLR3/Island+Grammars+Under+Parser+Control,
> but it seems like this code does not compile directly (I'm working on ANTLR
> 3.2), and if I make the necessary amendments, it seems to always run out of
> heap space.
> Is there another example for Island Grammars controlled through the parser
> that I could use?
> Best regards Hiran
>
Here's a small demo that compiles using ANTLR v3.2.
Let's say your language consists of one or more assignment statements. An
assignment is an identifier, followed by '=' and then a number, another
identifier or a regex literal, and it ends with a semicolon:
    grammar Language;

    parse
      :  assignment+ EOF
      ;

    assignment
      :  Identifier '=' atom ';'
      ;

    atom
      :  Identifier
      |  Number
      |  Regex
      ;

    Regex
      :  '/' ('\\' . | ~('/' | '\\'))* '/'
      ;

    Identifier
      :  'a'..'z'
      |  'A'..'Z'
      ;

    Number
      :  '0'..'9'+
      ;

    Space
      :  (' ' | '\t' | '\r' | '\n') {skip();}
      ;
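To make it clearer what the Regex lexer rule above does, here is a small hand-written Java sketch of the same idea: a backslash always swallows the next character, and only an unescaped '/' ends the literal. This is not ANTLR-generated code; the class and method names (RegexLiteralScanner, endOfLiteral) are made up for illustration.

```java
// Hand-written approximation of the Regex lexer rule, for illustration only.
// RegexLiteralScanner and endOfLiteral are hypothetical names.
final class RegexLiteralScanner {

    // Returns the index just past the closing '/', or -1 if src does not
    // hold a complete regex literal beginning at 'start'.
    static int endOfLiteral(String src, int start) {
        if (start >= src.length() || src.charAt(start) != '/') {
            return -1;                                 // must open with '/'
        }
        int i = start + 1;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '\\') {
                if (i + 1 >= src.length()) return -1;  // dangling backslash
                i += 2;                                // '\\' . : backslash plus any one char
            } else if (c == '/') {
                return i + 1;                          // unescaped '/' closes the literal
            } else {
                i++;                                   // ~('/' | '\\') : any other char
            }
        }
        return -1;                                     // input ended before the closing '/'
    }

    public static void main(String[] args) {
        String src = "c = /\\\\\\/\\wab/;";            // the text  c = /\\\/\wab/;
        int start = src.indexOf('/');
        int end = endOfLiteral(src, start);
        System.out.println(src.substring(start, end)); // the regex literal, delimiters included
    }
}
```

Because the whole literal comes out as a single token, the rest of the Language lexer never sees the characters inside it.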
The (simplified) grammar for the regex literals themselves looks like:
    grammar Regex;

    parse
      :  Delim atom* Delim EOF
      ;

    atom
      :  EscapeSequence
      |  CharClass
      |  Other
      ;

    Delim
      :  '/'
      ;

    EscapeSequence
      :  '\\' ('\\' | '/')
      ;

    CharClass
      :  '\\' ('d' | 'w' | 's')
      ;

    Other
      :  ~Delim
      ;
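As a rough illustration of how the four lexer rules above split up a regex literal, here is a plain Java sketch that mimics them by hand. It is only an approximation of what the generated RegexLexer does, and the names RegexTokenSketch and tokenize are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written stand-in for the Regex grammar's lexer rules, for illustration only.
// RegexTokenSketch and tokenize are hypothetical names.
final class RegexTokenSketch {

    // Splits a literal such as /\d\// into "RuleName:text" strings.
    static List<String> tokenize(String literal) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i);
            char next = i + 1 < literal.length() ? literal.charAt(i + 1) : '\0';
            if (c == '/') {
                tokens.add("Delim:/");                                       // Delim
                i++;
            } else if (c == '\\' && (next == '\\' || next == '/')) {
                tokens.add("EscapeSequence:" + literal.substring(i, i + 2)); // \\ or \/
                i += 2;
            } else if (c == '\\' && (next == 'd' || next == 'w' || next == 's')) {
                tokens.add("CharClass:" + literal.substring(i, i + 2));      // \d, \w or \s
                i += 2;
            } else {
                tokens.add("Other:" + c);                                    // any other single char
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The literal /\\\/\wab/ written as a Java string.
        for (String token : tokenize("/\\\\\\/\\wab/")) {
            System.out.println(token);
        }
    }
}
```

For the literal /\\\/\wab/ this gives a Delim, two EscapeSequence tokens (\\ and \/), one CharClass (\w), two Other tokens (a and b) and a closing Delim.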
Now, to separately parse the regex literal inside your Language-grammar and
create an AST from it all, do something like this:
    grammar Language;

    options {
      output=AST;
    }

    tokens {
      ROOT;
      REGEX;
      ASSIGNMENT;
    }

    @parser::members {
      // Runs the separate Regex lexer/parser over the text of a Regex token
      // and returns the resulting sub-tree (null if that parse throws).
      private CommonTree regexAST(String source) {
        try {
          ANTLRStringStream in = new ANTLRStringStream(source);
          RegexLexer lexer = new RegexLexer(in);
          CommonTokenStream tokens = new CommonTokenStream(lexer);
          RegexParser parser = new RegexParser(tokens);
          return (CommonTree)parser.parse().getTree();
        } catch(Exception e) {
          e.printStackTrace();
        }
        return null;
      }
    }

    parse
      :  assignment+ EOF -> ^(ROOT assignment+)
      ;

    assignment
      :  Identifier '=' atom ';' -> ^(ASSIGNMENT Identifier atom)
      ;

    atom
      :  Identifier
      |  Number
      |  r=Regex {CommonTree ast = regexAST($r.text);} -> ^( {ast} )
      ;

    Regex
      :  '/' ('\\' . | ~('/' | '\\'))* '/'
      ;

    Identifier
      :  'a'..'z'
      |  'A'..'Z'
      ;

    Number
      :  '0'..'9'+
      ;

    Space
      :  (' ' | '\t' | '\r' | '\n') {skip();}
      ;

    // The Regex grammar goes in its own file (Regex.g):
    grammar Regex;

    options {
      output=AST;
    }

    tokens {
      REGEX;
      ESC_SEQ;
      CHAR_CLASS;
      OTHER;
    }

    parse
      :  Delim atom* Delim EOF -> ^(REGEX atom*)
      ;

    atom
      :  EscapeSequence -> ^(ESC_SEQ EscapeSequence)
      |  CharClass      -> ^(CHAR_CLASS CharClass)
      |  Other          -> ^(OTHER Other)
      ;

    Delim
      :  '/'
      ;

    EscapeSequence
      :  '\\' ('\\' | '/')
      ;

    CharClass
      :  '\\' ('d' | 'w' | 's')
      ;

    Other
      :  ~Delim
      ;
Now when parsing the source:
    a = 1;
    b = a;
    c = /\\\/\wab/;
with your LanguageParser, you'll get the AST attached to this e-mail
message.
Below is the test class I used:
    import org.antlr.runtime.*;
    import org.antlr.runtime.tree.*;
    import org.antlr.stringtemplate.*;

    public class Main {
      public static void main(String[] args) throws Exception {
        ANTLRStringStream in = new ANTLRStringStream(
            "a = 1;           \n" +
            "b = a;           \n" +
            "c = /\\\\\\/\\wab/; "
        );
        LanguageLexer lexer = new LanguageLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        LanguageParser parser = new LanguageParser(tokens);
        CommonTree tree = (CommonTree)parser.parse().getTree();
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT(tree);
        System.out.println(st);
      }
    }
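One thing that may look odd is that the regex line in the test class has more backslashes than the source shown earlier. That is only Java string escaping: every backslash has to be doubled inside a Java string literal. The small stand-alone sketch below (EscapingDemo and countBackslashes are names made up for this example, not part of the attached sources) shows that the eight backslashes in the Java literal become four at run time.

```java
// Stand-alone check of the Java string escaping used in the test class.
// EscapingDemo and countBackslashes are hypothetical names.
final class EscapingDemo {

    // Same literal as in the test class; each "\\" is one backslash at run time.
    static final String SOURCE_LINE = "c = /\\\\\\/\\wab/; ";

    // Counts the backslash characters actually present in the string.
    static int countBackslashes(String s) {
        int n = 0;
        for (char ch : s.toCharArray()) {
            if (ch == '\\') n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(SOURCE_LINE.trim());             // the text the lexer receives
        System.out.println(countBackslashes(SOURCE_LINE));  // number of backslashes at run time
    }
}
```

So the lexer receives exactly the third line of the example source: c = /\\\/\wab/;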
Regards,
Bart.
PS. In case any text gets mangled in my message, I attached a zip file
containing the grammar and test class source files.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ast.gif
Type: image/gif
Size: 11908 bytes
Desc: not available
Url : http://www.antlr.org/pipermail/antlr-interest/attachments/20110112/184d06dd/attachment.gif
-------------- next part --------------
A non-text attachment was scrubbed...
Name: source.zip
Type: application/zip
Size: 1321 bytes
Desc: not available
Url : http://www.antlr.org/pipermail/antlr-interest/attachments/20110112/184d06dd/attachment.zip