[antlr-interest] ANTLR3 Nested parser
Thomas Brandon
tbrandonau at gmail.com
Wed Jan 23 02:02:59 PST 2008
I'm not familiar with Scheme, but assuming that all parentheses are properly
nested except inside known literal contexts, something like this should be OK:
SCHEME_BLOCK
: '#' NESTED_SCHEME_BLOCK
;
fragment
NESTED_SCHEME_BLOCK
: '('
( options {greedy=false; k=1;}
: NESTED_SCHEME_BLOCK
| STRING_LITERAL
| CHAR_LITERAL
| .
)*
')'
;
This assumes string and char literals are the only literal contexts (in your
grammar those rules are called STRING and CHARACTER).
You may not need the k=1 option, but a similar construct in the ANTLR grammar
causes the LL(*) analysis to explode, which k=1 fixes.
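For comparison, the same scan can be sketched by hand outside the lexer. This
is only an illustration in Java (the grammar's target language): the class and
method names are hypothetical and not part of ANTLR, and it relies on the same
assumption as the rule above, namely that string and character literals are
the only places where parentheses do not nest.

```java
// Hypothetical helper, not part of ANTLR: finds the end of a '#(' ... ')' block.
final class SchemeBlockScanner {
    private SchemeBlockScanner() {}

    /**
     * Returns the index just past the ')' that closes the block beginning
     * with "#(" at position start, or -1 if no complete block is found.
     */
    static int findBlockEnd(String text, int start) {
        if (start < 0 || start + 1 >= text.length()
                || text.charAt(start) != '#' || text.charAt(start + 1) != '(') {
            return -1;
        }
        int depth = 0;
        int i = start + 1; // positioned on the opening '('
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '"') {
                // String literal: parentheses inside it do not count.
                i = skipString(text, i);
                if (i < 0) {
                    return -1; // unterminated string
                }
                continue;
            }
            if (c == '#' && i + 1 < text.length() && text.charAt(i + 1) == '\\') {
                // Character literal such as #\( : skip '#', '\' and one character.
                i += 3;
                continue;
            }
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth == 0) {
                    return i + 1;
                }
            }
            i++;
        }
        return -1; // ran out of input before the block closed
    }

    /** Returns the index just past the closing quote, or -1 if unterminated. */
    private static int skipString(String text, int open) {
        int i = open + 1;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '\\') {
                i += 2; // skip the escaped character
                continue;
            }
            if (c == '"') {
                return i + 1;
            }
            i++;
        }
        return -1;
    }
}
```

Given the text x = #(define (f x) (+ x 1)) rest and the index of the '#',
findBlockEnd returns the index just past the final ')'. The substring in
between could then be handed to the Scheme parser as the text of one token.
Scheme comments starting with ';' are not handled here, just as they are not
handled by the rule above.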
Tom.
On Jan 23, 2008 7:20 PM, Bertalan Fodor (LilyPondTool) <
lilypondtool at organum.hu> wrote:
>
> But, that seems like you will end up lexing a lot of things you should
> not need to this way, especially if you have a lot of embedded elements.
> Perhaps if you mention what you are trying to parse, then better solutions
> can be thought of.
>
>
>
> One other thought is whether the first lexer can determine where the
> embedded language starts and stops in which case you can tokenize the whole
> text into one token and invoke the embedded language parser from your
> parser.
>
> Thanks for your ideas. I attach the nested language grammar so you can see
> what I would need in order to find the end of the input.
> The outer grammar looks for the '#' symbol; upon seeing it, it switches
> to the nested language, like this:
>
> assignment_in_outer = #(inner-language expression (which-is scheme))
>
> The real fun is that the inner-language expression can even contain
> fragments written in the outer language. (If you know Scheme, you know
> that it allows redefining the language itself.) For now I treat those
> fragments as multiline comments for simplicity.
>
> Bert
>
>
>
> grammar Scheme;
> options {
> output=AST;
> language=Java;
> memoize=true;
> }
>
> tokens {
> ARROW = '=>';
> ELSE = 'else';
>
> UNQUOTE = 'unquote';
> UNQUOTE_SPLICING = 'unquote-splicing';
> IF = 'if';
> SET = 'set!';
> COND = 'cond';
> AND = 'and';
> OR = 'or';
> CASE = 'case';
> LET = 'let';
> LETSTAR = 'let*';
> LETREC = 'letrec';
> DO = 'do';
> DELAY = 'delay';
> QUASIQUOTE = 'quasiquote';
>
> LET_SYNTAX = 'let-syntax';
> LETREC_SYNTAX = 'letrec-syntax';
> SYNTAX_RULES = 'syntax-rules';
> QUOTE= 'quote';
> LAMBDA= 'lambda';
> BEGIN= 'begin';
> DEFINE= 'define';
> DEFINESYNTAX= 'define-syntax'
> ;
> }
>
>
> @members {
> private int qqtDepth;
> private static java.util.logging.Logger log =
> java.util.logging.Logger.getLogger("LilyParser");
> private Set macroNames = new HashSet();
> }
> @header {
> package lilytool.parser.antlr;
>
> import java.util.HashSet;
> import java.util.Set;
> }
>
> @lexer::header {
> package lilytool.parser.antlr;
> }
>
>
> identifier returns [String value]:
> syntacticKeyword { $value = $syntacticKeyword.value; }
> | variable { $value = $variable.value; };
>
>
> syntacticKeyword returns [ String value ]:
> expressionKeyword { $value = $expressionKeyword.value; }
> | ELSE^ { $value = $ELSE.text; }
> | ARROW^ { $value = $ARROW.text; }
> | DEFINE^ { $value = $DEFINE.text; }
> | UNQUOTE^ { $value = $UNQUOTE.text; }
> | UNQUOTE_SPLICING^ { $value = $UNQUOTE_SPLICING.text; }
> ;
>
> expressionKeyword returns [ String value ]:
> QUOTE^ { $value = $QUOTE.text; }
> | LAMBDA^ { $value = $LAMBDA.text; }
> | IF^ { $value = $IF.text; }
> | SET^ { $value = $SET.text; }
> | BEGIN^ { $value = $BEGIN.text; }
> | COND^ { $value = $COND.text; }
> | AND^ { $value = $AND.text; }
> | OR^ { $value = $OR.text; }
> | CASE^ { $value = $CASE.text; }
> | LET^ { $value = $LET.text; }
> | LETSTAR^ { $value = $LETSTAR.text; }
> | LETREC^ { $value = $LETREC.text; }
> | DO^ { $value = $DO.text; }
> | DELAY^ { $value = $DELAY.text; }
> | QUASIQUOTE^ { $value = $QUASIQUOTE.text; }
> ;
>
> variable returns [ String value ] :
> VARIABLE^ { $value = $VARIABLE.text; }
> | '...' { $value = "..."; }
> ;
>
>
> /*
> * External representations
> */
>
> datum :
> simpleDatum
> | compoundDatum
> ;
>
> simpleDatum :
> BOOLEAN
> | NUMBER
> | CHARACTER
> | STRING
> | symbol
> ;
>
> symbol :
> identifier
> ;
>
> compoundDatum :
> list
> | vector
> ;
>
> list :
> E_OPEN ( ( datum )+ ( DOT^ datum )? )? E_CLOSE
> | abbreviation
> ;
>
> abbreviation :
> abbrevPrefix datum
> ;
>
> abbrevPrefix :
> APOS^
> | BAPOS^ // "`"
> | COMMA^ // ","
> | COMMAAT^ //",@"
> ;
>
> vector :
> HASHOPEN^ ( datum )* E_CLOSE
> ;
>
> /*
> * expressions
> */
>
> expression :
> variable
> | literal
> | lambdaExpression
> | conditional
> | assignment
> | derivedExpression
> | procedureCall
> | macroUse
> | macroBlock
> ;
>
> literal :
> quotation
> | selfEvaluating
> ;
>
> selfEvaluating :
> BOOLEAN
> | NUMBER
> | CHARACTER
> | s=STRING
> ;
>
> quotation :
> APOS^ datum
> | E_OPEN QUOTE datum E_CLOSE
> ;
>
> procedureCall :
> E_OPEN operator ( operand )* E_CLOSE
> ;
>
> operator :
> expression
> ;
>
> operand :
> expression
> ;
>
> lambdaExpression :
> E_OPEN LAMBDA formals body E_CLOSE
> ;
>
> formals :
> E_OPEN ( ( variable )+ ( DOT variable )? )? E_CLOSE
> | variable
> ;
>
> body :
> ((definition) => definition)* sequence
> ;
>
> sequence : command* expression;
>
> conditional :
> E_OPEN IF test consequent alternate E_CLOSE
> ;
>
> test :
> expression
> ;
>
> consequent :
> expression
> ;
>
> alternate :
> ( expression )?
> ;
>
> assignment :
> E_OPEN SET variable expression E_CLOSE
> ;
>
> derivedExpression :
> (quasiquotation)=>quasiquotation |
> E_OPEN ( COND ( condClause+ (E_OPEN ELSE sequence E_CLOSE)? | E_OPEN ELSE
> sequence E_CLOSE)
> | CASE expression ( caseClause+ (E_OPEN ELSE sequence E_CLOSE)? |
> E_OPEN ELSE sequence E_CLOSE)
> | AND ( test )*
> | OR ( test )*
> | LET ( variable )? E_OPEN ( bindingSpec )* E_CLOSE body
> | LETSTAR E_OPEN ( bindingSpec )* E_CLOSE body
> | LETREC E_OPEN ( bindingSpec )* E_CLOSE body
> | BEGIN sequence
> | DO E_OPEN ( iterationSpec )* E_CLOSE E_OPEN test doResult E_CLOSE
> ( command )*
> | DELAY expression ) E_CLOSE
> ;
>
> command : expression ;
>
> condClause :
> E_OPEN test ( sequence | ARROW recipient )? E_CLOSE;
>
> recipient :
> expression;
>
> caseClause :
> E_OPEN E_OPEN ( datum )* E_CLOSE sequence E_CLOSE;
>
> bindingSpec :
> E_OPEN variable expression E_CLOSE;
>
> iterationSpec :
> E_OPEN variable init ( step )? E_CLOSE
> ;
>
> init :
> expression
> ;
>
> step :
> expression
> ;
>
> doResult :
> ( sequence )?
> ;
>
> macroUse :
> { macroNames.contains(((TokenStream)input).LT(2).getText())}? E_OPEN
> keyword ( datum )* E_CLOSE
> ;
>
> keyword returns [ String value ]: identifier { $value = $identifier.value;
> };
>
> macroBlock :
> E_OPEN ( LET_SYNTAX | LETREC_SYNTAX ) E_OPEN ( syntaxSpec )* E_CLOSE body
> E_CLOSE { macroNames.add($syntaxSpec.name); }
> ;
>
> syntaxSpec returns [String name] :
> E_OPEN keyword { $name = $keyword.value; } transformerSpec E_CLOSE
> ;
>
> quasiquotation
> options { backtrack = true;}
> scope {
> int d;
> }
> @init {
> $quasiquotation::d=1;
> }
> :
> quasiquotationD;
>
> qQTemplate
> options { backtrack = true;}
> @init { qqtDepth = $quasiquotation::d; } :
> { qqtDepth == 0 }?=>expression
> | simpleDatum
> | vectorQQTemplate
> | listQQTemplate
> | unquotation
> ;
>
> quasiquotationD
> :
> BAPOS^ qQTemplate
> | E_OPEN QUASIQUOTE qQTemplate E_CLOSE
> ;
>
> listQQTemplate
> options { backtrack = true;}
> :
> APOS^ qQTemplate
> | { $quasiquotation::d+=1; } quasiquotationD
> | E_OPEN ( ( qQTemplateOrSplice )+ ( DOT^ qQTemplate )? )? E_CLOSE
> ;
>
> vectorQQTemplate:
> HASHOPEN ( qQTemplateOrSplice )* E_CLOSE
> ;
>
> unquotation:
> COMMA^ { $quasiquotation::d-=1; } qQTemplate
> | E_OPEN UNQUOTE { $quasiquotation::d-=1; } qQTemplate E_CLOSE
> ;
>
> qQTemplateOrSplice
> options { backtrack = true;}
> :
> qQTemplate
> | splicingUnquotation
> ;
>
> splicingUnquotation :
> COMMAAT^ { $quasiquotation::d-=1; } qQTemplate
> | E_OPEN UNQUOTE_SPLICING { $quasiquotation::d-=1; } qQTemplate E_CLOSE
> ;
>
> /*
> * Transformers
> */
>
> transformerSpec :
> E_OPEN SYNTAX_RULES E_OPEN ( identifier )* E_CLOSE ( syntaxRule )*
> E_CLOSE
> ;
>
> syntaxRule :
> E_OPEN pattern template E_CLOSE
> ;
>
> pattern :
> patternIdentifier
> | E_OPEN ( ( pattern )+ ( DOT pattern | ellipsis )? )? E_CLOSE
> | HASHOPEN ( ( pattern )+ ( ellipsis )? )? E_CLOSE
> | patternDatum
> ;
>
> patternDatum :
> STRING
> | CHARACTER
> | BOOLEAN
> | NUMBER
> ;
>
> template :
> patternIdentifier
> | E_OPEN ( ( templateElement )+ ( DOT templateElement )? )? E_CLOSE
> | HASHOPEN ( templateElement )* E_CLOSE
> | templateDatum
> ;
>
> templateElement :
> template ( ellipsis )?
> ;
>
> templateDatum :
> patternDatum
> ;
>
> patternIdentifier : /* any identifier except "..." */
> syntacticKeyword
> | VARIABLE
> ;
>
> ellipsis :
> '...'
> ;
>
>
> scm returns [ String text] @init { $text = ""; } : commandOrDefinition {
> // for our purposes we allow one and only one command in an SCM_TOKEN block
> // TokenSelector.selector.pop();
>
> };
>
> commandOrDefinition :
> syntaxDefinition
> | (definition)=>definition
> | command
> ;
>
> definition :
> E_OPEN (
> DEFINE ( variable expression | E_OPEN variable defFormals E_CLOSE
> body )
> | BEGIN definition*) E_CLOSE
> ;
>
> defFormals :
> (variable)* ( DOT variable)?;
>
> syntaxDefinition :
> E_OPEN DEFINESYNTAX keyword transformerSpec E_CLOSE;
>
> /* LEXER
> */
>
> VARIABLE: INITIAL (SUBSEQUENT)* | PECULIAR_IDENTIFIER;
>
> fragment INITIAL : LETTER | SPECIAL_INITIAL;
> fragment LETTER : ('a'..'z'| 'A'..'Z' | '\u0080'..'\ufffe') ;
> fragment SPECIAL_INITIAL
> :('!'|'$'|'%'|'&'|'*'|'/'|':'|'<'|'='|'>'|'?'|'^'|'_'|'~'|'{'|'}'|'#:');
> fragment SUBSEQUENT : INITIAL | DIGIT | SPECIAL_SUBSEQUENT;
> fragment DIGIT : ('0'..'9');
> fragment SPECIAL_SUBSEQUENT : '.' | '+'|'-' | '@' ;
> fragment PECULIAR_IDENTIFIER : '+' | '-';
>
> APOS: '\'';
> BAPOS: '`';
> COMMA: ',';
> COMMAAT: ',@';
> DOT : '.';
>
> BOOLEAN : '#t' | '#f' ;
>
> HASHOPEN: '#(';
>
> CHARACTER : '#\\' (CHARACTER_NAME | ~(' '|'\n')) ;
> fragment CHARACTER_NAME : 'space' | 'newline';
>
> STRING : '"' STRING_ELEMENT* '"';
> fragment STRING_ELEMENT : ~('"' | '\\') | '\\' ('"' | '\\' );
>
> NUMBER :
> NUM_2 |
> NUM_8 |
> NUM_10 |
> NUM_16
> ;
> fragment NUM_2 : PREFIX_2 COMPLEX_2;
> fragment NUM_8 : PREFIX_8 COMPLEX_8;
> fragment NUM_10 : PREFIX_10? COMPLEX_10;
> fragment NUM_16 : PREFIX_16 COMPLEX_16;
> fragment COMPLEX_2 :
> REAL_2 ('@' REAL_2)? |
> REAL_2? SIGN ( UREAL_2 )? 'i'
> ;
> fragment COMPLEX_8 :
> REAL_8 ( '@' REAL_8)? |
> REAL_8? SIGN UREAL_8? 'i'
> ;
> fragment COMPLEX_10 :
> REAL_10 ('@' REAL_10)? |
> REAL_10? SIGN UREAL_10? 'i'
> ;
> fragment COMPLEX_16 :
> REAL_16 ('@' REAL_16)? |
> REAL_16? SIGN UREAL_16? 'i'
> ;
> fragment REAL_2 : (SIGN)? UREAL_2;
> fragment REAL_8 : (SIGN)? UREAL_8;
> fragment REAL_10 : (SIGN)? UREAL_10;
> fragment REAL_16 : (SIGN)? UREAL_16;
> fragment UREAL_2 : UINTEGER_2 ( '/' UINTEGER_2 )?;
> fragment UREAL_8 : UINTEGER_8 ( '/' UINTEGER_8 )?;
> fragment UREAL_10 : (UINTEGER_10 '/')=> UINTEGER_10 '/' UINTEGER_10 |
> DECIMAL_10;
> fragment UREAL_16 : UINTEGER_16 ( '/' UINTEGER_16 )?;
> fragment DECIMAL_10 :
> ( UINTEGER_10
> | '.' DIGIT_10+ '#'*
> | DIGIT_10+ '.' DIGIT_10* '#'*
> | DIGIT_10+ '#'+ '.' '#'*
> ) SUFFIX
> ;
> fragment UINTEGER_2 : ( DIGIT_2 )+ ( '#' )*;
> fragment UINTEGER_8 : ( DIGIT_8 )+ ( '#' )*;
> fragment UINTEGER_10 : ( DIGIT_10 )+ ( '#' )*;
> fragment UINTEGER_16 : ( DIGIT_16 )+ ( '#' )*;
> fragment PREFIX_2 : EXACTNESS RADIX_2 | RADIX_2 EXACTNESS | RADIX_2;
> fragment PREFIX_8 : RADIX_8 EXACTNESS | EXACTNESS RADIX_8 | RADIX_8;
> fragment PREFIX_10 : RADIX_10 EXACTNESS | EXACTNESS RADIX_10 | EXACTNESS |
> RADIX_10;
> fragment PREFIX_16 : RADIX_16 EXACTNESS | EXACTNESS RADIX_16 | RADIX_16;
> fragment SUFFIX : (EXPONENT_MARKER SIGN? DIGIT+)?;
> fragment EXPONENT_MARKER : ('e'|'s'|'f'|'d'|'l');
> fragment SIGN : ('+'|'-');
> fragment EXACTNESS : ( '#i' | '#e');
> fragment RADIX_2 : '#b';
> fragment RADIX_8 : '#o';
> fragment RADIX_10 : '#d';
> fragment RADIX_16 : '#x';
> fragment DIGIT_2 : '0'|'1';
> fragment DIGIT_8 : '0'..'7';
> fragment DIGIT_10 : '0'..'9';
> fragment DIGIT_16 : DIGIT_10 | 'a'..'f';
>
> COMMENT :
> ';' .* '\n' {$channel=HIDDEN;}
> ;
>
> WS : (' ' | '\t' | '\n' | '\r') { $channel=HIDDEN; }
> ;
>
>
> E_OPEN : '(';
> E_CLOSE: ')';
>
> /* Lilypond expressions are regarded as ML comments now.
> However, we could switch to another lexer analogous to SCM_T in
> lily-antlr.g
> */
> LILYEXP
> : '#{' (options {greedy=false;} : .)* '#}' {$channel=HIDDEN;}
> ;
>
>
>
>