[antlr-interest] ANTLR3 Nested parser

Thomas Brandon tbrandonau at gmail.com
Wed Jan 23 02:02:59 PST 2008


I am not familiar with Scheme, but assuming all parentheses are balanced
except inside known literal contexts, something like this should work:
SCHEME_BLOCK
    :    '#' NESTED_SCHEME_BLOCK
    ;

fragment
NESTED_SCHEME_BLOCK
    :    '('
    (    options {greedy=false; k=1;}
    :    NESTED_SCHEME_BLOCK
    |    STRING_LITERAL
    |    CHAR_LITERAL
    |    .
    )*
    ')'
    ;
This assumes string and char literals are the only literal contexts.
You may not need the k=1 option, but a similar construct in the ANTLR grammar
caused the LL(*) analysis to explode, and setting k=1 fixed it.
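
To make the intent of the rule concrete, here is a rough hand-written Java
sketch of the same scan: count parentheses from the '#(' onwards and skip
over string and char literals so that parentheses inside them are ignored.
The class and method names are made up for illustration and are not part of
ANTLR. It assumes, as above, that "..." strings and #\x characters are the
only literal contexts; Scheme comments are not handled.

```java
// Sketch only: a hand-written counterpart of the SCHEME_BLOCK and
// NESTED_SCHEME_BLOCK rules. All names here are hypothetical.
final class SchemeBlockScanner {

    private SchemeBlockScanner() {}

    /**
     * Returns the index just past the ')' that closes the block beginning
     * with "#(" at position start, or -1 if text does not have "#(" there
     * or the block is never closed.
     */
    static int findBlockEnd(String text, int start) {
        if (start < 0 || start + 1 >= text.length()
                || text.charAt(start) != '#' || text.charAt(start + 1) != '(') {
            return -1;
        }
        int depth = 0;
        int i = start + 1; // points at the opening '('
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '"') {
                // String literal: jump past its closing quote.
                i = skipString(text, i);
                if (i < 0) {
                    return -1; // unterminated string
                }
            } else if (c == '#' && i + 1 < text.length() && text.charAt(i + 1) == '\\') {
                // Char literal such as #\( : skip '#', '\' and one character.
                i += 3;
            } else {
                if (c == '(') {
                    depth++;
                } else if (c == ')') {
                    depth--;
                    if (depth == 0) {
                        return i + 1; // matched the outermost '('
                    }
                }
                i++;
            }
        }
        return -1; // ran out of input before the block closed
    }

    /** Returns the index just past the closing quote, or -1 if unterminated. */
    private static int skipString(String text, int open) {
        int i = open + 1;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '\\') {
                i += 2; // skip the escaped character
            } else if (c == '"') {
                return i + 1;
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String source = "x = #(a (b \")\") #\\( c) y";
        int start = source.indexOf("#(");
        int end = findBlockEnd(source, start);
        // Prints the embedded block only, without the surrounding outer text.
        System.out.println(source.substring(start, end));
    }
}
```

The text between start and the returned index is what the lexer rule would
hand over as a single SCHEME_BLOCK token, which the outer parser could then
pass to the Scheme parser.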

Tom.
On Jan 23, 2008 7:20 PM, Bertalan Fodor (LilyPondTool) <
lilypondtool at organum.hu> wrote:

>
>  But, that seems like you will end up lexing a lot of things you should
> not need to this way, especially if you have a lot of embedded elements.
> Perhaps if you mention what you are trying to parse, then better solutions
> can be thought of.
>
>
>
> One other thought is whether the first lexer can determine where the
> embedded language starts and stops in which case you can tokenize the whole
> text into one token and invoke the embedded language parser from your
> parser.
>
> Thanks for your ideas. I attach the nested-language grammar so you can see
> what I would need to match in order to find the end of the input.
> The outer grammar looks for the '#' symbol, upon seeing that it will
> switch to the nested language like this.
>
> assignment_in_outer = #(inner-language expression (which-is scheme))
>
> The real fun is that the inner-language expression can even contain
> fragments written in the outer language. (If you know Scheme, you know
> that it allows redefining the language itself.) For now I treat them as
> multiline comments for simplicity.
>
> Bert
>
>
>
> grammar Scheme;
> options {
>        output=AST;
>        language=Java;
>        memoize=true;
> }
>
> tokens {
>        ARROW = '=>';
>        ELSE = 'else';
>
>        UNQUOTE = 'unquote';
>        UNQUOTE_SPLICING = 'unquote-splicing';
>        IF = 'if';
>        SET = 'set!';
>        COND = 'cond';
>        AND = 'and';
>        OR = 'or';
>        CASE = 'case';
>        LET = 'let';
>        LETSTAR = 'let*';
>        LETREC = 'letrec';
>        DO = 'do';
>        DELAY = 'delay';
>        QUASIQUOTE = 'quasiquote';
>
>        LET_SYNTAX = 'let-syntax';
>        LETREC_SYNTAX = 'letrec-syntax';
>        SYNTAX_RULES = 'syntax-rules';
>        QUOTE=  'quote';
>        LAMBDA= 'lambda';
>        BEGIN=  'begin';
>        DEFINE= 'define';
>        DEFINESYNTAX=   'define-syntax'
>        ;
> }
>
>
> @members {
>        private int qqtDepth;
>        private static java.util.logging.Logger log =
> java.util.logging.Logger.getLogger("LilyParser");
>        private Set macroNames = new HashSet();
>        }
> @header {
>        package lilytool.parser.antlr;
>
>        import java.util.HashSet;
>        import java.util.Set;
> }
>
> @lexer::header {
>        package lilytool.parser.antlr;
> }
>
>
> identifier returns [String value]:
>  syntacticKeyword { $value = $syntacticKeyword.value; }
> | variable { $value = $variable.value; };
>
>
> syntacticKeyword returns [ String value ]:
>  expressionKeyword { $value = $expressionKeyword.value; }
> | ELSE^ { $value = $ELSE.text; }
> | ARROW^ { $value = $ARROW.text; }
> | DEFINE^ { $value = $DEFINE.text; }
> | UNQUOTE^ { $value = $UNQUOTE.text; }
> | UNQUOTE_SPLICING^ { $value = $UNQUOTE_SPLICING.text; }
> ;
>
> expressionKeyword returns [ String value ]:
>  QUOTE^ { $value = $QUOTE.text; }
> | LAMBDA^ { $value = $LAMBDA.text; }
> | IF^ { $value = $IF.text; }
> | SET^ { $value = $SET.text; }
> | BEGIN^ { $value = $BEGIN.text; }
> | COND^ { $value = $COND.text; }
> | AND^ { $value = $AND.text; }
> | OR^ { $value = $OR.text; }
> | CASE^ { $value = $CASE.text; }
> | LET^ { $value = $LET.text; }
> | LETSTAR^ { $value = $LETSTAR.text; }
> | LETREC^ { $value = $LETREC.text; }
> | DO^ { $value = $DO.text; }
> | DELAY^ { $value = $DELAY.text; }
> | QUASIQUOTE^ { $value = $QUASIQUOTE.text; }
> ;
>
> variable returns [ String value ] :
>  VARIABLE^ { $value = $VARIABLE.text; }
> | '...' { $value = "..."; }
> ;
>
>
> /*
>  * External representations
>  */
>
> datum :
>  simpleDatum
> | compoundDatum
> ;
>
> simpleDatum :
>  BOOLEAN
> | NUMBER
> | CHARACTER
> | STRING
> | symbol
> ;
>
> symbol :
>  identifier
> ;
>
> compoundDatum :
>  list
> | vector
> ;
>
> list :
>  E_OPEN ( ( datum )+ ( DOT^ datum )? )? E_CLOSE
> | abbreviation
> ;
>
> abbreviation :
>  abbrevPrefix datum
> ;
>
> abbrevPrefix :
>  APOS^
> | BAPOS^ // "`"
> | COMMA^ // ","
> | COMMAAT^ //",@"
> ;
>
> vector :
>  HASHOPEN^ ( datum )* E_CLOSE
> ;
>
> /*
>  * expressions
>  */
>
> expression :
>  variable
> | literal
> |  lambdaExpression
> |  conditional
> |  assignment
> |  derivedExpression
> | procedureCall
> |  macroUse
> | macroBlock
> ;
>
> literal :
>  quotation
> | selfEvaluating
> ;
>
> selfEvaluating :
>  BOOLEAN
> | NUMBER
> | CHARACTER
> | s=STRING
> ;
>
> quotation :
>  APOS^ datum
> | E_OPEN QUOTE datum E_CLOSE
> ;
>
> procedureCall :
>  E_OPEN operator ( operand )* E_CLOSE
> ;
>
> operator :
>  expression
> ;
>
> operand :
>  expression
> ;
>
> lambdaExpression :
>  E_OPEN LAMBDA formals body E_CLOSE
> ;
>
> formals :
>  E_OPEN ( ( variable )+ ( DOT variable )? )? E_CLOSE
> | variable
> ;
>
> body :
> ((definition) => definition)* sequence
> ;
>
> sequence :  command* expression;
>
> conditional :
>  E_OPEN IF test consequent alternate E_CLOSE
> ;
>
> test :
>  expression
> ;
>
> consequent :
>  expression
> ;
>
> alternate :
> ( expression )?
> ;
>
> assignment :
>  E_OPEN SET variable expression E_CLOSE
> ;
>
> derivedExpression :
> (quasiquotation)=>quasiquotation |
> E_OPEN ( COND ( condClause+ (E_OPEN ELSE sequence E_CLOSE)? | E_OPEN ELSE
> sequence E_CLOSE)
>      | CASE expression ( caseClause+ (E_OPEN ELSE sequence E_CLOSE)? |
> E_OPEN ELSE sequence E_CLOSE)
>      | AND ( test )*
>      | OR ( test )*
>      | LET ( variable )? E_OPEN ( bindingSpec )* E_CLOSE body
>      | LETSTAR E_OPEN ( bindingSpec )* E_CLOSE body
>      | LETREC E_OPEN ( bindingSpec )* E_CLOSE body
>      | BEGIN sequence
>      | DO E_OPEN ( iterationSpec )* E_CLOSE E_OPEN test doResult E_CLOSE
>        ( command )*
>      | DELAY expression ) E_CLOSE
> ;
>
> command : expression    ;
>
> condClause :
>  E_OPEN test ( sequence | ARROW recipient )? E_CLOSE;
>
> recipient :
>  expression;
>
> caseClause :
>  E_OPEN E_OPEN ( datum )* E_CLOSE sequence E_CLOSE;
>
> bindingSpec :
> E_OPEN variable expression E_CLOSE;
>
> iterationSpec :
>  E_OPEN variable init ( step )? E_CLOSE
> ;
>
> init :
>  expression
> ;
>
> step :
>  expression
> ;
>
> doResult :
>  ( sequence )?
> ;
>
> macroUse :
>  { macroNames.contains(((TokenStream)input).LT(2).getText())}? E_OPEN
> keyword ( datum )* E_CLOSE
> ;
>
> keyword returns [ String value ]: identifier { $value = $identifier.value;
> };
>
> macroBlock :
>  E_OPEN ( LET_SYNTAX | LETREC_SYNTAX ) E_OPEN ( syntaxSpec )* E_CLOSE body
> E_CLOSE { macroNames.add($syntaxSpec.name); }
> ;
>
> syntaxSpec returns [String name] :
>  E_OPEN keyword { $name = $keyword.value; } transformerSpec E_CLOSE
> ;
>
> quasiquotation
> options { backtrack = true;}
> scope {
>        int d;
>        }
> @init {
>        $quasiquotation::d=1;
> }
> :
>  quasiquotationD;
>
> qQTemplate
> options { backtrack = true;}
> @init { qqtDepth = $quasiquotation::d; } :
>  { qqtDepth == 0 }?=>expression
> | simpleDatum
> | vectorQQTemplate
> | listQQTemplate
> | unquotation
> ;
>
> quasiquotationD
> :
>  BAPOS^ qQTemplate
> | E_OPEN QUASIQUOTE qQTemplate E_CLOSE
> ;
>
> listQQTemplate
> options { backtrack = true;}
> :
>  APOS^ qQTemplate
> |  { $quasiquotation::d+=1; } quasiquotationD
> |        E_OPEN ( ( qQTemplateOrSplice )+ ( DOT^ qQTemplate )? )? E_CLOSE
> ;
>
> vectorQQTemplate:
>  HASHOPEN ( qQTemplateOrSplice )* E_CLOSE
> ;
>
> unquotation:
>  COMMA^ { $quasiquotation::d-=1; } qQTemplate
> | E_OPEN UNQUOTE { $quasiquotation::d-=1; } qQTemplate E_CLOSE
> ;
>
> qQTemplateOrSplice
> options { backtrack = true;}
> :
>   qQTemplate
> | splicingUnquotation
> ;
>
> splicingUnquotation :
>  COMMAAT^ { $quasiquotation::d-=1; } qQTemplate
> | E_OPEN UNQUOTE_SPLICING { $quasiquotation::d-=1; } qQTemplate E_CLOSE
> ;
>
> /*
>  * Transformers
>  */
>
> transformerSpec :
>  E_OPEN SYNTAX_RULES E_OPEN ( identifier )* E_CLOSE ( syntaxRule )*
> E_CLOSE
> ;
>
> syntaxRule :
>  E_OPEN pattern template E_CLOSE
> ;
>
> pattern :
>  patternIdentifier
> | E_OPEN ( ( pattern )+ ( DOT pattern | ellipsis )? )?  E_CLOSE
> | HASHOPEN ( ( pattern )+ ( ellipsis )? )? E_CLOSE
> | patternDatum
> ;
>
> patternDatum :
>  STRING
> | CHARACTER
> | BOOLEAN
> | NUMBER
> ;
>
> template :
>  patternIdentifier
> | E_OPEN ( ( templateElement )+ ( DOT templateElement )? )? E_CLOSE
> | HASHOPEN ( templateElement )* E_CLOSE
> | templateDatum
> ;
>
> templateElement :
>  template ( ellipsis )?
> ;
>
> templateDatum :
>  patternDatum
> ;
>
> patternIdentifier : /* any identifier except "..." */
>  syntacticKeyword
> | VARIABLE
> ;
>
> ellipsis :
>  '...'
> ;
>
>
> scm returns [ String text] @init { $text = ""; } :  commandOrDefinition {
>        // for our purposes we allow one and only one command in an SCM_TOKEN block
>        //   TokenSelector.selector.pop();
>
> };
>
> commandOrDefinition :
>        syntaxDefinition
>        | (definition)=>definition
>        | command
> ;
>
> definition :
> E_OPEN (
>        DEFINE ( variable expression | E_OPEN variable defFormals E_CLOSE
> body )
>        |       BEGIN definition*) E_CLOSE
> ;
>
> defFormals :
> (variable)* ( DOT variable)?;
>
> syntaxDefinition :
>  E_OPEN DEFINESYNTAX keyword transformerSpec E_CLOSE;
>
>  /* LEXER
>  */
>
> VARIABLE:       INITIAL (SUBSEQUENT)* | PECULIAR_IDENTIFIER;
>
> fragment INITIAL : LETTER | SPECIAL_INITIAL;
> fragment LETTER : ('a'..'z'| 'A'..'Z' | '\u0080'..'\ufffe') ;
> fragment SPECIAL_INITIAL
> :('!'|'$'|'%'|'&'|'*'|'/'|':'|'<'|'='|'>'|'?'|'^'|'_'|'~'|'{'|'}'|'#:');
> fragment SUBSEQUENT : INITIAL | DIGIT | SPECIAL_SUBSEQUENT;
> fragment DIGIT : ('0'..'9');
> fragment SPECIAL_SUBSEQUENT :  '.' | '+'|'-' | '@' ;
> fragment PECULIAR_IDENTIFIER : '+' | '-';
>
> APOS: '\'';
> BAPOS: '`';
> COMMA: ',';
> COMMAAT: ',@';
> DOT : '.';
>
> BOOLEAN : '#t' | '#f' ;
>
> HASHOPEN: '#(';
>
> CHARACTER : '#\\' (CHARACTER_NAME | ~(' '|'\n')) ;
> fragment CHARACTER_NAME : 'space' | 'newline';
>
> STRING : '"' STRING_ELEMENT* '"';
> fragment STRING_ELEMENT : ~('"' | '\\') | '\\' ('"' | '\\' );
>
> NUMBER :
>        NUM_2 |
>        NUM_8 |
>        NUM_10 |
>        NUM_16
> ;
> fragment NUM_2 : PREFIX_2 COMPLEX_2;
> fragment NUM_8 : PREFIX_8 COMPLEX_8;
> fragment NUM_10 : PREFIX_10? COMPLEX_10;
> fragment NUM_16 : PREFIX_16 COMPLEX_16;
> fragment COMPLEX_2 :
>        REAL_2 ('@' REAL_2)? |
>        REAL_2? SIGN ( UREAL_2 )? 'i'
>        ;
> fragment COMPLEX_8 :
>        REAL_8 ( '@' REAL_8)? |
>        REAL_8? SIGN UREAL_8? 'i'
>        ;
> fragment COMPLEX_10 :
>        REAL_10 ('@' REAL_10)? |
>        REAL_10? SIGN UREAL_10? 'i'
> ;
> fragment COMPLEX_16 :
>        REAL_16 ('@' REAL_16)? |
>        REAL_16? SIGN UREAL_16? 'i'
>        ;
> fragment REAL_2 : (SIGN)? UREAL_2;
> fragment REAL_8 : (SIGN)? UREAL_8;
> fragment REAL_10 : (SIGN)? UREAL_10;
> fragment REAL_16 : (SIGN)? UREAL_16;
> fragment UREAL_2 : UINTEGER_2 ( '/' UINTEGER_2 )?;
> fragment UREAL_8 : UINTEGER_8 ( '/' UINTEGER_8 )?;
> fragment UREAL_10 : (UINTEGER_10 '/')=> UINTEGER_10 '/' UINTEGER_10 |
> DECIMAL_10;
> fragment UREAL_16 : UINTEGER_16 ( '/' UINTEGER_16 )?;
> fragment DECIMAL_10 :
>        ( UINTEGER_10
>        | '.' DIGIT_10+ '#'*
>        | DIGIT_10+ '.' DIGIT_10* '#'*
>        | DIGIT_10+ '#'+ '.' '#'*
>        ) SUFFIX
>        ;
> fragment UINTEGER_2 : ( DIGIT_2 )+ ( '#' )*;
> fragment UINTEGER_8 : ( DIGIT_8 )+ ( '#' )*;
> fragment UINTEGER_10 : ( DIGIT_10 )+ ( '#' )*;
> fragment UINTEGER_16 : ( DIGIT_16 )+ ( '#' )*;
> fragment PREFIX_2 : EXACTNESS RADIX_2 | RADIX_2 EXACTNESS | RADIX_2;
> fragment PREFIX_8 : RADIX_8 EXACTNESS | EXACTNESS RADIX_8 | RADIX_8;
> fragment PREFIX_10 : RADIX_10 EXACTNESS | EXACTNESS RADIX_10 | EXACTNESS |
> RADIX_10;
> fragment PREFIX_16 : RADIX_16 EXACTNESS | EXACTNESS RADIX_16 | RADIX_16;
> fragment SUFFIX : (EXPONENT_MARKER SIGN? DIGIT+)?;
> fragment EXPONENT_MARKER : ('e'|'s'|'f'|'d'|'l');
> fragment SIGN : ('+'|'-');
> fragment EXACTNESS : ( '#i' | '#e');
> fragment RADIX_2 : '#b';
> fragment RADIX_8 : '#o';
> fragment RADIX_10 : '#d';
> fragment RADIX_16 : '#x';
> fragment DIGIT_2 : '0'|'1';
> fragment DIGIT_8 : '0'..'7';
> fragment DIGIT_10 : '0'..'9';
> fragment DIGIT_16 : DIGIT_10 | 'a'..'f';
>
> COMMENT :
>                ';' .* '\n' {$channel=HIDDEN;}
>                ;
>
> WS :    (' ' | '\t' | '\n' | '\r') { $channel=HIDDEN; }
>        ;
>
>
> E_OPEN : '(';
> E_CLOSE: ')';
>
> /* Lilypond expressions are regarded as ML comments now.
> However, we could switch to another lexer analogous to SCM_T in
> lily-antlr.g
> */
> LILYEXP
>        : '#{' (options {greedy=false;} : .)*   '#}'  {$channel=HIDDEN;}
>        ;
>
>
>
>