[antlr-interest] Can anyone help with a basic grammar problem in Antlr 3?
Michael Bedward
michael.bedward at gmail.com
Fri Oct 14 05:30:05 PDT 2011
Hi Ross,
Just a quick response right now (it's bedtime here)...
This is sounding like something best handled in the AST rather than
trying to bung it all in the initial parse. I would look at making the
grammar lenient enough to accept your naked method calls and then
disambiguate on a subsequent pass when you have a symbol table hooked
up.
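
A minimal sketch of that second pass, in Java. It assumes the parser has already accepted every bare name as a plain IDENTIFIER node, and that a name counts as a variable only once an assignment to it has been seen, as Ross describes for his earlier projects. The Node class is a hypothetical stand-in for ANTLR's CommonTree so the example runs without the ANTLR runtime, and the VAR_REF tag and BareIdentifierResolver name are invented for illustration (METHOD_CALL is the imaginary token from the grammar below).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-alone sketch; Node stands in for ANTLR's CommonTree. */
class BareIdentifierResolver {

    /** Minimal AST node: a type tag, the token text, and child nodes. */
    static final class Node {
        String type;        // e.g. "ASSIGN", "IDENTIFIER", "VAR_REF", "METHOD_CALL"
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String type, String text, Node... kids) {
            this.type = type;
            this.text = text;
            children.addAll(Arrays.asList(kids));
        }

        @Override
        public String toString() {
            if (children.isEmpty()) return type + ":" + text;
            StringBuilder sb = new StringBuilder("(" + type);
            for (Node c : children) sb.append(' ').append(c);
            return sb.append(')').toString();
        }
    }

    /** Walks statements in source order and retags every bare IDENTIFIER. */
    static void resolve(List<Node> statements) {
        Set<String> assigned = new HashSet<>();  // names seen as assignment targets
        for (Node stmt : statements) visit(stmt, assigned);
    }

    private static void visit(Node n, Set<String> assigned) {
        if (n.type.equals("ASSIGN")) {
            // Children are [target IDENTIFIER, value expression].
            Node target = n.children.get(0);
            visit(n.children.get(1), assigned);  // resolve the right-hand side first
            target.type = "VAR_REF";
            assigned.add(target.text);
            return;
        }
        if (n.type.equals("IDENTIFIER")) {
            // Known variable if assigned earlier, otherwise a paren-less call.
            n.type = assigned.contains(n.text) ? "VAR_REF" : "METHOD_CALL";
            return;
        }
        for (Node c : n.children) visit(c, assigned);
    }

    public static void main(String[] args) {
        // Trees for the two statements "a = foo" and "b = a".
        List<Node> script = Arrays.asList(
            new Node("ASSIGN", "=", new Node("IDENTIFIER", "a"), new Node("IDENTIFIER", "foo")),
            new Node("ASSIGN", "=", new Node("IDENTIFIER", "b"), new Node("IDENTIFIER", "a")));
        resolve(script);
        for (Node stmt : script) System.out.println(stmt);
        // prints:
        // (ASSIGN VAR_REF:a METHOD_CALL:foo)
        // (ASSIGN VAR_REF:b VAR_REF:a)
    }
}
```

With real ANTLR output the same walk could be done in a tree grammar or a hand-written visitor over CommonTree; the point is only that the parser stays lenient and the var-or-method decision moves to a pass that has the symbol information.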
I'll have a proper look tomorrow, and perhaps others here might
suggest a better approach in the meantime.
Michael
On 14 October 2011 23:11, Ross Bamford <roscoml at gmail.com> wrote:
> Hi Michael,
> I've tried adding IDENTIFIER into the atom rule, and that solved the
> assignment expression issues I was having, but unfortunately it broke the
> method call parsing completely - the issue seemed to stem from the parser
> not being able to differentiate between a function call, and a plain
> identifier - cue many and varied MismatchedSetExceptions :(. After much
> debugging with ANTLRWorks (what a great tool by the way!!) the only thing
> I've found that fixes this is to make parens mandatory on a method call.
> Having done that after making the other changes you suggested I've now got
> my tests passing (except obviously the ones that call methods without
> parens). I've also made some other changes based on another of your messages
> I found in the archives, to make input such as "foo(1 2)" throw an exception
> rather than just printing a warning and ignoring the "2" - so thanks again!
> :)
> I would still really like to have optional parens on method calls, which I
> know is difficult... I have a little experience with parsing Ruby, for
> example, and I know there's an ambiguous case such as:
> a = foo
> whereby do I treat foo as a var or a method? In other projects I've managed
> to handle that at runtime by treating foo as a var only if an assignment
> statement such as "foo = 1" has been seen before, and I seem to recall I
> simply parsed all such cases as a bare IDENTIFIER to be resolved later, but
> I have no idea how to make this work in my current Antlr parser.
> Here is my grammar as it stands now:
> /* ************* GRAMMAR **************** */
> grammar BasicLang;
> options {
> output=AST;
> ASTLabelType=CommonTree;
> backtrack=true;
> memoize=true;
> }
> tokens {
> ASSIGN;
> METHOD_CALL;
> SELF;
> }
> @parser::members {
> /* throw exceptions rather than silently failing... */
> protected void mismatch(IntStream input, int ttype, BitSet follow)
> throws RecognitionException
> {
> throw new MismatchedTokenException(ttype, input);
> }
> @Override
> public Object recoverFromMismatchedSet(IntStream input, RecognitionException
> e, BitSet follow)
> throws RecognitionException
> {
> throw e;
> }
> @Override
> protected Object recoverFromMismatchedToken(IntStream input, int
> ttype, BitSet follow) throws RecognitionException {
> if (ttype == RPAREN) {
> throw new UnwantedTokenException(); // ("Invalid input in argument list");
> }
> return super.recoverFromMismatchedToken(input, ttype, follow);
> }
> }
> @rulecatch {
> // throw exceptions rather than silently failing...
> catch (RecognitionException e) {
> throw e;
> }
> }
> start_rule
> : script
> ;
> script
> : statement+
> | EOF!
> ;
> statement
> : expr terminator!
> ;
>
> expr
> : assign_expr
> | math_expr
> | meth_call_expr
> ;
>
> meth_call_expr
> : (IDENTIFIER DOT)? func_call_expr -> ^(METHOD_CALL IDENTIFIER?
> func_call_expr)
> | (STRING_LITERAL DOT)? func_call_expr -> ^(METHOD_CALL STRING_LITERAL?
> func_call_expr)
> ;
>
> fragment
> func_call_expr
> : IDENTIFIER^ argument_list
> ;
>
> fragment
> argument_list
> : LPAREN! (expr (COMMA! expr)*)? RPAREN!
> ;
>
> assign_expr
> : IDENTIFIER ASSIGN expr -> ^(ASSIGN IDENTIFIER expr)
> ;
> math_expr
> : mult_expr ((ADD^|SUB^) mult_expr)*
> ;
> mult_expr
> : pow_expr ((MUL^|DIV^|MOD^) pow_expr)*
> ;
>
> pow_expr
> : unary_expr ((POW^) unary_expr)*
> ;
>
> unary_expr
> : NOT? atom
> ;
> atom
> : literal
> | IDENTIFIER
> | LPAREN! expr RPAREN!
> ;
>
> literal
> : HEX_LITERAL
> | DECIMAL_LITERAL
> | OCTAL_LITERAL
> | FLOATING_POINT_LITERAL
> // | REGEXP_LITERAL
> | STRING_LITERAL
> ;
>
> terminator
> : TERMINATOR
> | EOF
> ;
> POW : '^' ;
> MOD : '%' ;
> ADD : '+' ;
> SUB : '-' ;
> DIV : '/' ;
> MUL : '*' ;
> NOT : '!' ;
> ASSIGN
> : '='
> ;
>
> LPAREN
> : '('
> ;
>
> RPAREN
> : ')'
> ;
>
> COMMA
> : ','
> ;
>
> DOT : '.' ;
> CHARACTER_LITERAL
> : '\'' ( EscapeSequence | ~('\''|'\\') ) '\''
> ;
> STRING_LITERAL
> : '"' ( EscapeSequence | ~('\\'|'"') )* '"'
> ;
> /*
> REGEXP_LITERAL
> : '/' ( EscapeSequence | ~('\\'|'"') )* '/'
> ;
> */
> HEX_LITERAL : '0' ('x'|'X') HexDigit+ IntegerTypeSuffix? ;
> DECIMAL_LITERAL : ('0' | '1'..'9' '0'..'9'*) IntegerTypeSuffix? ;
> OCTAL_LITERAL : '0' ('0'..'7')+ IntegerTypeSuffix? ;
> fragment
> HexDigit : ('0'..'9'|'a'..'f'|'A'..'F') ;
> fragment
> IntegerTypeSuffix
> : ('l'|'L')
> | ('u'|'U') ('l'|'L')?
> ;
> FLOATING_POINT_LITERAL
> : ('0'..'9')+ '.' ('0'..'9')* Exponent? FloatTypeSuffix?
> | '.' ('0'..'9')+ Exponent? FloatTypeSuffix?
> | ('0'..'9')+ Exponent? FloatTypeSuffix?
> ;
> fragment
> Exponent : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
> fragment
> FloatTypeSuffix : ('f'|'F'|'d'|'D') ;
> fragment
> EscapeSequence
> : '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\'|'/')
> | OctalEscape
> ;
> fragment
> OctalEscape
> : '\\' ('0'..'3') ('0'..'7') ('0'..'7')
> | '\\' ('0'..'7') ('0'..'7')
> | '\\' ('0'..'7')
> ;
> fragment
> UnicodeEscape
> : '\\' 'u' HexDigit HexDigit HexDigit HexDigit
> ;
> COMMENT
> : '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
> ;
> LINE_COMMENT
> : '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
> ;
>
> IDENTIFIER
> : ID_LETTER (ID_LETTER|'0'..'9')*
> ;
>
> fragment
> ID_LETTER
> : '$'
> | 'A'..'Z'
> | 'a'..'z'
> | '_'
> ;
> TERMINATOR
> : '\r'? '\n'
> | ';'
> ;
> WS : (' '|'\r'|'\t'|'\u000C') {$channel=HIDDEN;}
> | '...' '\r'? '\n' {$channel=HIDDEN;}
> ;
> /* ************* END ************************ */
>
>
> On Fri, Oct 14, 2011 at 2:12 AM, Michael
> Bedward <michael.bedward at gmail.com> wrote:
>>
>> Ah yes, it's getting stuck on the "b" because you haven't told it that
>> identifiers are atoms...
>>
>> atom
>> : literal
>> | IDENTIFIER
>> | LPAREN! expr RPAREN!
>> ;
>>
>> Michael
>>
>>
>> On 14 October 2011 11:42, Ross Bamford <roscoml at gmail.com> wrote:
>> > Hi Michael,
>> > Thanks for the response! And thanks for being kind about my basic grammar :)
>> > I tried reordering the alternatives in expr as you suggested, and am a bit
>> > closer now than I was before! It's definitely parsing a = 1 + (b = 2) fine,
>> > but I'm still seeing NoViableAltExceptions with, for example "a=b+(c=2)".
>> > Looking at the debugger step by step it seems to still be trying to grab
>> > "b+" as a token, rather than seeing the "b" then the "+", which is why I
>> > tried adding IDENTIFIER to the "atom" rule previously. I tried adding it
>> > again after making the change you suggested but it still caused a lot of
>> > problems in other places.
>> > Thanks,
>> > Ross
>> >
>> > On Fri, Oct 14, 2011 at 1:04 AM, Michael Bedward
>> > <michael.bedward at gmail.com>
>> > wrote:
>> >>
>> >> Hi Ross,
>> >>
>> >> For a bit of a newbie that's a nice grammar - much neater than any of mine :)
>> >>
>> >> If you rearrange your expr rule so that the assign_expr is the first
>> >> alternative...
>> >>
>> >> expr
>> >> : assign_expr
>> >> | math_expr
>> >> | meth_call_expr
>> >> ;
>> >>
>> >> ...I think that the grammar should be able to parse things like
>> >> a = 1 + (b = 2)
>> >>
>> >> Michael
>> >>
>> >>
>> >> On 14 October 2011 10:38, Ross Bamford <roscoml at gmail.com> wrote:
>> >> > Hi Guys,
>> >> >
>> >> > I'm a bit of an Antlr newbie - I've successfully created and used Antlr 2
>> >> > grammars in the past but mostly by trial and error, and occasionally random
>> >> > hacking until it "worked"... I've recently become involved in a project that
>> >> > requires a very simple scripting language, and have decided to use Antlr 3
>> >> > for this, but I'm getting stuck quite early on - I think I have a
>> >> > fundamental problem in my grammar but after much hacking at it and trying
>> >> > various ideas I got from Google, I'm still hitting a bit of a brick wall.
>> >> >
>> >> > Basically I'm at the point where I have mathematical expressions and various
>> >> > literal types implemented, and am adding in function and method call
>> >> > handling - I want to be able to call methods with or without an explicit
>> >> > receiver, and in my language parentheses are optional (I know that
>> >> > complicates matters a bit but it's what I need for this project). I've
>> >> > written the grammar so far against a set of functional tests, and all is
>> >> > well with most of my syntax. Here is my grammar:
>> >> >
>> >> > /* ********* GRAMMAR *********** */
>> >> > grammar BasicLang;
>> >> >
>> >> > options {
>> >> > output=AST;
>> >> > ASTLabelType=CommonTree;
>> >> > backtrack=true;
>> >> > memoize=true;
>> >> > }
>> >> >
>> >> > tokens {
>> >> > ASSIGN;
>> >> > METHOD_CALL;
>> >> > SELF;
>> >> > }
>> >> >
>> >> > @parser::members {
>> >> > /* throw exceptions rather than silently failing... */
>> >> > protected void mismatch(IntStream input, int ttype, BitSet follow)
>> >> > throws RecognitionException
>> >> > {
>> >> > throw new MismatchedTokenException(ttype, input);
>> >> > }
>> >> > public Object recoverFromMismatchedSet(IntStream input,
>> >> > RecognitionException e, BitSet follow)
>> >> > throws RecognitionException
>> >> > {
>> >> > throw e;
>> >> > }
>> >> > }
>> >> >
>> >> > @rulecatch {
>> >> > // throw exceptions rather than silently failing...
>> >> > catch (RecognitionException e) {
>> >> > throw e;
>> >> > }
>> >> > }
>> >> >
>> >> > start_rule
>> >> > : script
>> >> > ;
>> >> >
>> >> > script
>> >> > : statement*
>> >> > ;
>> >> >
>> >> > statement
>> >> > : expr terminator!
>> >> > ;
>> >> >
>> >> > expr
>> >> > : math_expr
>> >> > | assign_expr
>> >> > | meth_call_expr
>> >> > ;
>> >> >
>> >> > meth_call_expr
>> >> > : (IDENTIFIER DOT)? func_call_expr -> ^(METHOD_CALL IDENTIFIER?
>> >> > func_call_expr)
>> >> > | (STRING_LITERAL DOT)? func_call_expr -> ^(METHOD_CALL
>> >> > STRING_LITERAL?
>> >> > func_call_expr)
>> >> > ;
>> >> >
>> >> > fragment
>> >> > func_call_expr
>> >> > : IDENTIFIER^ argument_list
>> >> > ;
>> >> >
>> >> > fragment
>> >> > argument_list
>> >> > : LPAREN!? (expr (COMMA! expr)*)? RPAREN!?
>> >> > ;
>> >> >
>> >> > assign_expr
>> >> > : IDENTIFIER ASSIGN expr -> ^(ASSIGN IDENTIFIER expr)
>> >> > ;
>> >> >
>> >> > math_expr
>> >> > : mult_expr ((ADD^|SUB^) mult_expr)*
>> >> > ;
>> >> >
>> >> > mult_expr
>> >> > : pow_expr ((MUL^|DIV^|MOD^) pow_expr)*
>> >> > ;
>> >> >
>> >> > pow_expr
>> >> > : unary_expr ((POW^) unary_expr)*
>> >> > ;
>> >> >
>> >> > unary_expr
>> >> > : NOT? atom
>> >> > ;
>> >> >
>> >> > atom
>> >> > : literal
>> >> > | LPAREN! expr RPAREN!
>> >> > ;
>> >> >
>> >> > literal
>> >> > : HEX_LITERAL
>> >> > | DECIMAL_LITERAL
>> >> > | OCTAL_LITERAL
>> >> > | FLOATING_POINT_LITERAL
>> >> > // | REGEXP_LITERAL
>> >> > | STRING_LITERAL
>> >> > ;
>> >> >
>> >> > terminator
>> >> > : TERMINATOR
>> >> > | EOF
>> >> > ;
>> >> >
>> >> > POW : '^' ;
>> >> > MOD : '%' ;
>> >> > ADD : '+' ;
>> >> > SUB : '-' ;
>> >> > DIV : '/' ;
>> >> > MUL : '*' ;
>> >> > NOT : '!' ;
>> >> >
>> >> > ASSIGN
>> >> > : '='
>> >> > ;
>> >> >
>> >> > LPAREN
>> >> > : '('
>> >> > ;
>> >> >
>> >> > RPAREN
>> >> > : ')'
>> >> > ;
>> >> >
>> >> > COMMA
>> >> > : ','
>> >> > ;
>> >> >
>> >> > DOT : '.' ;
>> >> >
>> >> > CHARACTER_LITERAL
>> >> > : '\'' ( EscapeSequence | ~('\''|'\\') ) '\''
>> >> > ;
>> >> >
>> >> > STRING_LITERAL
>> >> > : '"' ( EscapeSequence | ~('\\'|'"') )* '"'
>> >> > ;
>> >> >
>> >> > /*
>> >> > REGEXP_LITERAL
>> >> > : '/' ( EscapeSequence | ~('\\'|'"') )* '/'
>> >> > ;
>> >> > */
>> >> >
>> >> > HEX_LITERAL : '0' ('x'|'X') HexDigit+ IntegerTypeSuffix? ;
>> >> >
>> >> > DECIMAL_LITERAL : ('0' | '1'..'9' '0'..'9'*) IntegerTypeSuffix? ;
>> >> >
>> >> > OCTAL_LITERAL : '0' ('0'..'7')+ IntegerTypeSuffix? ;
>> >> >
>> >> > fragment
>> >> > HexDigit : ('0'..'9'|'a'..'f'|'A'..'F') ;
>> >> >
>> >> > fragment
>> >> > IntegerTypeSuffix
>> >> > : ('l'|'L')
>> >> > | ('u'|'U') ('l'|'L')?
>> >> > ;
>> >> >
>> >> > FLOATING_POINT_LITERAL
>> >> > : ('0'..'9')+ '.' ('0'..'9')* Exponent? FloatTypeSuffix?
>> >> > | '.' ('0'..'9')+ Exponent? FloatTypeSuffix?
>> >> > | ('0'..'9')+ Exponent? FloatTypeSuffix?
>> >> > ;
>> >> >
>> >> > fragment
>> >> > Exponent : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
>> >> >
>> >> > fragment
>> >> > FloatTypeSuffix : ('f'|'F'|'d'|'D') ;
>> >> >
>> >> > fragment
>> >> > EscapeSequence
>> >> > : '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\'|'/')
>> >> > | OctalEscape
>> >> > ;
>> >> >
>> >> > fragment
>> >> > OctalEscape
>> >> > : '\\' ('0'..'3') ('0'..'7') ('0'..'7')
>> >> > | '\\' ('0'..'7') ('0'..'7')
>> >> > | '\\' ('0'..'7')
>> >> > ;
>> >> >
>> >> > fragment
>> >> > UnicodeEscape
>> >> > : '\\' 'u' HexDigit HexDigit HexDigit HexDigit
>> >> > ;
>> >> > COMMENT
>> >> > : '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
>> >> > ;
>> >> >
>> >> > LINE_COMMENT
>> >> > : '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
>> >> > ;
>> >> >
>> >> > IDENTIFIER
>> >> > : ID_LETTER (ID_LETTER|'0'..'9')*
>> >> > ;
>> >> >
>> >> > fragment
>> >> > ID_LETTER
>> >> > : '$'
>> >> > | 'A'..'Z'
>> >> > | 'a'..'z'
>> >> > | '_'
>> >> > ;
>> >> >
>> >> > TERMINATOR
>> >> > : '\r'? '\n'
>> >> > | ';'
>> >> > ;
>> >> >
>> >> > WS : (' '|'\r'|'\t'|'\u000C') {$channel=HIDDEN;}
>> >> > | '...' '\r'? '\n' {$channel=HIDDEN;}
>> >> > ;
>> >> >
>> >> > /* *************** END *************** */
>> >> >
>> >> > With this grammar, my tests so far pass, and I'm building trees for simple
>> >> > arithmetic operations and the like, including involving variables (e.g. a+1
>> >> > and the like), and method calls are working as I expect, including when
>> >> > passing method call results as args to another method call. But I cannot get
>> >> > input such as "a=b+(c=1)" to parse at all - debugging in AntlrWorks shows me
>> >> > that the problem occurs when the parser sees the "b+", when it throws a
>> >> > NoViableAlt exception.
>> >> >
>> >> > I guessed this was because the parser doesn't see the identifier as an atom,
>> >> > so tries to parse it with the + symbol. So, I tried adding IDENTIFIER as an
>> >> > alternative to the atom rule - but that just broke the parser completely and
>> >> > many of my tests failed with an exception - MismatchedSetException.
>> >> >
>> >> > I've been playing with this for a few days now but no matter what I do, even
>> >> > when I get the type of syntax I mentioned above (the assign statement)
>> >> > working, I invariably break something (or more often, everything! :( ) else.
>> >> > I'm really hoping someone out there will take pity on me and give me some
>> >> > insight into what I'm doing wrong.
>> >> >
>> >> > Thanks in advance!
>> >> > --
>> >> > Ross Bamford - roscoml at gmail.com