[antlr-interest] An ambiguous lexing problem
Jim Idle
jimi at temporal-wave.com
Tue Aug 28 15:46:28 PDT 2012
See FAQ:
http://tinyurl.com/8t4pnhv
Jim
> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Owen Jacobson
> Sent: Tuesday, August 28, 2012 3:35 PM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] An ambiguous lexing problem
>
> Hi folks,
>
> I'm tinkering with an Antlr grammar for the LambdaMOO programming
> language, largely documented at
> http://www.hayseed.net/MOO/manuals/ProgrammersManual_toc.html. The
> existing implementation is in very hoary C, originally built in the
> early 90s using yacc and a hand-rolled lexer (you can see both on
> Github: https://github.com/wrog/lambdamoo/blob/master/parser.y contains
> the yacc grammar and the yylex function required to feed it).
>
> The language permits floating-point literals of the forms "1."
> (evaluating to 1.0) and ".3" (evaluating to 0.3), along with the usual
> suite of scientific notation options. It also permits "range"
> expressions for "for" loops and for indexing into lists, which have the
> form '[' expression '..' expression ']'. The C implementation parses
>
> for x in [1..3]
> ". Loop body .";
> endfor
>
> as a loop over 1, 2, and 3. Given the lexer rules I've worked out so
> far, my parser turns the "[1..3]" part into a few syntax errors and
> then applies error recovery to the range, producing a loop from 1. to
> .3. Obviously, this is not what I want.
>
> Experimenting with LambdaMOO's evaluator demonstrates that ranges like
> "[1..3]" compile successfuly but fail at runtime (one or the other
> value is interpreted as a float; I don't care which one, as the error
> is the same either way) and that ranges like "[1....4]" are syntax
> errors, so it's a pretty constrained special case in the existing
> lexer.
>
> My existing grammar is at the end of this mail. As I'm generating
> Python from Antlr, I'm running Antlr 3.1.3. I've also included a sample
> input that demonstrates the problem, which you can feed to the
> generated parser using
>
> python langParser.py --rule program for-range.moo
>
> I spent some time staring at Antlr's generated code and experimenting
> with syntactic and semantic predicates on the INT and FLOAT lexer
> rules, to no avail. I'd love a pointer in the right direction; I
> understand why Antlr is picking the token sequence it's picking, since
> the production for FLOAT really does permit the sequence the parser's
> receiving and it's the longest match at each stage of lexical analysis,
> but I don't know how to (or if it's even possible to) special case
> sequences like this to produce the tokens I really want.
>
> It occurred to me to lift numeric literals out of the lexer and into
> the parser, but the existing implementation doesn't permit whitespace
> within numeric literals and I don't want to litter the grammar with
> "and whitespace is allowed here, and here, and here" dummy rules if I
> take the { $channel = HIDDEN; } off of WHITESPACE. If this is the only
> way to get what I want, obviously I'll do it, but it feels like there
> should be something more localized for solving this. (I could also
> reproduce the hand-rolled lexer, but. yuck.)
>
> I also thought about negative lookahead, but Antlr's lexer doesn't
> support lookahead or lookbehind assertions, so that's out.
>
> Help!
>
> -o
>
> ---- lang.g ----
> grammar lang;
>
> options {
> language=Python;
> output=AST;
> ASTLabelType=CommonTree;
> }
>
> tokens {
> BLOCK;
> PROGRAM;
> STATEMENT;
> LOOP_TAG;
> }
>
> program
> : statement* EOF
> -> ^(PROGRAM statement*)
> ;
>
> statement
> : simple_statement ';'
> -> simple_statement
> | if_statement
> | while_statement
> | for_statement
> | ';'
> ->
> ;
>
> simple_statement
> : expression
> -> ^(STATEMENT expression)
> | RETURN expression?
> -> ^(RETURN expression?)
> ;
>
> // This writes out the branches if the IF statement *BACKWARDS*. Be
> careful!
> // Doing it the "wrong" way around makes generating jump targets in
> compile.g // way easier.
> if_statement
> : if_part elseif_parts? else_part? ENDIF
> -> ^(ENDIF else_part? elseif_parts? if_part)
> ;
>
> if_part
> : IF condition statement*
> -> ^(IF condition statement*)
> ;
>
> elseif_parts
> : ELSEIF condition statement* elseif_parts?
> -> elseif_parts? ^(ELSEIF condition statement*)
> ;
>
> else_part
> : ELSE statement*
> -> ^(ELSE statement*)
> ;
>
> while_statement
> : WHILE condition statement* ENDWHILE
> -> ^(ENDWHILE condition statement*)
> | WHILE IDENTIFIER condition statement* ENDWHILE
> -> ^(ENDWHILE ^(LOOP_TAG IDENTIFIER) condition statement*)
> ;
>
> for_statement
> : for_list_statement
> | for_range_statement
> ;
>
> for_list_statement
> : FOR IDENTIFIER IN '(' expression ')' statement* ENDFOR
> -> ^(ENDFOR ^(LOOP_TAG IDENTIFIER) expression statement*)
> ;
>
> for_range_statement
> : FOR IDENTIFIER IN range statement* ENDFOR
> -> ^(ENDFOR ^(LOOP_TAG IDENTIFIER) range statement*)
> ;
>
> range
> : RANGE_START start=expression TO end=expression RANGE_END
> -> ^(RANGE_START $start $end)
> ;
>
> condition
> : '(' expression ')'
> -> expression
> ;
>
> expression
> : root_expression
> ;
>
> root_expression
> : literal
> | IDENTIFIER
> ;
>
> literal
> : INT
> | FLOAT
> | STRING
> | OBJECT_NUM
> | ERROR
> | list_literal
> ;
>
> list_literal
> : LIST_START list_body? LIST_END
> -> ^(LIST_START list_body?)
> ;
>
> list_body
> : list_element (',' list_element)*
> -> list_element*
> ;
>
> list_element
> : expression
> | list_splice
> ;
>
> list_splice
> : '@' expression
> -> ^('@' expression)
> ;
>
> // --------------------
> // Fixed-content tokens
> // --------------------
>
> ERROR
> : 'e_range'
> | 'e_recmove'
> | 'e_none'
> | 'e_propnf'
> | 'e_quota'
> | 'e_div'
> | 'e_args'
> | 'e_varnf'
> | 'e_verbnf'
> | 'e_perm'
> | 'e_invind'
> | 'e_nacc'
> | 'e_type'
> | 'e_float'
> | 'e_invarg'
> | 'e_maxrec'
> ;
>
> ANY: 'any';
> IF: 'if';
> ELSEIF: 'elseif';
> ELSE: 'else';
> ENDIF: 'endif';
> WHILE: 'while';
> ENDWHILE: 'endwhile';
> FOR: 'for';
> IN: 'in';
> ENDFOR: 'endfor';
> RETURN: 'return';
>
> LIST_START: '{';
> LIST_END: '}';
> RANGE_START: '[';
> RANGE_END: ']';
>
> DOT: '.';
> TO: '..';
>
> // -----------------------
> // Variable-content tokens
> // -----------------------
>
> // printable ASCII, minus double quote and backslash.
> fragment STRING_CHAR
> : '\u0020'..'\u0021'
> | '\u0023'..'\u005B'
> | '\u005D'..'\u007E'
> ;
>
> fragment STRING_ESCAPE
> : '\\' ('"' | '\\')
> ;
>
> STRING: '"' (STRING_CHAR | STRING_ESCAPE) * '"';
>
> fragment DIGIT: '0'..'9';
> fragment SIGN: '-'?;
>
> INT: SIGN DIGIT+;
>
> fragment FLOAT_EXPONENT
> : 'e' ('+'|'-')? DIGIT+
> ;
>
> FLOAT
> : SIGN DIGIT+ FLOAT_EXPONENT
> | SIGN DIGIT+ DOT DIGIT* FLOAT_EXPONENT?
> | SIGN DOT DIGIT+ FLOAT_EXPONENT?
> ;
>
> OBJECT_NUM
> : '#' '-'? DIGIT+
> ;
>
> fragment IDENT_FIRST_CHAR
> : 'a'..'z' | '_'
> ;
>
> fragment IDENT_CHAR
> : IDENT_FIRST_CHAR | DIGIT
> ;
>
> IDENTIFIER
> : IDENT_FIRST_CHAR IDENT_CHAR*
> ;
>
> // Newlines aren't technically legal in MOO strings, where most code
> comes // from. However, permitting them in the language means source
> from files can // be compiled without stripping newlines in advance.
> WHITESPACE
> : (' ' | '\t' | '\r' | '\n') { $channel = HIDDEN; }
> ;
>
> COMMENT
> : '/*' .* '*/' { $channel = HIDDEN; }
> ;
>
> /* eof */
> ---- for-range.moo ----
> for x in [1..5]
> endfor
>
> /* This should parse identically to the above. */ for x in [1 .. 5]
> endfor
>
> for y in [6..10]
> "Non-empty.";
> endfor
> /* eof */
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-
> email-address
More information about the antlr-interest
mailing list