[antlr-interest] An ambiguous lexing problem

Tue Aug 28 15:46:28 PDT 2012

See FAQ:

http://tinyurl.com/8t4pnhv

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Owen Jacobson
> Sent: Tuesday, August 28, 2012 3:35 PM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] An ambiguous lexing problem
>
> Hi folks,
>
> I'm tinkering with an Antlr grammar for the LambdaMOO programming
> language, largely documented at
> http://www.hayseed.net/MOO/manuals/ProgrammersManual_toc.html. The
> existing implementation is in very hoary C, originally built in the
> early 90s using yacc and a hand-rolled lexer (you can see both on
> Github: https://github.com/wrog/lambdamoo/blob/master/parser.y contains
> the yacc grammar and the yylex function required to feed it).
>
> The language permits floating-point literals of the forms "1."
> (evaluating to 1.0) and ".3" (evaluating to 0.3), along with the usual
> suite of scientific notation options. It also permits "range"
> expressions for "for" loops and for indexing into lists, which have the
> form '[' expression '..' expression ']'. The C implementation parses
>
> 	for x in [1..3]
> 		". Loop body .";
> 	endfor
>
> as a loop over 1, 2, and 3. Given the lexer rules I've worked out so
> far, my parser turns the "[1..3]" part into a few syntax errors and
> then applies error recovery to the range, producing a loop from 1. to
> .3. Obviously, this is not what I want.
>
> Experimenting with LambdaMOO's evaluator demonstrates that ranges like
> "[1..3]" compile successfuly but fail at runtime (one or the other
> value is interpreted as a float; I don't care which one, as the error
> is the same either way) and that ranges like "[1....4]" are syntax
> errors, so it's a pretty constrained special case in the existing
> lexer.
>
> My existing grammar is at the end of this mail. As I'm generating
> Python from Antlr, I'm running Antlr 3.1.3. I've also included a sample
> input that demonstrates the problem, which you can feed to the
> generated parser using
>
> 	python langParser.py --rule program for-range.moo
>
> I spent some time staring at Antlr's generated code and experimenting
> with syntactic and semantic predicates on the INT and FLOAT lexer
> rules, to no avail. I'd love a pointer in the right direction; I
> understand why Antlr is picking the token sequence it's picking, since
> the production for FLOAT really does permit the sequence the parser's
> receiving and it's the longest match at each stage of lexical analysis,
> but I don't know how to (or if it's even possible to) special case
> sequences like this to produce the tokens I really want.
>
> It occurred to me to lift numeric literals out of the lexer and into
> the parser, but the existing implementation doesn't permit whitespace
> within numeric literals and I don't want to litter the grammar with
> "and whitespace is allowed here, and here, and here" dummy rules if I
> take the { $channel = HIDDEN; } off of WHITESPACE. If this is the only
> way to get what I want, obviously I'll do it, but it feels like there
> should be something more localized for solving this. (I could also
> reproduce the hand-rolled lexer, but. yuck.)
>
> I also thought about negative lookahead, but Antlr's lexer doesn't
> support lookahead or lookbehind assertions, so that's out.
>
> Help!
>
> -o
>
> ---- lang.g ----
> grammar lang;
>
> options {
>     language=Python;
>     output=AST;
>     ASTLabelType=CommonTree;
> }
>
> tokens {
>     BLOCK;
>     PROGRAM;
>     STATEMENT;
>     LOOP_TAG;
> }
>
> program
>     : statement* EOF
>         -> ^(PROGRAM statement*)
>     ;
>
> statement
>     :   simple_statement ';'
>         -> simple_statement
>     |   if_statement
>     |   while_statement
>     |   for_statement
>     |   ';'
>         ->
>     ;
>
> simple_statement
>     :   expression
>         -> ^(STATEMENT expression)
>     |   RETURN expression?
>         -> ^(RETURN expression?)
>     ;
>
> // This writes out the branches if the IF statement *BACKWARDS*. Be
> careful!
> // Doing it the "wrong" way around makes generating jump targets in
> compile.g // way easier.
> if_statement
>     :   if_part elseif_parts? else_part? ENDIF
>         -> ^(ENDIF else_part? elseif_parts? if_part)
>     ;
>
> if_part
>     :   IF condition statement*
>         -> ^(IF condition statement*)
>     ;
>
> elseif_parts
>     :   ELSEIF condition statement* elseif_parts?
>         -> elseif_parts? ^(ELSEIF condition statement*)
>     ;
>
> else_part
>     :   ELSE statement*
>         -> ^(ELSE statement*)
>     ;
>
> while_statement
>     :   WHILE condition statement* ENDWHILE
>         -> ^(ENDWHILE condition statement*)
>     |   WHILE IDENTIFIER condition statement* ENDWHILE
>         -> ^(ENDWHILE ^(LOOP_TAG IDENTIFIER) condition statement*)
>     ;
>
> for_statement
>     :   for_list_statement
>     |   for_range_statement
>     ;
>
> for_list_statement
>     :   FOR IDENTIFIER IN '(' expression ')' statement* ENDFOR
>         -> ^(ENDFOR ^(LOOP_TAG IDENTIFIER) expression statement*)
>     ;
>
> for_range_statement
>     :   FOR IDENTIFIER IN range statement* ENDFOR
>         -> ^(ENDFOR ^(LOOP_TAG IDENTIFIER) range statement*)
>     ;
>
> range
>     :   RANGE_START start=expression TO end=expression RANGE_END
>         -> ^(RANGE_START $start $end)
>     ;
>
> condition
>     :   '(' expression ')'
>         -> expression
>     ;
>
> expression
>     :   root_expression
>     ;
>
> root_expression
>     :   literal
>     |   IDENTIFIER
>     ;
>
> literal
>     :   INT
>     |   FLOAT
>     |   STRING
>     |   OBJECT_NUM
>     |   ERROR
>     |   list_literal
>     ;
>
> list_literal
>     :   LIST_START list_body? LIST_END
>         -> ^(LIST_START list_body?)
>     ;
>
> list_body
>     :   list_element (',' list_element)*
>         -> list_element*
>     ;
>
> list_element
>     :   expression
>     |   list_splice
>     ;
>
> list_splice
>     :   '@' expression
>         -> ^('@' expression)
>     ;
>
> // --------------------
> // Fixed-content tokens
> // --------------------
>
> ERROR
>     :   'e_range'
>     |   'e_recmove'
>     |   'e_none'
>     |   'e_propnf'
>     |   'e_quota'
>     |   'e_div'
>     |   'e_args'
>     |   'e_varnf'
>     |   'e_verbnf'
>     |   'e_perm'
>     |   'e_invind'
>     |   'e_nacc'
>     |   'e_type'
>     |   'e_float'
>     |   'e_invarg'
>     |   'e_maxrec'
>     ;
>
> ANY: 'any';
> IF: 'if';
> ELSEIF: 'elseif';
> ELSE: 'else';
> ENDIF: 'endif';
> WHILE: 'while';
> ENDWHILE: 'endwhile';
> FOR: 'for';
> IN: 'in';
> ENDFOR: 'endfor';
> RETURN: 'return';
>
> LIST_START: '{';
> LIST_END: '}';
> RANGE_START: '[';
> RANGE_END: ']';
>
> DOT: '.';
> TO: '..';
>
> // -----------------------
> // Variable-content tokens
> // -----------------------
>
> // printable ASCII, minus double quote and backslash.
> fragment STRING_CHAR
>     :   '\u0020'..'\u0021'
>     |   '\u0023'..'\u005B'
>     |   '\u005D'..'\u007E'
>     ;
>
> fragment STRING_ESCAPE
>     :   '\\' ('"' | '\\')
>     ;
>
> STRING: '"' (STRING_CHAR | STRING_ESCAPE) * '"';
>
> fragment DIGIT: '0'..'9';
> fragment SIGN: '-'?;
>
> INT: SIGN DIGIT+;
>
> fragment FLOAT_EXPONENT
>     :   'e' ('+'|'-')? DIGIT+
>     ;
>
> FLOAT
>     :   SIGN DIGIT+ FLOAT_EXPONENT
>     |   SIGN DIGIT+ DOT DIGIT* FLOAT_EXPONENT?
>     |   SIGN DOT DIGIT+ FLOAT_EXPONENT?
>     ;
>
> OBJECT_NUM
>     :   '#' '-'? DIGIT+
>     ;
>
> fragment IDENT_FIRST_CHAR
>     :   'a'..'z' | '_'
>     ;
>
> fragment IDENT_CHAR
>     :   IDENT_FIRST_CHAR | DIGIT
>     ;
>
> IDENTIFIER
>     :   IDENT_FIRST_CHAR IDENT_CHAR*
>     ;
>
> // Newlines aren't technically legal in MOO strings, where most code
> comes // from. However, permitting them in the language means source
> from files can // be compiled without stripping newlines in advance.
> WHITESPACE
>     :   (' ' | '\t' | '\r' | '\n') { $channel = HIDDEN; }
>     ;
>
> COMMENT
>     :   '/*' .* '*/' { $channel = HIDDEN; }
>     ;
>
> /* eof */
> ---- for-range.moo ----
> for x in [1..5]
> endfor
>
> /* This should parse identically to the above. */ for x in [1 .. 5]
> endfor
>
> for y in [6..10]
>     "Non-empty.";
> endfor
> /* eof */
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-
> email-address