[antlr-interest] An ambiguous lexing problem

Tue Aug 28 15:58:54 PDT 2012

Yow. I had a look through the FAQ, but I somehow missed that one, even
though it has my problem right in the name. Thanks for the clue. I'll
go play with that and see what I come up with - thankfully, LambdaMOO
is considerably simpler than JavaFX…

-o

On 2012-08-28, at 6:46 PM, Jim Idle <jimi at temporal-wave.com> wrote:

> See FAQ:
> 
> 
> http://tinyurl.com/8t4pnhv
> 
> 
> Jim
> 
>> -----Original Message-----
>> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
>> bounces at antlr.org] On Behalf Of Owen Jacobson
>> Sent: Tuesday, August 28, 2012 3:35 PM
>> To: antlr-interest at antlr.org
>> Subject: [antlr-interest] An ambiguous lexing problem
>> 
>> Hi folks,
>> 
>> I'm tinkering with an Antlr grammar for the LambdaMOO programming
>> language, largely documented at
>> http://www.hayseed.net/MOO/manuals/ProgrammersManual_toc.html. The
>> existing implementation is in very hoary C, originally built in the
>> early 90s using yacc and a hand-rolled lexer (you can see both on
>> Github: https://github.com/wrog/lambdamoo/blob/master/parser.y contains
>> the yacc grammar and the yylex function required to feed it).
>> 
>> The language permits floating-point literals of the forms "1."
>> (evaluating to 1.0) and ".3" (evaluating to 0.3), along with the usual
>> suite of scientific notation options. It also permits "range"
>> expressions for "for" loops and for indexing into lists, which have the
>> form '[' expression '..' expression ']'. The C implementation parses
>> 
>> 	for x in [1..3]
>> 		". Loop body .";
>> 	endfor
>> 
>> as a loop over 1, 2, and 3. Given the lexer rules I've worked out so
>> far, my parser turns the "[1..3]" part into a few syntax errors and
>> then applies error recovery to the range, producing a loop from 1. to
>> .3. Obviously, this is not what I want.
>> 
>> Experimenting with LambdaMOO's evaluator demonstrates that ranges like
>> "[1..3]" compile successfuly but fail at runtime (one or the other
>> value is interpreted as a float; I don't care which one, as the error
>> is the same either way) and that ranges like "[1....4]" are syntax
>> errors, so it's a pretty constrained special case in the existing
>> lexer.
>> 
>> My existing grammar is at the end of this mail. As I'm generating
>> Python from Antlr, I'm running Antlr 3.1.3. I've also included a sample
>> input that demonstrates the problem, which you can feed to the
>> generated parser using
>> 
>> 	python langParser.py --rule program for-range.moo
>> 
>> I spent some time staring at Antlr's generated code and experimenting
>> with syntactic and semantic predicates on the INT and FLOAT lexer
>> rules, to no avail. I'd love a pointer in the right direction; I
>> understand why Antlr is picking the token sequence it's picking, since
>> the production for FLOAT really does permit the sequence the parser's
>> receiving and it's the longest match at each stage of lexical analysis,
>> but I don't know how to (or if it's even possible to) special case
>> sequences like this to produce the tokens I really want.
>> 
>> It occurred to me to lift numeric literals out of the lexer and into
>> the parser, but the existing implementation doesn't permit whitespace
>> within numeric literals and I don't want to litter the grammar with
>> "and whitespace is allowed here, and here, and here" dummy rules if I
>> take the { $channel = HIDDEN; } off of WHITESPACE. If this is the only
>> way to get what I want, obviously I'll do it, but it feels like there
>> should be something more localized for solving this. (I could also
>> reproduce the hand-rolled lexer, but. yuck.)
>> 
>> I also thought about negative lookahead, but Antlr's lexer doesn't
>> support lookahead or lookbehind assertions, so that's out.
>> 
>> Help!
>> 
>> -o
>> 
>> ---- lang.g ----
>> grammar lang;
>> 
>> options {
>>    language=Python;
>>    output=AST;
>>    ASTLabelType=CommonTree;
>> }
>> 
>> tokens {
>>    BLOCK;
>>    PROGRAM;
>>    STATEMENT;
>>    LOOP_TAG;
>> }
>> 
>> program
>>    : statement* EOF
>>        -> ^(PROGRAM statement*)
>>    ;
>> 
>> statement
>>    :   simple_statement ';'
>>        -> simple_statement
>>    |   if_statement
>>    |   while_statement
>>    |   for_statement
>>    |   ';'
>>        ->
>>    ;
>> 
>> simple_statement
>>    :   expression
>>        -> ^(STATEMENT expression)
>>    |   RETURN expression?
>>        -> ^(RETURN expression?)
>>    ;
>> 
>> // This writes out the branches if the IF statement *BACKWARDS*. Be
>> careful!
>> // Doing it the "wrong" way around makes generating jump targets in
>> compile.g // way easier.
>> if_statement
>>    :   if_part elseif_parts? else_part? ENDIF
>>        -> ^(ENDIF else_part? elseif_parts? if_part)
>>    ;
>> 
>> if_part
>>    :   IF condition statement*
>>        -> ^(IF condition statement*)
>>    ;
>> 
>> elseif_parts
>>    :   ELSEIF condition statement* elseif_parts?
>>        -> elseif_parts? ^(ELSEIF condition statement*)
>>    ;
>> 
>> else_part
>>    :   ELSE statement*
>>        -> ^(ELSE statement*)
>>    ;
>> 
>> while_statement
>>    :   WHILE condition statement* ENDWHILE
>>        -> ^(ENDWHILE condition statement*)
>>    |   WHILE IDENTIFIER condition statement* ENDWHILE
>>        -> ^(ENDWHILE ^(LOOP_TAG IDENTIFIER) condition statement*)
>>    ;
>> 
>> for_statement
>>    :   for_list_statement
>>    |   for_range_statement
>>    ;
>> 
>> for_list_statement
>>    :   FOR IDENTIFIER IN '(' expression ')' statement* ENDFOR
>>        -> ^(ENDFOR ^(LOOP_TAG IDENTIFIER) expression statement*)
>>    ;
>> 
>> for_range_statement
>>    :   FOR IDENTIFIER IN range statement* ENDFOR
>>        -> ^(ENDFOR ^(LOOP_TAG IDENTIFIER) range statement*)
>>    ;
>> 
>> range
>>    :   RANGE_START start=expression TO end=expression RANGE_END
>>        -> ^(RANGE_START $start $end)
>>    ;
>> 
>> condition
>>    :   '(' expression ')'
>>        -> expression
>>    ;
>> 
>> expression
>>    :   root_expression
>>    ;
>> 
>> root_expression
>>    :   literal
>>    |   IDENTIFIER
>>    ;
>> 
>> literal
>>    :   INT
>>    |   FLOAT
>>    |   STRING
>>    |   OBJECT_NUM
>>    |   ERROR
>>    |   list_literal
>>    ;
>> 
>> list_literal
>>    :   LIST_START list_body? LIST_END
>>        -> ^(LIST_START list_body?)
>>    ;
>> 
>> list_body
>>    :   list_element (',' list_element)*
>>        -> list_element*
>>    ;
>> 
>> list_element
>>    :   expression
>>    |   list_splice
>>    ;
>> 
>> list_splice
>>    :   '@' expression
>>        -> ^('@' expression)
>>    ;
>> 
>> // --------------------
>> // Fixed-content tokens
>> // --------------------
>> 
>> ERROR
>>    :   'e_range'
>>    |   'e_recmove'
>>    |   'e_none'
>>    |   'e_propnf'
>>    |   'e_quota'
>>    |   'e_div'
>>    |   'e_args'
>>    |   'e_varnf'
>>    |   'e_verbnf'
>>    |   'e_perm'
>>    |   'e_invind'
>>    |   'e_nacc'
>>    |   'e_type'
>>    |   'e_float'
>>    |   'e_invarg'
>>    |   'e_maxrec'
>>    ;
>> 
>> ANY: 'any';
>> IF: 'if';
>> ELSEIF: 'elseif';
>> ELSE: 'else';
>> ENDIF: 'endif';
>> WHILE: 'while';
>> ENDWHILE: 'endwhile';
>> FOR: 'for';
>> IN: 'in';
>> ENDFOR: 'endfor';
>> RETURN: 'return';
>> 
>> LIST_START: '{';
>> LIST_END: '}';
>> RANGE_START: '[';
>> RANGE_END: ']';
>> 
>> DOT: '.';
>> TO: '..';
>> 
>> // -----------------------
>> // Variable-content tokens
>> // -----------------------
>> 
>> // printable ASCII, minus double quote and backslash.
>> fragment STRING_CHAR
>>    :   '\u0020'..'\u0021'
>>    |   '\u0023'..'\u005B'
>>    |   '\u005D'..'\u007E'
>>    ;
>> 
>> fragment STRING_ESCAPE
>>    :   '\\' ('"' | '\\')
>>    ;
>> 
>> STRING: '"' (STRING_CHAR | STRING_ESCAPE) * '"';
>> 
>> fragment DIGIT: '0'..'9';
>> fragment SIGN: '-'?;
>> 
>> INT: SIGN DIGIT+;
>> 
>> fragment FLOAT_EXPONENT
>>    :   'e' ('+'|'-')? DIGIT+
>>    ;
>> 
>> FLOAT
>>    :   SIGN DIGIT+ FLOAT_EXPONENT
>>    |   SIGN DIGIT+ DOT DIGIT* FLOAT_EXPONENT?
>>    |   SIGN DOT DIGIT+ FLOAT_EXPONENT?
>>    ;
>> 
>> OBJECT_NUM
>>    :   '#' '-'? DIGIT+
>>    ;
>> 
>> fragment IDENT_FIRST_CHAR
>>    :   'a'..'z' | '_'
>>    ;
>> 
>> fragment IDENT_CHAR
>>    :   IDENT_FIRST_CHAR | DIGIT
>>    ;
>> 
>> IDENTIFIER
>>    :   IDENT_FIRST_CHAR IDENT_CHAR*
>>    ;
>> 
>> // Newlines aren't technically legal in MOO strings, where most code
>> comes // from. However, permitting them in the language means source
>> from files can // be compiled without stripping newlines in advance.
>> WHITESPACE
>>    :   (' ' | '\t' | '\r' | '\n') { $channel = HIDDEN; }
>>    ;
>> 
>> COMMENT
>>    :   '/*' .* '*/' { $channel = HIDDEN; }
>>    ;
>> 
>> /* eof */
>> ---- for-range.moo ----
>> for x in [1..5]
>> endfor
>> 
>> /* This should parse identically to the above. */ for x in [1 .. 5]
>> endfor
>> 
>> for y in [6..10]
>>    "Non-empty.";
>> endfor
>> /* eof */
>> 
>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-
>> email-address
> 
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address