[antlr-interest] An ambiguous lexing problem
Owen Jacobson
owen.jacobson at grimoire.ca
Tue Aug 28 15:58:54 PDT 2012
Yow. I had a look through the FAQ, but I somehow missed that one, even
though it has my problem right in the name. Thanks for the clue. I'll
go play with that and see what I come up with - thankfully, LambdaMOO
is considerably simpler than JavaFX…
-o
On 2012-08-28, at 6:46 PM, Jim Idle <jimi at temporal-wave.com> wrote:
> See FAQ:
>
>
> http://tinyurl.com/8t4pnhv
>
>
> Jim
>
>> -----Original Message-----
>> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
>> bounces at antlr.org] On Behalf Of Owen Jacobson
>> Sent: Tuesday, August 28, 2012 3:35 PM
>> To: antlr-interest at antlr.org
>> Subject: [antlr-interest] An ambiguous lexing problem
>>
>> Hi folks,
>>
>> I'm tinkering with an Antlr grammar for the LambdaMOO programming
>> language, largely documented at
>> http://www.hayseed.net/MOO/manuals/ProgrammersManual_toc.html. The
>> existing implementation is in very hoary C, originally built in the
>> early 90s using yacc and a hand-rolled lexer (you can see both on
>> Github: https://github.com/wrog/lambdamoo/blob/master/parser.y contains
>> the yacc grammar and the yylex function required to feed it).
>>
>> The language permits floating-point literals of the forms "1."
>> (evaluating to 1.0) and ".3" (evaluating to 0.3), along with the usual
>> suite of scientific notation options. It also permits "range"
>> expressions for "for" loops and for indexing into lists, which have the
>> form '[' expression '..' expression ']'. The C implementation parses
>>
>> for x in [1..3]
>> ". Loop body .";
>> endfor
>>
>> as a loop over 1, 2, and 3. Given the lexer rules I've worked out so
>> far, my parser turns the "[1..3]" part into a few syntax errors and
>> then applies error recovery to the range, producing a loop from 1. to
>> .3. Obviously, this is not what I want.
>>
>> Experimenting with LambdaMOO's evaluator demonstrates that ranges like
>> "[1..3]" compile successfuly but fail at runtime (one or the other
>> value is interpreted as a float; I don't care which one, as the error
>> is the same either way) and that ranges like "[1....4]" are syntax
>> errors, so it's a pretty constrained special case in the existing
>> lexer.
>>
>> My existing grammar is at the end of this mail. As I'm generating
>> Python from Antlr, I'm running Antlr 3.1.3. I've also included a sample
>> input that demonstrates the problem, which you can feed to the
>> generated parser using
>>
>> python langParser.py --rule program for-range.moo
>>
>> I spent some time staring at Antlr's generated code and experimenting
>> with syntactic and semantic predicates on the INT and FLOAT lexer
>> rules, to no avail. I'd love a pointer in the right direction; I
>> understand why Antlr is picking the token sequence it's picking, since
>> the production for FLOAT really does permit the sequence the parser's
>> receiving and it's the longest match at each stage of lexical analysis,
>> but I don't know how to (or if it's even possible to) special case
>> sequences like this to produce the tokens I really want.
>>
>> It occurred to me to lift numeric literals out of the lexer and into
>> the parser, but the existing implementation doesn't permit whitespace
>> within numeric literals and I don't want to litter the grammar with
>> "and whitespace is allowed here, and here, and here" dummy rules if I
>> take the { $channel = HIDDEN; } off of WHITESPACE. If this is the only
>> way to get what I want, obviously I'll do it, but it feels like there
>> should be something more localized for solving this. (I could also
>> reproduce the hand-rolled lexer, but. yuck.)
>>
>> I also thought about negative lookahead, but Antlr's lexer doesn't
>> support lookahead or lookbehind assertions, so that's out.
>>
>> Help!
>>
>> -o
>>
>> ---- lang.g ----
>> grammar lang;
>>
>> options {
>> language=Python;
>> output=AST;
>> ASTLabelType=CommonTree;
>> }
>>
>> tokens {
>> BLOCK;
>> PROGRAM;
>> STATEMENT;
>> LOOP_TAG;
>> }
>>
>> program
>> : statement* EOF
>> -> ^(PROGRAM statement*)
>> ;
>>
>> statement
>> : simple_statement ';'
>> -> simple_statement
>> | if_statement
>> | while_statement
>> | for_statement
>> | ';'
>> ->
>> ;
>>
>> simple_statement
>> : expression
>> -> ^(STATEMENT expression)
>> | RETURN expression?
>> -> ^(RETURN expression?)
>> ;
>>
>> // This writes out the branches if the IF statement *BACKWARDS*. Be
>> careful!
>> // Doing it the "wrong" way around makes generating jump targets in
>> compile.g // way easier.
>> if_statement
>> : if_part elseif_parts? else_part? ENDIF
>> -> ^(ENDIF else_part? elseif_parts? if_part)
>> ;
>>
>> if_part
>> : IF condition statement*
>> -> ^(IF condition statement*)
>> ;
>>
>> elseif_parts
>> : ELSEIF condition statement* elseif_parts?
>> -> elseif_parts? ^(ELSEIF condition statement*)
>> ;
>>
>> else_part
>> : ELSE statement*
>> -> ^(ELSE statement*)
>> ;
>>
>> while_statement
>> : WHILE condition statement* ENDWHILE
>> -> ^(ENDWHILE condition statement*)
>> | WHILE IDENTIFIER condition statement* ENDWHILE
>> -> ^(ENDWHILE ^(LOOP_TAG IDENTIFIER) condition statement*)
>> ;
>>
>> for_statement
>> : for_list_statement
>> | for_range_statement
>> ;
>>
>> for_list_statement
>> : FOR IDENTIFIER IN '(' expression ')' statement* ENDFOR
>> -> ^(ENDFOR ^(LOOP_TAG IDENTIFIER) expression statement*)
>> ;
>>
>> for_range_statement
>> : FOR IDENTIFIER IN range statement* ENDFOR
>> -> ^(ENDFOR ^(LOOP_TAG IDENTIFIER) range statement*)
>> ;
>>
>> range
>> : RANGE_START start=expression TO end=expression RANGE_END
>> -> ^(RANGE_START $start $end)
>> ;
>>
>> condition
>> : '(' expression ')'
>> -> expression
>> ;
>>
>> expression
>> : root_expression
>> ;
>>
>> root_expression
>> : literal
>> | IDENTIFIER
>> ;
>>
>> literal
>> : INT
>> | FLOAT
>> | STRING
>> | OBJECT_NUM
>> | ERROR
>> | list_literal
>> ;
>>
>> list_literal
>> : LIST_START list_body? LIST_END
>> -> ^(LIST_START list_body?)
>> ;
>>
>> list_body
>> : list_element (',' list_element)*
>> -> list_element*
>> ;
>>
>> list_element
>> : expression
>> | list_splice
>> ;
>>
>> list_splice
>> : '@' expression
>> -> ^('@' expression)
>> ;
>>
>> // --------------------
>> // Fixed-content tokens
>> // --------------------
>>
>> ERROR
>> : 'e_range'
>> | 'e_recmove'
>> | 'e_none'
>> | 'e_propnf'
>> | 'e_quota'
>> | 'e_div'
>> | 'e_args'
>> | 'e_varnf'
>> | 'e_verbnf'
>> | 'e_perm'
>> | 'e_invind'
>> | 'e_nacc'
>> | 'e_type'
>> | 'e_float'
>> | 'e_invarg'
>> | 'e_maxrec'
>> ;
>>
>> ANY: 'any';
>> IF: 'if';
>> ELSEIF: 'elseif';
>> ELSE: 'else';
>> ENDIF: 'endif';
>> WHILE: 'while';
>> ENDWHILE: 'endwhile';
>> FOR: 'for';
>> IN: 'in';
>> ENDFOR: 'endfor';
>> RETURN: 'return';
>>
>> LIST_START: '{';
>> LIST_END: '}';
>> RANGE_START: '[';
>> RANGE_END: ']';
>>
>> DOT: '.';
>> TO: '..';
>>
>> // -----------------------
>> // Variable-content tokens
>> // -----------------------
>>
>> // printable ASCII, minus double quote and backslash.
>> fragment STRING_CHAR
>> : '\u0020'..'\u0021'
>> | '\u0023'..'\u005B'
>> | '\u005D'..'\u007E'
>> ;
>>
>> fragment STRING_ESCAPE
>> : '\\' ('"' | '\\')
>> ;
>>
>> STRING: '"' (STRING_CHAR | STRING_ESCAPE) * '"';
>>
>> fragment DIGIT: '0'..'9';
>> fragment SIGN: '-'?;
>>
>> INT: SIGN DIGIT+;
>>
>> fragment FLOAT_EXPONENT
>> : 'e' ('+'|'-')? DIGIT+
>> ;
>>
>> FLOAT
>> : SIGN DIGIT+ FLOAT_EXPONENT
>> | SIGN DIGIT+ DOT DIGIT* FLOAT_EXPONENT?
>> | SIGN DOT DIGIT+ FLOAT_EXPONENT?
>> ;
>>
>> OBJECT_NUM
>> : '#' '-'? DIGIT+
>> ;
>>
>> fragment IDENT_FIRST_CHAR
>> : 'a'..'z' | '_'
>> ;
>>
>> fragment IDENT_CHAR
>> : IDENT_FIRST_CHAR | DIGIT
>> ;
>>
>> IDENTIFIER
>> : IDENT_FIRST_CHAR IDENT_CHAR*
>> ;
>>
>> // Newlines aren't technically legal in MOO strings, where most code
>> comes // from. However, permitting them in the language means source
>> from files can // be compiled without stripping newlines in advance.
>> WHITESPACE
>> : (' ' | '\t' | '\r' | '\n') { $channel = HIDDEN; }
>> ;
>>
>> COMMENT
>> : '/*' .* '*/' { $channel = HIDDEN; }
>> ;
>>
>> /* eof */
>> ---- for-range.moo ----
>> for x in [1..5]
>> endfor
>>
>> /* This should parse identically to the above. */ for x in [1 .. 5]
>> endfor
>>
>> for y in [6..10]
>> "Non-empty.";
>> endfor
>> /* eof */
>>
>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-
>> email-address
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
More information about the antlr-interest
mailing list