[antlr-interest] ambiguous lexer tokens
Wincent Colaiuta
win at wincent.com
Thu Jun 28 03:20:28 PDT 2007
On 28/6/2007, at 11:54, Torsten Curdt wrote:
> On 28.06.2007, at 11:13, Wincent Colaiuta wrote:
>
>> On 27/6/2007, at 22:44, Torsten Curdt wrote:
>>
>>> I would like to write a grammar for the following output:
>>>
>>> drwxr-xr-x 23 tcurdt tcurdt 782 Jun 24 22:54 ..
>>> -rw-r--r-- 1 tcurdt tcurdt 18545 Nov 1 2006 ASMContentHandler.Rule.html
>>>
>>> Of course that means that the tokens (TYPE/MODS/INT/NAME/HOUR/
>>> YEAR) for the lexer are ambiguous.
>>> What should such a grammar look like? Pointers?
>>
>> I think you have a number of options:
>>
>> 1. Given that many of the tokens look the same, don't try to
>> differentiate between them in the lexer. Instead handle everything
>> in the parser.
>
> OK
>
>> 2. Use predicates in the lexer to turn alternatives on and off
>> depending on which "column" you're in (ie. make a context-
>> sensitive lexer).
>
> Could you give an example of what that would look like?

Well, here's one (untested) idea: modify your WS rule to increment a
"column" counter whenever a run of spaces is seen. You'd have to
declare the counter in your @lexer::members section, starting at 1
for the first field (exactly how you declare and initialize that
variable depends on your target language):

    WS : ' '+ { column++; };

And then modify your NEWLINE rule to reset the column counter to its
starting value:

    NEWLINE : '\r'? '\n' { column = 1; };

Now you can prefix your rules with gated semantic predicates,
effectively turning them on or off depending on the input column. For
example, counting fields from 1 in the sample output, you only want
your INT rule to apply in columns 2, 5 and 7 (link count, size and
day of month):

    INT : { column == 2 || column == 5 || column == 7 }?=> '0'..'9'+ ;
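
To see the gating idea outside ANTLR, here is a minimal sketch in
plain Python. It is not what ANTLR generates; the RULES table, the
tokenize function and the WORD token are names made up for this
illustration, and it assumes fields are counted from 1 with INT
enabled only in columns 2, 5 and 7:

```python
import re

# Hypothetical rule table: (token name, pattern, predicate on the column).
# Each predicate plays the role of a gated semantic predicate: the rule
# is only tried when the predicate is true for the current column.
RULES = [
    ("INT", re.compile(r"[0-9]+"), lambda col: col in (2, 5, 7)),
    ("WORD", re.compile(r"[^ \r\n]+"), lambda col: True),
]
WS = re.compile(r" +")
NEWLINE = re.compile(r"\r?\n")


def tokenize(text):
    """Yield (name, lexeme, column) triples, counting fields from 1."""
    pos, column = 0, 1
    while pos < len(text):
        m = NEWLINE.match(text, pos)
        if m:
            column = 1      # NEWLINE resets the counter
            pos = m.end()
            continue
        m = WS.match(text, pos)
        if m:
            column += 1     # a run of spaces advances the counter
            pos = m.end()
            continue
        for name, pattern, enabled in RULES:
            m = pattern.match(text, pos) if enabled(column) else None
            if m:
                yield name, m.group(), column
                pos = m.end()
                break
        else:
            raise ValueError(f"no rule matches at offset {pos}")
```

Calling list(tokenize(line)) on one line of the listing gives the
token stream; a digit string such as the "22:54" field in column 8
falls through to WORD because INT is switched off there.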

And so on. Obviously, if columns are whitespace-delimited you need to
roll your "TYPE" and "MODS" rules into one. Also remember that your
final column (the file name) may itself contain whitespace, so to
scan file names you probably want a rule like:

    FILENAME : { column > 8 }?=> ~('\n' | '\r')+ ;

Or alternatively, make your WS rule only apply in the leftmost
columns and apply your FILENAME rule in column 9 only:

    WS : { column < 9 }?=> ' '+ { column++; };
    FILENAME : { column == 9 }?=> ~('\n' | '\r')+ ;
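
The "whitespace only separates the first eight fields" idea can also
be sketched in plain Python with str.split and its maxsplit argument.
This is only an illustration outside ANTLR: split_listing_line is an
invented name, and the nine-field layout is assumed from the sample
output quoted above:

```python
def split_listing_line(line):
    """Split one `ls -l` style line into 9 fields.

    Whitespace separates only the first 8 fields; everything after
    that, embedded spaces included, is kept together as the file name.
    """
    fields = line.rstrip("\r\n").split(None, 8)
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    return fields
```

One limitation of this sketch: a file name that begins with spaces
would lose them, because split discards the whitespace in front of
the last field.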

So I think this could be made to work (although I'm not sure how
you'd handle file names with embedded newlines), but it starts to
look pretty complex (look at the source code of the generated lexer).
At that point it seems simpler to just write a small parser by hand.
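
For comparison, a hand-written parser for this format might look
roughly like the following Python sketch. Everything here is an
assumption made for illustration: the Entry record, the parse_line
function and the exact regular expression are invented, and the
pattern only covers lines shaped like the two samples in this thread
(a one-character type, nine permission characters, and either an
HH:MM time or a four-digit year):

```python
import re
from dataclasses import dataclass


# Hypothetical record type; the fields loosely follow the token names
# used in the thread (TYPE, MODS, INT, NAME, HOUR, YEAR).
@dataclass
class Entry:
    kind: str
    mods: str
    links: int
    owner: str
    group: str
    size: int
    month: str
    day: int
    hour_or_year: str
    name: str


# One pattern for the whole line; the name group takes the rest of
# the line, so embedded spaces in file names are preserved.
LINE = re.compile(
    r"(?P<kind>[-dlbcps])(?P<mods>\S{9})\s+"
    r"(?P<links>\d+)\s+(?P<owner>\S+)\s+(?P<group>\S+)\s+"
    r"(?P<size>\d+)\s+(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+"
    r"(?P<hour_or_year>\d{1,2}:\d{2}|\d{4})\s"
    r"(?P<name>.+)"
)


def parse_line(line):
    """Parse one listing line into an Entry, or raise ValueError."""
    m = LINE.fullmatch(line.rstrip("\r\n"))
    if m is None:
        raise ValueError(f"unrecognised line: {line!r}")
    g = m.groupdict()
    return Entry(
        g["kind"], g["mods"], int(g["links"]), g["owner"], g["group"],
        int(g["size"]), g["month"], int(g["day"]), g["hour_or_year"],
        g["name"],
    )
```

Like the lexer approach, this works one line at a time, so it does
not handle file names with embedded newlines either.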
Cheers,
Wincent