[antlr-interest] ambiguous lexer tokens

Thu Jun 28 03:20:28 PDT 2007

On 28/6/2007, at 11:54, Torsten Curdt wrote:

> On 28.06.2007, at 11:13, Wincent Colaiuta wrote:
>
>> On 27/6/2007, at 22:44, Torsten Curdt wrote:
>>
>>> I would like to write a grammar for the following output:
>>>
>>>  drwxr-xr-x   23 tcurdt  tcurdt    782 Jun 24 22:54 ..
>>>  -rw-r--r--    1 tcurdt  tcurdt  18545 Nov  1  2006  ASMContentHandler.Rule.html
>>>
>>> Of course that means that the tokens (TYPE/MODS/INT/NAME/HOUR/YEAR)
>>> for the lexer are ambiguous.
>>> What should such a grammar look like? Any pointers?
>>
>> I think you have a number of options:
>>
>> 1. Given that many of the tokens look the same, don't try to  
>> differentiate between them in the lexer. Instead handle everything  
>> in the parser.
>
> OK
>
>> 2. Use predicates in the lexer to turn alternatives on and off
>> depending on which "column" you're in (i.e. make a
>> context-sensitive lexer).
>
> Could you give an example of what that would look like?

Well, here's one (untested) idea: modify your WS rule to increment a
"column" counter whenever a run of spaces is seen. You'd have to
declare the counter in your @lexer::members section (exactly how you
declare and initialize that variable depends on your target
language):

   WS : ' '+ { column++; };

And then modify your NEWLINE rule to reset the column counter:

   NEWLINE : '\r'? '\n' { column = 0; };

Now you can prefix your rules with gated semantic predicates,
effectively turning them on/off depending on the input column. For
example, to apply your INT rule only in columns 4 and 10 (illustrative
numbers; check them against your actual input):

   INT : { column == 4 || column == 10 }?=> '0'..'9'+ ;

And so on... Obviously, if columns are whitespace-delimited you need
to roll your "TYPE" and "MODS" rules into one. Also remember that
your final column (the file name) may itself contain whitespace, so
to scan filenames you probably want a rule like:

   FILENAME : { column > 8 }?=> ~('\n' | '\r')+ ;

Or alternatively, make your WS rule only apply in the leftmost  
columns and apply your FILENAME rule in column 9 only:

   WS : { column < 9 }?=> ' '+ { column++; };
   FILENAME : { column == 9 }?=> ~('\n' | '\r')+ ;
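
To make the intent of those two gated rules concrete, here is the same
column-counting logic written out by hand in Java, with no ANTLR
involved. It is only a sketch under assumptions: the names
ColumnScanner, Kind, Token and FILENAME_COLUMN are made up for
illustration, each call handles one line (which stands in for the
NEWLINE reset), and the line is assumed to start with whitespace as in
the sample, so that the filename lands in column 9.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written stand-in for the column-gated rules (illustrative). */
class ColumnScanner {
    /** Hypothetical token kinds; not taken from the original grammar. */
    enum Kind { FIELD, FILENAME }

    static final class Token {
        final Kind kind;
        final int column;
        final String text;

        Token(Kind kind, int column, String text) {
            this.kind = kind;
            this.column = column;
            this.text = text;
        }
    }

    /** Assumption: lines start with whitespace, so the name is column 9. */
    static final int FILENAME_COLUMN = 9;

    /** Scans one line (no line terminator); column starts at 0 per call. */
    static List<Token> scan(String line) {
        List<Token> tokens = new ArrayList<>();
        int column = 0;
        int i = 0;
        int n = line.length();
        while (i < n) {
            if (line.charAt(i) == ' ' && column < FILENAME_COLUMN) {
                // Mirrors: WS : { column < 9 }?=> ' '+ { column++; };
                while (i < n && line.charAt(i) == ' ') {
                    i++;
                }
                column++;
            } else if (column == FILENAME_COLUMN) {
                // Mirrors: FILENAME : { column == 9 }?=> ~('\n' | '\r')+ ;
                tokens.add(new Token(Kind.FILENAME, column, line.substring(i)));
                i = n;
            } else {
                // Any other column: one run of non-space characters.
                int start = i;
                while (i < n && line.charAt(i) != ' ') {
                    i++;
                }
                tokens.add(new Token(Kind.FIELD, column, line.substring(start, i)));
            }
        }
        return tokens;
    }
}
```

Because spaces stop counting as separators once column 9 is reached,
a name such as "my file.html" comes back as a single FILENAME token.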

So I think this could be made to work (although I'm not sure how
you'd handle filenames with embedded newlines), but it starts to look
pretty complex (look at the source code for the generated lexer). At
that point it seems simpler to just write a small parser by hand...
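
For comparison, a hand-written parser for this kind of listing can be
very small. The following Java sketch is one possible shape, not a
tested solution: the class LsLine and its field names are invented
here, and it assumes exactly eight whitespace-separated fields before
the file name, as in the two sample lines.

```java
/** One parsed line of the listing; all names here are hypothetical. */
class LsLine {
    final String mode;       // e.g. "drwxr-xr-x" (TYPE and MODS together)
    final String links;
    final String owner;
    final String group;
    final String size;
    final String month;
    final String day;
    final String timeOrYear; // "22:54" or "2006"
    final String name;       // may contain spaces

    private LsLine(String[] f) {
        mode = f[0];
        links = f[1];
        owner = f[2];
        group = f[3];
        size = f[4];
        month = f[5];
        day = f[6];
        timeOrYear = f[7];
        name = f[8];
    }

    /** Parses a single line; throws if it does not have nine fields. */
    static LsLine parse(String line) {
        // Drop leading whitespace only, so split() yields no empty
        // first element and the tail of the file name is untouched.
        String trimmed = line.replaceFirst("^\\s+", "");
        // A limit of 9 splits at most 8 times; the ninth element is
        // the rest of the line, embedded spaces included.
        String[] fields = trimmed.split("\\s+", 9);
        if (fields.length != 9 || fields[8].isEmpty()) {
            throw new IllegalArgumentException("unexpected line: " + line);
        }
        return new LsLine(fields);
    }
}
```

Calling LsLine.parse() on each line of the output gives the fields by
position, so the lexer never has to decide whether "2006" is an INT or
a YEAR. Like the grammar above, it does not cope with file names that
contain newlines.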

Cheers,
Wincent