[antlr-interest] ambiguous lexer tokens

Torsten Curdt tcurdt at vafer.org
Thu Jun 28 02:54:20 PDT 2007


On 28.06.2007, at 11:13, Wincent Colaiuta wrote:

> El 27/6/2007, a las 22:44, Torsten Curdt escribió:
>
>> I would like to write a grammar for the following output:
>>
>>  drwxr-xr-x   23 tcurdt  tcurdt    782 Jun 24 22:54 ..
>>  -rw-r--r--    1 tcurdt  tcurdt  18545 Nov  1  2006 ASMContentHandler.Rule.html
>>
>> Of course that means that the tokens (TYPE/MODS/INT/NAME/HOUR/YEAR)
>> for the lexer are ambiguous.
>> What should such a grammar look like? Pointers?
>
> I think you have a number of options:
>
> 1. Given that many of the tokens look the same, don't try to  
> differentiate between them in the lexer. Instead handle everything  
> in the parser.

OK

> 2. Use predicates in the lexer to turn alternatives on and off  
> depending on which "column" you're in (i.e. make a context-sensitive  
> lexer).

Could you give an example of what that would look like?
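
To make the idea concrete outside of ANTLR: a context-sensitive lexer lets
the current column decide which token type a piece of text gets. Below is
a small Python sketch under that assumption. The token names follow the
ones listed above, MONTH is an extra name added for illustration, and
COLUMNS and lex_line are hypothetical. In ANTLR the same decision would be
expressed with predicates on the lexer rules, which is not shown here.

```python
# Token type expected in each whitespace-separated column of an `ls -l` line.
# The first column is split further into TYPE and MODS below, and the last
# fixed column becomes HOUR or YEAR depending on its content.
COLUMNS = ["PERMS", "INT", "NAME", "NAME", "INT", "MONTH", "INT", "HOUR_OR_YEAR"]


def lex_line(line):
    """Return a list of (token_type, text) pairs, typed by column position."""
    # Split off the fixed columns; whatever remains is the file name,
    # which may itself contain spaces.
    parts = line.split(None, len(COLUMNS))
    if len(parts) != len(COLUMNS) + 1:
        raise ValueError("unexpected number of columns: %r" % line)

    tokens = []
    for kind, text in zip(COLUMNS, parts):
        if kind == "PERMS":
            tokens.append(("TYPE", text[0]))
            tokens.append(("MODS", text[1:]))
        elif kind == "HOUR_OR_YEAR":
            tokens.append(("HOUR" if ":" in text else "YEAR", text))
        else:
            tokens.append((kind, text))
    tokens.append(("NAME", parts[-1].rstrip()))
    return tokens
```

The same text ("23", "782", "24") ends up as INT purely because of where
it sits, and "tcurdt" versus the file name is likewise decided by position
rather than by what the characters look like.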

> 3. Don't use ANTLR for this task. The input is so limited and  
> regular that it may be quicker to just write something by hand.

I was tempted, as it should be easy to do with just a regular
expression. But I wanted to see whether ANTLR would be suitable for it too.
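
For reference, a minimal Python sketch of the regular-expression approach,
assuming each entry sits on a single line in the `ls -l` layout quoted
above. The pattern, the group names, and parse_ls_line are illustrative
and not taken from any existing tool; extra columns such as ACL markers
or symlink targets are not handled.

```python
import re

# One `ls -l` entry: type, mode bits, link count, owner, group, size,
# month, day, then either HH:MM or a four-digit year, then the file name.
LS_LINE = re.compile(
    r"^\s*(?P<type>[-dlbcps])"
    r"(?P<mods>[-rwxsStT]{9})\s+"
    r"(?P<links>\d+)\s+"
    r"(?P<owner>\S+)\s+"
    r"(?P<group>\S+)\s+"
    r"(?P<size>\d+)\s+"
    r"(?P<month>[A-Z][a-z]{2})\s+"
    r"(?P<day>\d{1,2})\s+"
    r"(?:(?P<hour>\d{1,2}:\d{2})|(?P<year>\d{4}))\s+"
    r"(?P<name>.+?)\s*$"
)


def parse_ls_line(line):
    """Return a dict of fields for one listing line, or None if it does not match."""
    m = LS_LINE.match(line)
    return m.groupdict() if m else None
```

The HOUR/YEAR ambiguity is resolved by the alternation: only one of the
"hour" and "year" groups is set for any given line, and the other is None.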

> I personally would go with "3" in this case because I think you are  
> much more likely to come up with a correct parser by hand; ANTLR is  
> a very complex tool and it can deviate from your expectations in  
> incredibly subtle and hard-to-understand ways.

I used v2 before, but in that case the lexing was much more obvious.

Thanks a lot for your input.

cheers
--
Torsten



