[antlr-interest] ambiguous lexer tokens

Thu Jun 28 02:13:01 PDT 2007

On 27/6/2007, at 22:44, Torsten Curdt wrote:

> I would like to write a grammar for the following output:
>
>  drwxr-xr-x   23 tcurdt  tcurdt    782 Jun 24 22:54 ..
>  -rw-r--r--    1 tcurdt  tcurdt  18545 Nov  1  2006 ASMContentHandler.Rule.html
>
> Of course that means that the tokens (TYPE/MODS/INT/NAME/HOUR/YEAR)  
> for the lexer are ambiguous.
> What should such a grammar look like? Pointers?

I think you have a number of options:

1. Given that many of the tokens look the same, don't try to  
differentiate between them in the lexer. Instead handle everything in  
the parser.

2. Use predicates in the lexer to turn alternatives on and off  
depending on which "column" you're in (i.e. make a context-sensitive  
lexer).

3. Don't use ANTLR for this task. The input is so limited and regular  
that it may be quicker to just write something by hand.

I personally would go with option 3 in this case because I think you are  
much more likely to end up with a correct parser if you write it by hand;  
ANTLR is a very complex tool and it can deviate from your expectations in  
incredibly subtle and hard-to-understand ways.
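
To give a rough idea of what option 3 might look like, here is a minimal
sketch in Python (chosen purely for illustration). The names `Entry` and
`parse_ls_line` are made up for this example, and it assumes every line
has exactly the nine whitespace-separated columns shown in your sample,
with the file name last. Anything else (a "total" line, symlink arrows,
extended attribute markers) is not handled and is simply rejected.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    """One parsed line of the listing (hypothetical record type)."""
    kind: str       # TYPE: first character of the mode column
    mods: str       # MODS: the nine permission characters
    links: int
    owner: str
    group: str
    size: int
    month: str
    day: int
    when: str       # HOUR ("22:54") or YEAR ("2006"), kept as text
    name: str


MODE_RE = re.compile(r"[-dlbcps][-rwxsStT]{9}")
INT_RE = re.compile(r"[0-9]+")
WHEN_RE = re.compile(r"[0-9]{1,2}:[0-9]{2}|[0-9]{4}")


def parse_ls_line(line: str) -> Optional[Entry]:
    """Parse one listing line; return None if it does not fit the assumed layout."""
    # Split into at most nine fields so a name containing spaces stays in one piece.
    fields = line.strip().split(None, 8)
    if len(fields) != 9:
        return None
    mode, links, owner, group, size, month, day, when, name = fields

    # The column position decides what a token means, so "782", "24" and
    # "2006" are never ambiguous here: each is checked in its own column.
    if not MODE_RE.fullmatch(mode):
        return None
    if not all(INT_RE.fullmatch(f) for f in (links, size, day)):
        return None
    if not WHEN_RE.fullmatch(when):
        return None

    return Entry(mode[0], mode[1:], int(links), owner, group,
                 int(size), month, int(day), when, name)
```

Calling `parse_ls_line` on each line of the output and skipping the
`None` results is all the "parser" there is. The point of the sketch is
that the column index does the disambiguation that a lexer would
otherwise need predicates for.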

Cheers,
Wincent