[antlr-interest] ambiguous lexer tokens

Randall R Schulz rschulz at sonic.net
Wed Jun 27 15:58:56 PDT 2007


On Wednesday 27 June 2007 13:44, Torsten Curdt wrote:
> I would like to write a grammar for the following output:
>
>   drwxr-xr-x   23 tcurdt  tcurdt    782 Jun 24 22:54 ..
>   -rw-r--r--    1 tcurdt  tcurdt  18545 Nov  1  2006 ASMContentHandler.Rule.html
>
> My first naive try was
>
>   grammar test;
>
>   prog
> 	: (line)+ EOF
> 	;
>
>   line
> 	: TYPE MODS WS INT WS NAME WS NAME WS INT WS NAME WS (HOUR | YEAR) WS NAME NEWLINE
> 	;
>
>   TYPE
> 	: ['d' | '-' ]
> 	;

There are several other file types. The common type characters are:

- plain file
d directory
p pipe (named pipe / FIFO)
s socket
l symbolic link
b block special (e.g., a disk or disk partition)
c character special (e.g., a (pseudo-) tty or serial port)
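
To make that set concrete, here is a minimal sketch in Python (not 
ANTLR) of a lookup on the first character of the mode string. The names 
FILE_TYPES and file_type are made up for illustration, and the table 
covers only the characters listed above.

```python
# Hypothetical lookup table for the first character of an "ls -l" mode string.
# It covers only the type characters listed above; some systems have more.
FILE_TYPES = {
    "-": "plain file",
    "d": "directory",
    "p": "named pipe (FIFO)",
    "s": "socket",
    "l": "symbolic link",
    "b": "block special",
    "c": "character special",
}


def file_type(mode_string):
    """Describe the file type, or return None if the type character is unknown."""
    if not mode_string:
        return None
    return FILE_TYPES.get(mode_string[0])
```

An unknown type character yields None rather than an error, so a caller 
can decide how strict to be.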


>   MODS
> 	: (['r' | 'w' | 'x' | '-' ]){9}
> 	;

You can strengthen the portions that recognize the modes by observing 
that they come in groups of three and that each position has either a 
permission character (if granted) or a dash (if not). The owner and 
group 'x' positions may instead hold an 's' (set user ID or set group 
ID, resp., with execute permission) or a capital 'S' (the same bit 
without execute permission).

Keep in mind, too, that the last character has an extra value beyond the 
usual 'x' permission bit. Sticky executables (technically obsolescent) 
or directories are displayed with a 't' in place of their world execute 
bit (or a capital 'T' if that execute bit is not set).

On some systems that support ACLs, the presence of ACLs that don't fit 
the classic Unix model will cause a plus sign ('+') to be appended to 
the mode string.
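
The structure described above (a type character, three groups of three, 
an optional trailing plus) can be sketched as a regular expression. 
This is Python rather than ANTLR, and only an illustration: MODE_RE and 
parse_mode are hypothetical names, and the accepted characters are an 
assumption based on common "ls" behavior ('s'/'S' in the owner and 
group execute positions, 't'/'T' in the last one), not a complete 
survey of every "ls" implementation.

```python
import re

# Assumed layout: type character, owner/group/other triplets, optional '+'.
MODE_RE = re.compile(
    r"""
    (?P<type>[-dpslbc])         # file type character
    (?P<owner>[r-][w-][xsS-])   # owner: read, write, execute or setuid
    (?P<group>[r-][w-][xsS-])   # group: read, write, execute or setgid
    (?P<other>[r-][w-][xtT-])   # other: read, write, execute or sticky
    (?P<acl>\+?)                # optional ACL marker
    """,
    re.VERBOSE,
)


def parse_mode(mode_string):
    """Return the named parts of a mode string, or None if it does not match."""
    match = MODE_RE.fullmatch(mode_string)
    return match.groupdict() if match else None
```

Making each position accept only its own letters rejects strings such 
as "-xxxxxxxxx", which a nine-fold repetition of r/w/x/- would accept.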


>   ...
>   NAME
> 	: ['0'..'9' | 'a'-'z' | 'A'..'Z' | '.' | '-']+
> 	;

Technically, on Unix (-like) systems, which this seems to be, the only 
characters that may not be part of a file name are the slash ('/') and 
the NUL byte. Spaces, newlines and control characters are all legal. 
Perhaps more to the point, you'll have to know precisely how the "ls" 
command(s) you're dealing with present file names, especially those 
with non-ASCII or non-printing characters in them, all of which are 
possible.
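
Because almost anything can appear in a name, one option is to stop 
splitting after the eighth column and treat the rest of the line as the 
name. The following is a minimal Python sketch of that idea, not an 
ANTLR solution. It assumes the nine-column layout of the sample lines 
above; split_ls_line and the field names are hypothetical.

```python
FIELD_NAMES = (
    "mode", "links", "owner", "group", "size",
    "month", "day", "time_or_year", "name",
)


def split_ls_line(line):
    """Split one long-format line into nine fields, or return None.

    Only the first eight runs of whitespace are treated as separators,
    so spaces inside the file name are preserved.
    """
    fields = line.rstrip("\n").split(None, 8)
    if len(fields) != len(FIELD_NAMES):
        return None
    return dict(zip(FIELD_NAMES, fields))
```

This still loses leading spaces in a name, and it breaks if an owner or 
group name contains whitespace, so it is only as reliable as the "ls" 
output it is fed.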


>   ...
>
> Of course that means that the tokens (TYPE/MODS/INT/NAME/HOUR/YEAR)
> for the lexer are ambiguous.
>
> How should such a grammar look like? Pointers?
>
> cheers
> --
> Torsten


I'm not sure what your overall goal is, but perhaps using the "getfacl" 
command, if available on your system, would present you with a more 
tractable format?


Good luck.


Randall Schulz

