[antlr-interest] Unix-like parameters grammar

Thu Jan 24 15:41:38 PST 2008

Cristian Peraferrer wrote:
> I am trying to build a grammar to parse unix-like parameters, but I'm  
> having problems with the FileName token.

Depending on how unix-like you want to be, you may need to
think about how this actually works in unix.

When a program gets launched as a result of a shell command,
two things happen:

1. The shell lexes the command line into a sequence of tokens
    that become the program name and its list of argument strings.

2. The program looks at its sequence of argument strings and
    decides which ones represent options, or groups of options,
    or things-that-aren't-options (often, but hardly always,
    filenames).

(1) happens without any awareness of options, filenames, or
special meanings for -. All it does is tokenize a string
by splitting at all occurrences of non-quoted whitespace.
On the other hand, it is aware of all the shell's styles
of quotation, so that you could give a sequence of commands
(in sh, or ksh, or bash) like

     touch "My File"
     "ls" My\ File
     \l\s '-l'"t" My' 'File
     r'm' \-f M"y F"ile

where the first and second lines have two tokens each, and the
third and fourth have three each. (I assume you have no need to
duplicate the shell's grammar of control structures or its
variable substitution features, so all you need is a lexer that
knows the quoting rules.)

(2) happens without any awareness of the shell's quoting rules; those
have all been applied. It's also no longer lexing from a string;
its input is now the sequence of strings resulting from the
shell's tokenizing, and the shell's quoting characters have served
their purpose and are now gone. For example, the final token in
each of the above commands is the seven-character string that
starts with M and ends with e.

It's in (2) that the program itself checks for (already tokenized)
strings that begin with - and treats them as options, or treats
the exact string - as representing stdin.  Your question about how
to tokenize FileName probably stems from thinking that you need to
tokenize it. You don't; it's anything that was passed to you by
the shell as a single string and that you don't think is an option.
Unix will happily create a file with any string for a name, as
long as it doesn't contain \x00 anywhere or \x2f except in
reasonable places - those are the only two restrictions (though a
Unix system can mount different types of filesystem that may impose
further restrictions).

If you are writing a genuine Unix command-line program, of course
you rely on the genuine shell to do (1) and your program is only
concerned with (2).  If you are setting out to replicate the
behavior in a different environment, it probably lends itself
best to a kind of staged approach with a lexer that simply
understands quoting and returns a stream of a single token
type ARGSTRING, then parhaps a token-filter stage that could
replace an ARGSTRING with one or more option tokens if it starts
with - (and no -- has been seen), and then a parser on the
end of that. That might be the most natural high-level ANTLRish
picture of what's really going on--though of course most Unix
programs use much more lightweight library functions for the purpose.

-Chap