[antlr-interest] Grammar help

Mon Mar 15 20:54:31 PDT 2010

I am trying to create a grammar for a command language, and I'm stuck.  I'm
using ANTLR-3.1-2009-06-28 and libantlr3c-3.2.  The language is fairly
simple: commands are of the form Verb Noun.  However, some commands take a
file name (always the last item of the command), and because a file name can
contain such a wide range of characters, ANTLR gets confused.  So the
question is: how would I write a grammar that will work?

On Windows, a file name may contain any character except < > | ? * and ".
In my command language, a file name that contains spaces must be enclosed in
double quotes (" "), and I don't want the WS (white space) token to eat the
white space inside the quotes.  So a file name may be either a quoted string
(I'll strip off the quotes once I have the string) or an unquoted string.
It would also be nice to allow a LINE_COMMENT on the same line as a command
with a file name, but that is not a requirement.
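
To make the file name rules above concrete, here is a minimal C sketch of the
post-processing I have in mind once the lexer hands me the file name text.
The names is_valid_file_name and strip_quotes are hypothetical helpers of my
own, not part of ANTLR or libantlr3c, and the forbidden set is simply the
list of characters given above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Characters that may not appear in a file name (the list given above). */
static const char FORBIDDEN[] = "<>|?*\"";

/* Hypothetical helper: true if name is non-empty and contains none of the
   forbidden characters.  Call it after the quotes have been stripped. */
bool is_valid_file_name(const char *name)
{
    return name[0] != '\0' && strpbrk(name, FORBIDDEN) == NULL;
}

/* Hypothetical helper: removes one pair of enclosing double quotes in
   place, if both are present, and returns the same buffer.  White space
   inside the quotes is left untouched. */
char *strip_quotes(char *text)
{
    size_t len = strlen(text);

    if (len >= 2 && text[0] == '"' && text[len - 1] == '"') {
        memmove(text, text + 1, len - 2);   /* shift contents left by one */
        text[len - 2] = '\0';               /* drop the closing quote     */
    }
    return text;
}
```

The idea would be to call strip_quotes on the matched text first and then
is_valid_file_name on the result, so that a quoted name with embedded spaces
passes while a name containing, say, | is rejected.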

It occurred to me that instead of trying to build a token that overlaps with
pretty much every other token, I could just grab everything from where the
file name starts on the line to the end of the line, but I don't know how to
do that in ANTLR.
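
In plain C, outside of ANTLR, what I mean by grabbing the rest of the line is
roughly the sketch below.  rest_of_line is a hypothetical helper, not an
ANTLR or libantlr3c function, and I am assuming a NUL-terminated ASCII
buffer.  It does not look at quotes or comments; it only shows the
"everything up to the end of the line" idea.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the text that starts at `start` and runs to
   the end of the current line into `out`, skipping leading blanks and
   trimming trailing blanks.  The result is always NUL-terminated and is
   truncated to out_size - 1 characters.  Returns the number of characters
   stored (not counting the terminator). */
size_t rest_of_line(const char *start, char *out, size_t out_size)
{
    size_t len;

    if (out_size == 0)
        return 0;

    start += strspn(start, " \t");          /* skip leading blanks        */
    len = strcspn(start, "\r\n");           /* stop at CR, LF, or the NUL */

    while (len > 0 && (start[len - 1] == ' ' || start[len - 1] == '\t'))
        len--;                              /* trim trailing blanks       */

    if (len >= out_size)
        len = out_size - 1;                 /* truncate to fit the buffer */

    memcpy(out, start, len);
    out[len] = '\0';
    return len;
}
```

Called with a pointer just past the verb and noun, this would hand back the
file name with its embedded spaces intact, which is the behaviour I would
like to get out of the lexer.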

When I compile the grammar with ANTLR, I get the following:

warning(149): Commands.g:0:0: rewrite syntax or operator with no output option; setting output=AST

warning(200): Commands.g:146:14: Decision can match input such as "{'\u0000'..'\f', '\u000E'..')', '+'..';', '=', '@'..'{', '}'..'\uFFFF'}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input

warning(200): Commands.g:146:14: Decision can match input such as "'\r'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input

error(208): Commands.g:151:1: The following token definitions can never be matched because prior tokens match the same input: WS

ANTLR still generates a lexer and a parser, but they are useless: any text
is accepted as a match, even if it isn't in the defined token list.

The following is an abbreviated version of the grammar.  The real grammar
has a lot more verbs and nouns, but this should give you the flavor of what
I'm trying to do.

//
// This grammar defines the commands available to the DiskTool (DT) program
//
grammar Commands;

options
     {
     language = C;
     backtrack = true;
     memoize = true;
     }

@lexer::header
{
#define    ANTLR3_INLINE_INPUT_ASCII
}

//+
// Productions
//-

commands
     :
     (script_command
     | dump_command
     ! show_command
     )*;

script_command
     :  '@'
     // $FILE_NAME is a token pointer; .text->chars yields its C string
     FILE_NAME       {printf ("File name [\%s]\n", $FILE_NAME.text->chars);}
     ;

dump_command
     : DUMP
     (dump_struct
     | dump_block
     | a_file
     );

show_command
     : SHOW
     (structure_nouns
     | storage_nouns
     | a_file
     );

mbr_vbr
     : MBR
     | VBR
     ;

block_nouns
     : LBN
     | LCN
     | VBN
     | VCN
     ;

structure_nouns
     : MBR
     | VBR
     ;

dump_block
     : block_nouns
     number
     ((',' number)
     | (':' number))?
     DRIVE_NAME?
     ;

dump_struct
     : mbr_vbr
     ('/' qualifier)?
     DRIVE_NAME?
     ;

storage_nouns
     : DISK
     | VOLUME
     ;

a_file
     : FILE
     // $FILE_NAME is a token pointer; .text->chars yields its C string
     FILE_NAME       {printf ("File name [\%s]\n", $FILE_NAME.text->chars);}
     ;

number
     : DEC_NUMBER
     | HEX_NUMBER
     ;

qualifier
     : ALL
     ! CODE
     | TABLE
     ;

//+
// Tokens
//-

// Verbs
DUMP : 'DUMP';
SHOW : 'SHOW';

// Nouns
DISK : 'DISK';
FILE : 'FILE';
LBN  : 'LBN';
LCN  : 'LCN';
MBR  : 'MBR';
PBN  : 'PBN';
VBN  : 'VBN';
VBR  : 'VBR';
VCN  : 'VCN';
VOLUME     : 'VOLUME';

// Qualifiers
ALL  : 'ALL';
CODE : 'CODE';
TABLE : 'TABLE';

// Miscellaneous tokens
DRIVE_NAME : LETTER ':';

fragment
LETTER     : 'A'..'Z';

fragment
DIGIT : '0'..'9';

fragment
HEX_DIGIT  : (DIGIT | 'A'..'F');

HEX_NUMBER : '0X' HEX_DIGIT+;

DEC_NUMBER : DIGIT+;

FILE_NAME  :  ~('|' | '<' | '>' | '*' | '?')+ (('\r'? '\n') | EOF);

LINE_COMMENT
     : '!' ~('\n'|'\r')* (('\r'? '\n') | EOF) {$channel=HIDDEN;};

WS   : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};