[antlr-interest] Newbie problem with line-oriented parsing

Mon Feb 15 12:05:18 PST 2010

Hi all -

I'm new here, so be nice to me. Further, let me start by apologizing for
such a verbose first message. 
I have started porting a DSL, one that I've been supporting for 15+
years, from a lex/yacc-based toolset (via a tool called MetaTool) to
ANTLR.

I've been looking through the various materials available on the net and
have a copy of The Definitive ANTLR Reference. As I started porting the
grammar (EBNF-ish), I ran into something I don't know how to deal
with. Unfortunately, I need to drag everyone through some background to
get to the question; however, I can start with the grammar I'm struggling
with and the immediate problem.

<><><><> cut here - flowgen.g <><><><>
grammar flowgen;

options {
  language = Java;
}

/* *********** */
/* TRANSACTION */
/* *********** */
transaction:  ( ((KEY_START DEFINE_k) => xdefine*) tran_name message+ );

xdefine: KEY_START DEFINE_k ID_name NEW_LINE;

tran_name: ~(KEY_START|NP_START|NEWLINE_) .* NEW_LINE;

message:  num1? from_name num2? to_name ((~(NP_START|WHITE|NEWLINE_)) =>
msg_name?) NEW_LINE;

num1: FLOATnumber;
num2: FLOATnumber;

from_name: COLUMN_name;
to_name:   COLUMN_name;

msg_name: MSG_name;

// Tokens - keywords
DEFINE_k:       'DEFINE';

// Tokens - operators
fragment KEY_START: '$';
fragment NP_START:  '%';
NEW_LINE: NEWLINE_; 

// Tokens - names and numbers
fragment NUMBER:    '0'..'9';
fragment UPPERCASE: 'A'..'Z';
fragment VARBASE:   UPPERCASE (UPPERCASE|NUMBER|'_')*;
fragment VARNAME:   '$' VARBASE;
fragment WHITE:     ' '|'\t';
fragment NEWLINE_:  '\n'|'\r';

FLOATnumber: NUMBER+ ('.' NUMBER+)?;

ID_name:  VARBASE;
VAR_name: VARNAME;

COLUMN_name: ( (ALPHA|NUMBER) (ALPHA|NUMBER|'_'|'&'|'-')*
             | VARNAME
             );
//  name:   <([A-Za-z0-9][A-Za-z0-9_&-]*)?(\$[A-Z][A-Z0-9_]*)*>

WS:	(WHITE|NEWLINE_)+ {skip();};
NON_PRINTING_COMMENT: NP_START .* NEWLINE_ {skip();};

MSG_name:  .*;
<><><><> end <><><><>

When I run this through antlr I get the following error:
Grammar: src/flowgen.g
error(201): src/flowgen.g:57:12: The following alternatives can never be
matched: 1
 |---> MSG_name:  .*;

1 error

BUILD FAIL
(this is courtesy of the antlrv3ide plugin for Eclipse; similar results
occur with ANTLRWorks)

************ BEGIN BACKGROUND ************
This language, flowgen, is used to specify Message Sequence Charts. We
could be using ITU Z.120 for this, but since our local DSL predates
Z.120 we have some interest in maintaining this language. The flowgen
language is a simplified version of Z.120 in that the input language is
simple and direct, and using the flowgen tools one can create the
corresponding picture (and even the corresponding Z.120 input). [After
re-reading that, I'm not sure the background helps OTHER than to note
that it's an old DSL and there is a solid user base not interested in
moving to another DSL that is overly-complicated for the particular job
at hand.]

The format of a flowgen input file is simple: The first non-commented
line is the "title" of the flow, and all subsequent lines represent
messages in the flow. Newlines separate the constructs.

Here is an example flowgen input file:

	 1. % Here is a comment
	 2. Simple flowgen flow
	 3. % Show a message going from A to B to C and back.
	 4. A	B	Message 1
	 5. # This is the first message in the sequence
	 6. B	C	Message 2
	 7. # This is the next message
	 8. C	B
	 9. % Note how the above message has no message name
	10. B	A	End

And this is the output of "classic" flowgen.

Simple flowgen flow    Page: 1

           A              B              C
           |              |              |
           | [1] Message 1|              |
           |o------------>|              |
           |              |              |
           | This is the first message in the sequence
           |              |              |
           |              | [2] Message 2|
           |              |o------------>| 
           |              |              |
           |              | This is the next message
           |              |              |
           |              | [3]          |
           |              |<------------o| 
           |              |              |
           | [4] End      |              |
           |<------------o|              |
           |              |              |

Some notes:
Lines 1, 3, and 9 are "comment" lines and are ignored.

In this language, there are several constructs that map well to
grammar-based solutions.
* A title is the text associated with the first non-commented line
* A message is the pair (arrow,note) where an arrow represents the
message moving from one place to another and a note is optional text
used to describe something about the message.
* An arrow is the triple (from,to,message_text) where from and to are
required and represent column names (equivalent to IDs in other
pedagogic grammars), while message_text is optional and represents the
"name" of the message. 
* A note is associated with an arrow and is a multi-line construct. Each
of these lines begins with any number of '#' characters, but it is only
the text after the '#'s that comprises the note.
* A comment starts with the % character and continues to the end of the
line [akin to the C++/Java '//' comment]
* Blank lines are ignored, independent of context.
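
To make the line-level rules above concrete, here is a rough sketch in
plain Java of how a single input line could be classified. This is not
part of flowgen, MetaTool, or ANTLR; the class and method names
(FlowgenLines, classify, noteText, splitArrow) are made up for
illustration. It assumes leading whitespace before '%' or '#' is
allowed, and it ignores the optional numbers and the $DEFINE lines that
appear in the grammar.

```java
public class FlowgenLines {
    // Hypothetical line categories; TEXT covers both the title and arrows.
    public enum Kind { BLANK, COMMENT, NOTE, TEXT }

    // Classify one input line by its first non-blank character.
    public static Kind classify(String line) {
        String t = line.trim();
        if (t.isEmpty()) return Kind.BLANK;
        if (t.charAt(0) == '%') return Kind.COMMENT;
        if (t.charAt(0) == '#') return Kind.NOTE;
        return Kind.TEXT;
    }

    // Strip any number of leading '#' characters; the rest is the note text.
    public static String noteText(String line) {
        String t = line.trim();
        int i = 0;
        while (i < t.length() && t.charAt(i) == '#') i++;
        return t.substring(i).trim();
    }

    // Split an arrow line into from, to, and (optionally) the message text.
    // The limit of 3 keeps embedded blanks inside the message text.
    public static String[] splitArrow(String line) {
        return line.trim().split("\\s+", 3);
    }

    public static void main(String[] args) {
        String[] input = {
            "% Here is a comment",
            "Simple flowgen flow",
            "A\tB\tMessage 1",
            "# This is the first message in the sequence",
            "C\tB"
        };
        boolean titleSeen = false;
        for (String line : input) {
            switch (classify(line)) {
                case BLANK:
                case COMMENT:
                    break; // ignored, independent of context
                case NOTE:
                    System.out.println("note:  " + noteText(line));
                    break;
                case TEXT:
                    if (!titleSeen) {
                        // The first non-comment line is the title.
                        System.out.println("title: " + line.trim());
                        titleSeen = true;
                    } else {
                        String[] a = splitArrow(line);
                        String msg = a.length > 2 ? a[2] : "";
                        System.out.println("arrow: " + a[0] + " -> " + a[1] + " [" + msg + "]");
                    }
                    break;
            }
        }
    }
}
```

The point of the sketch is only that the meaning of a line depends on
where it sits in the file (title vs. arrow), which is the part I don't
know how to express in an ANTLR lexer.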

************ END BACKGROUND ************

Given this understanding, I created the grammar above. I'm not sure a)
what to do about the error, and, more importantly, b) how to convince
an ANTLR grammar to do what I want it to do. In comparison with the
prior toolset, the LL vs. LR question doesn't
bother me. However, the way MetaTool handled restrictions on the lexical
space was to take advantage of lex's "start states". The flowgen grammar
has become so complicated [I've only given a snapshot; it is much more
substantial] that we've broken lex and are about to break flex. Similar
problem with yacc/bison, hence the desire to migrate to something more
robust.
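
For anyone unfamiliar with start states, the following hand-written
Java sketch shows the kind of behavior I mean: the scanner carries a
state, and the same characters are tokenized differently depending on
that state. This is only an illustration, not MetaTool or lex output;
the class name StartStateSketch, the method scanArrow, and the state
names are hypothetical, and the token labels simply reuse the rule
names from the grammar above.

```java
import java.util.ArrayList;
import java.util.List;

public class StartStateSketch {
    // Hypothetical scanner states for one arrow line: from, to, message.
    public enum State { FROM, TO, MSG }

    // Scan one arrow line; what counts as a token depends on the state.
    public static List<String> scanArrow(String line) {
        List<String> tokens = new ArrayList<String>();
        State state = State.FROM;
        int i = 0;
        int n = line.length();
        while (i < n) {
            // Blanks and tabs separate tokens in every state.
            while (i < n && (line.charAt(i) == ' ' || line.charAt(i) == '\t')) i++;
            if (i >= n) break;
            if (state == State.MSG) {
                // In the MSG state the rest of the line is a single token,
                // embedded blanks included.
                tokens.add("MSG_name:" + line.substring(i).trim());
                break;
            }
            // In the FROM and TO states a token ends at the next blank or tab.
            int start = i;
            while (i < n && line.charAt(i) != ' ' && line.charAt(i) != '\t') i++;
            tokens.add("COLUMN_name:" + line.substring(start, i));
            state = (state == State.FROM) ? State.TO : State.MSG;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scanArrow("A\tB\tMessage 1"));
        System.out.println(scanArrow("C\tB"));
    }
}
```

In lex this switching is what the start states give me for free; my
question is what the equivalent idiom is in ANTLR.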

Thanks for hearing me out and I look forward to your
recommendations/suggestions.

Ron Crocker
Fellow of the Technical Staff
Motorola