[antlr-interest] Newbie problem with line-oriented parsing

Mon Mar 8 11:48:32 PST 2010

Thanks for the help.  

Here's my newest grammar for flowgen, meeting the intent of what I
posted before but handling MOST of the previous flowgen language. I
still have some work to do on expressions and other constructs, but it's
getting there... Running in ANTLRWorks, I can see parse trees and debug,
and it will parse an entire flowgen file that was supported by the
previous tool.

There are a couple of things I don't like in my ANTLR grammar, and I
could use your help:

Consider this input:
  <><> cut here <><>
  $DEFINE SDL

  HSS_requested_PDN_discon_ISD_Success_F15102
  $DEF FOO
  <><> end cut <><>

Here's a parse tree from this grammar for that input.
  transaction
  - xdefine
  - tran_name
    - word: HSS_requested_PDN_discon_ISD_Success_F
    - word
      - float_number: 15102
    - \n
  - tran_item
    - xdefine_inline
      - $DEF
      - FOO
      - rest_of_line: \n

A) I would like word to treat
"HSS_requested_PDN_discon_ISD_Success_F15102" as one word, not two.
B) I don't like rest_of_line, because it returns too many parts when a
line contains something like "_____": the current rule returns 5 things,
one for each _. If the line said "FRED" instead, it would return 1
thing. I'm sure the cause is a token definition; I just don't know how
to fix it.

It's possible these two things are the same problem.
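
For what it's worth, both symptoms can be reproduced outside ANTLR with
a rough emulation of three of the lexer rules. The sketch below is plain
Java using java.util.regex and a longest-match loop, which only
approximates how an ANTLR 3 lexer chooses tokens. The class name
ColNameSketch and the WIDENED rule set (digits allowed after the first
character of COL_NAME, UNDERBAR matching a run of underscores) are
assumptions made for illustration and have not been tried against the
full grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColNameSketch {
    /** A named lexer rule, approximated by a regular expression. */
    static final class Rule {
        final String name;
        final Pattern pattern;

        Rule(String name, String regex) {
            this.name = name;
            this.pattern = Pattern.compile(regex);
        }
    }

    // Approximation of the rules as posted: no digits in COL_NAME,
    // and UNDERBAR is exactly one underscore.
    static final Rule[] POSTED = {
        new Rule("COL_NAME", "[A-Za-z][A-Za-z_-]*"),
        new Rule("INTnumber", "[0-9]+"),
        new Rule("UNDERBAR", "_"),
    };

    // Hypothetical change: digits allowed in the tail of COL_NAME,
    // and UNDERBAR matches a whole run of underscores.
    static final Rule[] WIDENED = {
        new Rule("COL_NAME", "[A-Za-z][A-Za-z0-9_-]*"),
        new Rule("INTnumber", "[0-9]+"),
        new Rule("UNDERBAR", "_+"),
    };

    /** Splits input into "RULE:text" tokens, taking the longest match at each position. */
    static List<String> tokenize(String input, Rule[] rules) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Rule rule : rules) {
                Matcher m = rule.pattern.matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.name;
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at offset " + pos);
            }
            tokens.add(bestName + ":" + input.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String name = "HSS_requested_PDN_discon_ISD_Success_F15102";
        System.out.println(tokenize(name, POSTED));
        System.out.println(tokenize(name, WIDENED));
        System.out.println(tokenize("_____", POSTED));
        System.out.println(tokenize("_____", WIDENED));
    }
}
```

With POSTED, the transaction name comes out as a COL_NAME followed by an
INTnumber, and "_____" as five UNDERBAR tokens. With WIDENED, each
input is a single token. Whether column names may contain digits is a
language decision this sketch does not settle.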

Here's the grammar file. I'm running in both Eclipse and ANTLRWorks, but
the output above came from the ANTLRWorks debug environment. I did a
little manual editing of that output to give it some structure.

Ron
<><><><> cut here - flowgen.g <><><><>
grammar flowgen;

options {
  language = Java;
}

WS:	(WHITE)+ {skip();};

/* *********** */
/* TRANSACTION */
/* *********** */
//
// A transaction file is the main kind of flowgen file. It is the one
// that results in arrows on paper.
//
// A transaction file comprises a title and at least one message.
// Further, a transaction file can begin with a set of variable
// definitions that are valid while that file is "in context."
// (* more on that later)
//
transaction:  xdefine* tran_name tran_item+ ;

//===
// DEFINE handling
// - There are 3 kinds of define actions:
//   a) Defines that happen at the beginning of the file
//   b) Defines that occur DURING the message flow.
//   c) Undefines can occur with $INCLUDE statements
//===
xdefine:  DEFINE_k COL_NAME rest_of_line;
xdefine_inline: DEF_k COL_NAME rest_of_line;
undef_inline: UNDEF_k COL_NAME rest_of_line;

defval: COL_NAME;
undefid: COL_NAME;

//===
// TRANSACTION ITEM handling
// A transaction item is the principal structure of a transaction.
//===

tran_item:  ( mn_pair 
    | if_stmt 
    | free_comment 
    | inc_stmt 
//    | par_block
    | opt_msg 
//    | opt_block
    | xdefine_inline
    | undef_inline
    | box_msg
//    | twoway_msg
//    | oneway_msg
    | return_stmt
    );

//===
// IF handling
//===
if_stmt:  ifpart tran_item+ elseifpart* elsepart? ENDIF_k;

ifpart:   ifkind new_line;
ifkind:   IFDEF_k id
        | IFNDEF_k id
        | IF_k expr
        ;

elseifpart: elseifkind new_line tran_item+;
elseifkind: ELIFDEF_k id
          | ELIFNDEF_k id
          | ELIF_k expr
          ;
elsepart: ELSE_k new_line tran_item+;

new_line:	NEW_LINE;
id	:	word;
expr	:	word;
//===
// OPTIONAL handling
// - 2 kinds: optional message and optional blocks
//===
opt_msg:        OPTMSG_k optmsg_content;
optmsg_content: box_msg 
              | mn_pair 
//              | twoway_msg 
//              | oneway_msg 
              ;

//===
// 
//===
//===
// RETURN handling
//===
return_stmt:  RETURN_k NEW_LINE;

//===
// Transaction Name. 
// TODO: Add notes for title page?
//===
tran_name: word+ NEW_LINE;

//===
// FREE COMMENT handling
//===
free_comment: FC_KEY COL_NAME? (Lbrack)=>fc_params? rest_of_line 
              noteline+;
fc_params:  Lbrack fc_par  Rbrack ;
fc_par:   inc_box;
//===
// BOX handling
//===
box_msg: BOX_k mn_pair;

//===
// INCLUDE handling
//===

inc_stmt: INCLUDE_k path inc_params? new_line ; 
/*
      revflag:INT = FALSE;
      nsubs:INT = 0;
      defs:SYMTAB_PTR = 0;
      ndefs:INT = 0;)
*/
inc_params: Lbrack inc_par (';' inc_par)* Rbrack;

inc_par:   inc_def 
         | inc_undef 
         | inc_sub 
         | inc_box 
         | inc_arrow 
         | inc_opt 
         ;
inc_def:   DEF_k defval (COMMA defval)*;
inc_undef: UNDEF_k undefid (COMMA undefid)*;
inc_sub:   old_name=COL_NAME EQUALS new_name=COL_NAME;
inc_box:   BOX_k;
inc_arrow: TWOWAY_k | ONEWAY_k;
inc_opt:   OPT_k | OPTMSG_k;

path: SLASH? relative_path ;

relative_path: path_word (SLASH path_word)*;

path_word: (up_directory)=>up_directory
         | word+
         ;

//===
// Message handling
// - A message is one of the building blocks of the flow. It describes
//   information flow from one point to another (or both directions).
// - There are 
//===

mn_pair:  message noteline*;
message:  num1? from_name num2? to_name rest_of_line;

noteline: NOTE_KEY rest_of_line;

rest_of_line: (options {greedy=false;}:.*) NEW_LINE;

num1: float_number;
num2: float_number;

float_number: INTnumber (DOT INTnumber)?;

from_name: COL_NAME;
to_name:   COL_NAME;

word: COL_NAME
    | float_number
    | randomPrintable+;

// Tokens - operators
fragment KEY_START:  '$';
fragment DOT_:       '.';
fragment STAR_:      '*';
fragment FC_START:   '&';
fragment NP_START:   '%';
fragment NOTE_START: '#';
fragment COMMA_:     ',';
fragment EQUALS_:    '=';
fragment SLASH_:     '/';
fragment L_BRACK_:   '[';
fragment R_BRACK_:   ']';
fragment UNDERBAR_:  '_';
fragment MINUS_:     '-';

Lbrack: L_BRACK_;
Rbrack: R_BRACK_;

up_directory: DOT DOT;	

DOT:      DOT_;
NOTE_KEY: NOTE_START;
FC_KEY:   FC_START;
COMMA:    COMMA_;
EQUALS:   EQUALS_;
SLASH:    SLASH_;
UNDERBAR:	UNDERBAR_;
MINUS	:	MINUS_;

// Tokens - keywords (hint: keywords end with '_k')
BLOCK_k:    '$BLOCK';
BOX_k:      '$BOX';

DEF_k:      '$DEF';
DEFINE_k:   '$DEFINE';

ELIF_k:     '$ELIF';
ELIFDEF_k:  '$ELIFDEF';
ELIFNDEF_k: '$ELIFNDEF';
ELSE_k:     '$ELSE';
ENDIF_k:    '$ENDIF';
ENDPAR_k:   '$ENDPAR';

IF_k:       '$IF';
IFDEF_k:    '$IFDEF';
IFNDEF_k:   '$IFNDEF';
INCLUDE_k:  '$INCLUDE';

ONEWAY_k:   '$1WAY';
OPT_k:      '$OPTIONAL';
OPTEND_k:   '$ENDOPTIONAL';
OPTMSG_k:   '$OPT';

PAR_k:      '$PAR';

RETURN_k:   '$RETURN';

TWOWAY_k:   '$2WAY';

UNDEF_k:    '$UNDEF';

// Tokens - names and numbers
fragment NUMBER:    '0'..'9';
fragment UPPERCASE: 'A'..'Z';
fragment LOWERCASE: 'a'..'z';
fragment PRINTABLE: ' '..'~';

NON_ALPHANUM: 	  '!'|'"'|'\''..')'|'+'..','  	// ! " ' ( ) + ,
		| ':'..'@'			// : ; < = > ? @
		| '\\'|'`'|'^'  		// \ ` ^
		| '{'..'~';  			// { | } ~
randomPrintable: 
            ( DOT | KEY_START | FC_START | NOTE_START | Lbrack | Rbrack
            | NON_ALPHANUM | SLASH | UNDERBAR | MINUS
            )+;

INTnumber: NUMBER+;
COL_NAME: (UPPERCASE|LOWERCASE) (UPPERCASE|LOWERCASE|UNDERBAR_|MINUS_)*;

// Tokens - Whitespace
fragment WHITE:     ' '|'\t';
fragment NEWLINE_:  '\r'| '\n';

NEW_LINE: NEWLINE_+; /* probably needs to include CR and CRLF */
NON_PRINTING_COMMENT: NP_START .* NEW_LINE {skip();};
<><><><> end <><><><>

-----Original Message-----
From: antlr-interest-bounces at antlr.org
[mailto:antlr-interest-bounces at antlr.org] On Behalf Of Jim Idle
Sent: Monday, February 15, 2010 2:14 PM
To: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Newbie problem with line-oriented parsing

You can't use .* in the lexer, only .  The . rule should be the last one
in the lexer and is just used to catch any character that you have not
otherwise matched (usually indicates a spurious character).

Make sure that your lexer rules are not ambiguous - they must not
overlap :-)
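
A rough illustration of the "catch-all rule goes last" idea, written as
plain Java with java.util.regex rather than ANTLR-generated code. The
class CatchAllSketch and the rule names NAME, INT, WS and ANY are made
up for this sketch. The specific rules are tried first, and the
single-character ANY alternative only fires when nothing else matches,
which makes spurious characters easy to spot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CatchAllSketch {
    // Specific rules first; the single-character catch-all is deliberately last.
    private static final Pattern RULES = Pattern.compile(
        "(?<NAME>[A-Za-z][A-Za-z_-]*)|(?<INT>[0-9]+)|(?<WS>[ \\t]+)|(?<ANY>.)",
        Pattern.DOTALL);
    private static final String[] ORDER = {"NAME", "INT", "WS", "ANY"};

    /** Returns "RULE:text" tokens; whitespace is skipped, unmatched characters become ANY. */
    static List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = RULES.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            m.lookingAt(); // always succeeds: ANY matches any single character
            for (String name : ORDER) {
                String text = m.group(name);
                if (text != null) {
                    if (!name.equals("WS")) {
                        out.add(name + ":" + text);
                    }
                    break;
                }
            }
            pos = m.end();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan("A B 12 ~"));
    }
}
```

Here the first characters of NAME, INT and WS are disjoint, so the
order of the specific alternatives does not matter; only the position
of ANY does.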

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Crocker Ron-QA1007
> Sent: Monday, February 15, 2010 12:05 PM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] Newbie problem with line-oriented parsing
> 
> Hi all -
> 
> I'm new here, so be nice to me. Further, let me start by apologizing
> for such a verbose first message.
> I have started porting a DSL, one that I've been supporting for 15+
> years, from a lex/yacc-based toolset (via a tool called MetaTool) to
> ANTLR.
> 
> I've been looking through the various materials available on the net
> and have a copy of The Definitive ANTLR Reference. As I started
> porting the grammar (EBNF ish) I've run into something I don't know
> how to deal with. Unfortunately I need to drag everyone through some
> background to get to the question, however I can start with the
> grammar I'm struggling with and the immediate problem.
> 
> <><><><> cut here - flowgen.g <><><><>
> grammar flowgen;
> 
> options {
>   language = Java;
> }
> 
> /* *********** */
> /* TRANSACTION */
> /* *********** */
> transaction:  ( ((KEY_START DEFINE_k) => xdefine*) tran_name message+ );
> 
> xdefine: KEY_START DEFINE_k ID_name NEW_LINE;
> 
> tran_name: ~(KEY_START|NP_START|NEWLINE_) .* NEW_LINE;
> 
> message:  num1? from_name num2? to_name
>           ((~(NP_START|WHITE|NEWLINE_)) => msg_name?) NEW_LINE;
> 
> num1: FLOATnumber;
> num2: FLOATnumber;
> 
> from_name: COLUMN_name;
> to_name:   COLUMN_name;
> 
> msg_name: MSG_name;
> 
> // Tokens - keywords
> DEFINE_k:       'DEFINE';
> 
> // Tokens - operators
> fragment KEY_START: '$';
> fragment NP_START:  '%';
> NEW_LINE: NEWLINE_;
> 
> // Tokens - names and numbers
> fragment NUMBER:    '0'..'9';
> fragment UPPERCASE: 'A'..'Z';
> fragment VARBASE:   UPPERCASE (UPPERCASE|NUMBER|'_')*;
> fragment VARNAME:   '$' VARBASE;
> fragment WHITE:     ' '|'\t';
> fragment NEWLINE_:  '\n'|'\r';
> 
> FLOATnumber: NUMBER+ ('.' NUMBER+)?;
> 
> ID_name:  VARBASE;
> VAR_name: VARNAME;
> 
> COLUMN_name: ( (ALPHA|NUMBER) (ALPHA|NUMBER|'_'|'&'|'-')*
>              | VARNAME
>              );
> //  name:   <([A-Za-z0-9][A-Za-z0-9_&-]*)?(\$[A-Z][A-Z0-9_]*)*>
> 
> WS:	(WHITE|NEWLINE_)+ {skip();};
> NON_PRINTING_COMMENT: NP_START .* NEWLINE_ {skip();};
> 
> MSG_name:  .*;
> <><><><> end <><><><>
> 
> When I run this through antlr I get the following error:
> Grammar: src/flowgen.g
> error(201): src/flowgen.g:57:12: The following alternatives can never be matched: 1
>  |---> MSG_name:  .*;
> 
> 1 error
> 
> BUILD FAIL
> (This is courtesy of the antlrv3ide plugin for Eclipse; similar results
> occur with ANTLRWorks.)
> 
> ************ BEGIN BACKGROUND ************
> This language, flowgen, is used to specify Message Sequence Charts. We
> could be using ITU Z.120 for this, but since our local DSL predates
> Z.120 we have some interest in maintaining this language. The flowgen
> language is a simplified version of Z.120 in that the input language
> is simple and direct, and using the flowgen tools one can create the
> corresponding picture (and even the corresponding Z.120 input). [After
> re-reading that, I'm not sure the background helps OTHER than to note
> that it's an old DSL and there is a solid user base not interested in
> moving to another DSL that is overly complicated for the particular
> job at hand.]
> 
> The format of a flowgen input file is simple: The first non-commented
> line is the "title" of the flow, and all subsequent lines represent
> messages in the flow. Newlines separate the constructs.
> 
> Here is an example flowgen input file:
> 
> 	 1. % Here is a comment
> 	 2. Simple flowgen flow
> 	 3. % Show a message going from A to B to C and back.
> 	 4. A	B	Message 1
> 	 5. # This is the first message in the sequence
> 	 6. B	C	Message 2
> 	 7. # This is the next message
> 	 8. C	B
> 	 9. % Note how the above message has no message name
> 	10. B	A	End
> 
> And this is the output of "classic" flowgen.
> 
> Simple flowgen flow    Page: 1
> 
>            A              B              C
>            |              |              |
>            | [1] Message 1|              |
>            |o------------>|              |
>            |              |              |
>            | This is the first message in the sequence
>            |              |              |
>            |              | [2] Message 2|
>            |              |o------------>|
>            |              |              |
>            |              | This is the next message
>            |              |              |
>            |              | [3]          |
>            |              |<------------o|
>            |              |              |
>            | [4] End      |              |
>            |<------------o|              |
>            |              |              |
> 
> Some notes:
> Lines 1 and 9 are "comment" lines and are ignored.
> 
> In this language, there are several constructs that map well to
> grammar-based solutions.
> * A title is the text associated with the first non-commented line.
> * A message is the pair (arrow,comment) where an arrow represents the
> message moving from one place to another and a comment is optional
> text used to describe something about the message.
> * An arrow is the triple (from,to,message_text) where from and to are
> required and represent column names (equivalent to IDs in other
> pedagogic grammars), while message_text is optional and represents the
> "name" of the message.
> * A note is associated with an arrow and is a multi-line construct.
> Each of these lines begins with any number of '#' characters, but it
> is only the text after the '#'s that makes up the note.
> * A comment starts with the % character and continues to the end of
> the line [akin to the C++/Java '//' operator].
> * Blank lines are ignored, independent of context.
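
The line-level rules in the list above can be sketched as a small
classifier, independent of ANTLR. This is plain Java; the class name
FlowgenLineSketch, the TITLE/MSG/NOTE output strings, and the choice to
split message fields on runs of whitespace are assumptions for
illustration, not part of flowgen itself. Input lines are taken without
the "1." style numbering used in the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FlowgenLineSketch {
    /** Classifies raw input lines into TITLE(...), MSG(from,to,text) and NOTE(...) entries. */
    static List<String> classify(List<String> lines) {
        List<String> out = new ArrayList<>();
        boolean titleSeen = false;
        for (String raw : lines) {
            String line = raw.trim();
            // Blank lines and '%' comment lines are ignored in any context.
            if (line.isEmpty() || line.startsWith("%")) {
                continue;
            }
            if (line.startsWith("#")) {
                // Any number of leading '#' characters; only the text after them is the note.
                out.add("NOTE(" + line.replaceFirst("^#+\\s*", "") + ")");
            } else if (!titleSeen) {
                // The first non-commented line is the title.
                titleSeen = true;
                out.add("TITLE(" + line + ")");
            } else {
                // from and to are required; the message text is optional.
                String[] fields = line.split("\\s+", 3);
                if (fields.length < 2) {
                    throw new IllegalArgumentException("message needs from and to: " + line);
                }
                String text = fields.length == 3 ? fields[2] : "";
                out.add("MSG(" + fields[0] + "," + fields[1] + "," + text + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "% Here is a comment",
            "Simple flowgen flow",
            "A\tB\tMessage 1",
            "# This is the first message in the sequence",
            "C\tB");
        for (String entry : classify(input)) {
            System.out.println(entry);
        }
    }
}
```

A hand-written pass like this sidesteps the lexer-state problem because
each line is classified by its first character before any finer
tokenizing happens. It says nothing about how the same split is best
expressed in an ANTLR grammar.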
> 
> ************ END BACKGROUND ************
> 
> Given this understanding, I created the grammar above. I'm not sure
> a) what to do about the error, but more importantly, b) HOW to
> convince an ANTLR grammar to do what I want it to do. In comparison
> with the prior toolset, the LL vs. LR question doesn't bother me.
> However, the way MetaTool handled restrictions on the lexical space
> was to take advantage of lex's "start states". The flowgen grammar has
> become so complicated [I've only given a snapshot; it is much more
> substantial] that we've broken lex and are about to break flex. We
> have a similar problem with yacc/bison, hence the desire to migrate to
> something more robust.
> 
> Thanks for hearing me out and I look forward to your
> recommendations/suggestions.
> 
> Ron Crocker
> Fellow of the Technical Staff
> Motorola
> 
> 
> 
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe:
> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
