[antlr-interest] Newbie problem with line-oriented parsing

Mon Feb 15 12:13:38 PST 2010

You can't use '.*' in the lexer, only '.'. The '.' rule should be the last one in the lexer and is just used to catch any character that you have not otherwise matched (which usually indicates a spurious character).

Make sure that your lexer rules are not ambiguous - they must not overlap :-)
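
As a rough sketch of what that could look like in ANTLR 3 syntax (Java target assumed): the rule name ANY is hypothetical, and NON_PRINTING_COMMENT is borrowed from the grammar quoted below, with '%' written inline where the original uses the NP_START fragment. It is an illustration of the idea, not a tested fix for the whole grammar.

```antlr
// Match a '%' comment with a negated character set instead of '.*'.
// The loop stops before the line terminator, which is left for
// whichever rule handles newlines.
NON_PRINTING_COMMENT : '%' ~('\n'|'\r')* { skip(); } ;

// Catch-all for any single character no other rule matched.
// Keep this as the last rule in the lexer.
ANY : . ;
```

The negated set says explicitly where the token ends, so the rule no longer depends on what follows a wildcard loop.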

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Crocker Ron-QA1007
> Sent: Monday, February 15, 2010 12:05 PM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] Newbie problem with line-oriented parsing
> 
> Hi all -
> 
> I'm new here, so be nice to me. Further, let me start by apologizing for
> such a verbose first message.
> I have started porting a DSL, one that I've been supporting for 15+
> years, from a lex/yacc-based toolset (via a tool called MetaTool) to ANTLR.
> 
> I've been looking through the various materials available on the net and
> have a copy of The Definitive ANTLR Reference. As I started porting the
> grammar (EBNF-ish), I ran into something I don't know how to deal
> with. Unfortunately I need to drag everyone through some background to
> get to the question; however, I can start with the grammar I'm struggling
> with and the immediate problem.
> 
> <><><><> cut here - flowgen.g <><><><>
> grammar flowgen;
> 
> options {
>   language = Java;
> }
> 
> /* *********** */
> /* TRANSACTION */
> /* *********** */
> transaction:  ( ((KEY_START DEFINE_k) => xdefine*) tran_name message+ );
> 
> xdefine: KEY_START DEFINE_k ID_name NEW_LINE;
> 
> tran_name: ~(KEY_START|NP_START|NEWLINE_) .* NEW_LINE;
> 
> message:  num1? from_name num2? to_name ((~(NP_START|WHITE|NEWLINE_)) => msg_name?) NEW_LINE;
> 
> num1: FLOATnumber;
> num2: FLOATnumber;
> 
> from_name: COLUMN_name;
> to_name:   COLUMN_name;
> 
> msg_name: MSG_name;
> 
> // Tokens - keywords
> DEFINE_k:       'DEFINE';
> 
> // Tokens - operators
> fragment KEY_START: '$';
> fragment NP_START:  '%';
> NEW_LINE: NEWLINE_;
> 
> // Tokens - names and numbers
> fragment NUMBER:    '0'..'9';
> fragment UPPERCASE: 'A'..'Z';
> fragment VARBASE:   UPPERCASE (UPPERCASE|NUMBER|'_')*;
> fragment VARNAME:   '$' VARBASE;
> fragment WHITE:     ' '|'\t';
> fragment NEWLINE_:  '\n'|'\r';
> 
> FLOATnumber: NUMBER+ ('.' NUMBER+)?;
> 
> ID_name:  VARBASE;
> VAR_name: VARNAME;
> 
> COLUMN_name: ( (ALPHA|NUMBER) (ALPHA|NUMBER|'_'|'&'|'-')*
>              | VARNAME
>              );
> //  name:   <([A-Za-z0-9][A-Za-z0-9_&-]*)?(\$[A-Z][A-Z0-9_]*)*>
> 
> WS:	(WHITE|NEWLINE_)+ {skip();};
> NON_PRINTING_COMMENT: NP_START .* NEWLINE_ {skip();};
> 
> MSG_name:  .*;
> <><><><> end <><><><>
> 
> When I run this through antlr I get the following error:
> Grammar: src/flowgen.g
> error(201): src/flowgen.g:57:12: The following alternatives can never be matched: 1
>  |---> MSG_name:  .*;
> 
> 1 error
> 
> BUILD FAIL
> (this is compliments of antlrv3ide plugin for eclipse; similar results
> occur with ANTLRworks)
> 
> ************ BEGIN BACKGROUND ************
> This language, flowgen, is used to specify Message Sequence Charts. We
> could be using ITU Z.120 for this, but since our local DSL predates
> Z.120 we have some interest in maintaining this language. The flowgen
> language is a simplified version of Z.120 in that the input language is
> simple and direct, and using the flowgen tools one can create the
> corresponding picture (and even the corresponding Z.120 input). [After
> re-reading that, I'm not sure the background helps OTHER than to note
> that it's an old DSL and there is a solid user base not interested in
> moving to another DSL that is overly-complicated for the particular job
> at hand.]
> 
> The format of a flowgen input file is simple: The first non-commented
> line is the "title" of the flow, and all subsequent lines represent
> messages in the flow. Newlines separate the constructs.
> 
> Here is an example flowgen input file:
> 
> 	 1. % Here is a comment
> 	 2. Simple flowgen flow
> 	 3. % Show a message going from A to B to C and back.
> 	 4. A	B	Message 1
> 	 5. # This is the first message in the sequence
> 	 6. B	C	Message 2
> 	 7. # This is the next message
> 	 8. C	B
> 	 9. % Note how the above message has no message name
> 	10. B	A	End
> 
> And this is the output of "classic" flowgen.
> 
> Simple flowgen flow    Page: 1
> 
>            A              B              C
>            |              |              |
>            | [1] Message 1|              |
>            |o------------>|              |
>            |              |              |
>            | This is the first message in the sequence
>            |              |              |
>            |              | [2] Message 2|
>            |              |o------------>|
>            |              |              |
>            |              | This is the next message
>            |              |              |
>            |              | [3]          |
>            |              |<------------o|
>            |              |              |
>            | [4] End      |              |
>            |<------------o|              |
>            |              |              |
> 
> Some notes:
> Lines 1, 3, and 9 are "comment" lines and are ignored.
> 
> In this language, there are several constructs that map well to
> grammar-based solutions.
> * A title is the text associated with the first non-commented line
> * A message is the pair (arrow,comment) where an arrow represents the
> message moving from one place to another and a comment is optional text
> used to describe something about the message.
> * An arrow is the triple (from,to,message_text) where from and to are
> required and represent column names (equivalent to IDs in other
> pedagogic grammars), while message_text is optional and represents the
> "name" of the message.
> * A note is associated with an arrow and is a multi-line construct. Each
> of these lines begins with any number of '#' characters, but it is only
> the text after the '#'s that makes up the note.
> * A comment starts with the % character and continues to the end of the
> line [akin to the C++/Java '//' operator]
> * Blank lines are ignored, independent of context.
> 
> ************ END BACKGROUND ************
> 
> Given this understanding, I created the grammar above. I'm not sure a)
> what to do about the error, but more importantly, b) HOW to convince an
> ANTLR grammar to do what I want it to do. In comparison with the prior
> toolset, the LL vs. LR question doesn't bother me. However, the way
> MetaTool handled restrictions on the lexical space was to take advantage
> of lex's "start states". The flowgen grammar has become so complicated
> [I've only given a snapshot; it is much more substantial] that we've
> broken lex and are about to break flex. We have a similar problem with
> yacc/bison, hence the desire to migrate to something more robust.
> 
> Thanks for hearing me out and I look forward to your
> recommendations/suggestions.
> 
> Ron Crocker
> Fellow of the Technical Staff
> Motorola
> 