[antlr-interest] Newbie problem with line-oriented parsing
Jim Idle
jimi at temporal-wave.com
Mon Feb 15 12:13:38 PST 2010
You can't use .* in the lexer, only a single . (dot). The . rule should be the last one in the lexer and is used only to catch any character that you have not otherwise matched (which usually indicates a spurious character).
Make sure that your lexer rules are not ambiguous - they must not overlap :-)
Jim
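
As a rough illustration of the advice above, here is an editorial sketch in plain Java. It is not ANTLR-generated code and it does not model ANTLR's actual prediction algorithm; the class name CatchAllLexerSketch, the rule names, and the patterns are all assumptions made up for the example. It tries a few non-overlapping rules in order at each position and keeps a single-character ANY rule last, so a stray character becomes its own token instead of one rule swallowing the rest of the input the way .* would.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical first-match scanner; illustrates a catch-all rule placed last. */
final class CatchAllLexerSketch {
    // Rule names and patterns are assumptions for this example only.
    private static final String[] NAMES = {"NUMBER", "NAME", "WS", "NEW_LINE", "ANY"};
    private static final Pattern[] PATTERNS = {
        Pattern.compile("[0-9]+(\\.[0-9]+)?"),
        Pattern.compile("[A-Za-z][A-Za-z0-9_&-]*"),
        Pattern.compile("[ \\t]+"),
        Pattern.compile("[\\r\\n]"),
        // Last rule: exactly one character that nothing above matched.
        Pattern.compile(".", Pattern.DOTALL)
    };

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            // Try the rules in order; the first one that matches at pos wins.
            // ANY always matches, so pos advances on every iteration.
            for (int i = 0; i < PATTERNS.length; i++) {
                Matcher m = PATTERNS[i].matcher(input).region(pos, input.length());
                if (m.lookingAt()) {
                    if (!NAMES[i].equals("WS")) {   // blanks are skipped, like skip()
                        tokens.add(NAMES[i] + ":" + m.group());
                    }
                    pos = m.end();
                    break;
                }
            }
        }
        return tokens;
    }
}
```

For the input "A B 1.5 @" followed by a newline, tokenize returns NAME:A, NAME:B, NUMBER:1.5, ANY:@ and one NEW_LINE token. The '@' is reported as a single spurious character and scanning carries on after it.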
> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Crocker Ron-QA1007
> Sent: Monday, February 15, 2010 12:05 PM
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] Newbie problem with line-oriented parsing
>
> Hi all -
>
> I'm new here, so be nice to me. Further, let me start by apologizing for
> such a verbose first message.
> I have started porting a DSL, one that I've been supporting for 15+
> years, from a lex/yacc-based toolset (via a tool called MetaTool) to ANTLR.
>
> I've been looking through the various materials available on the net and
> have a copy of The Definitive ANTLR Reference. As I started porting the
> grammar (EBNF-ish) I ran into something I don't know how to deal
> with. Unfortunately I need to drag everyone through some background to
> get to the question, but I can start with the grammar I'm struggling
> with and the immediate problem.
>
> <><><><> cut here - flowgen.g <><><><>
> grammar flowgen;
>
> options {
> language = Java;
> }
>
> /* *********** */
> /* TRANSACTION */
> /* *********** */
> transaction: ( ((KEY_START DEFINE_k) => xdefine*) tran_name message+ );
>
> xdefine: KEY_START DEFINE_k ID_name NEW_LINE;
>
> tran_name: ~(KEY_START|NP_START|NEWLINE_) .* NEW_LINE;
>
> message: num1? from_name num2? to_name ((~(NP_START|WHITE|NEWLINE_)) => msg_name?) NEW_LINE;
>
> num1: FLOATnumber;
> num2: FLOATnumber;
>
> from_name: COLUMN_name;
> to_name: COLUMN_name;
>
> msg_name: MSG_name;
>
> // Tokens - keywords
> DEFINE_k: 'DEFINE';
>
> // Tokens - operators
> fragment KEY_START: '$';
> fragment NP_START: '%';
> NEW_LINE: NEWLINE_;
>
> // Tokens - names and numbers
> fragment NUMBER: '0'..'9';
> fragment UPPERCASE: 'A'..'Z';
> fragment VARBASE: UPPERCASE (UPPERCASE|NUMBER|'_')*;
> fragment VARNAME: '$' VARBASE;
> fragment WHITE: ' '|'\t';
> fragment NEWLINE_: '\n'|'\r';
>
> FLOATnumber: NUMBER+ ('.' NUMBER+)?;
>
> ID_name: VARBASE;
> VAR_name: VARNAME;
>
> COLUMN_name: ( (ALPHA|NUMBER) (ALPHA|NUMBER|'_'|'&'|'-')*
> | VARNAME
> );
> // name: <([A-Za-z0-9][A-Za-z0-9_&-]*)?(\$[A-Z][A-Z0-9_]*)*>
>
> WS: (WHITE|NEWLINE_)+ {skip();};
> NON_PRINTING_COMMENT: NP_START .* NEWLINE_ {skip();};
>
> MSG_name: .*;
> <><><><> end <><><><>
>
> When I run this through antlr I get the following error:
> Grammar: src/flowgen.g
> error(201): src/flowgen.g:57:12: The following alternatives can never be matched: 1
> |---> MSG_name: .*;
>
> 1 error
>
> BUILD FAIL
> (this is courtesy of the antlrv3ide plugin for Eclipse; similar results
> occur with ANTLRWorks)
>
> ************ BEGIN BACKGROUND ************
> This language, flowgen, is used to specify Message Sequence Charts. We
> could be using ITU Z.120 for this, but since our local DSL predates
> Z.120 we have some interest in maintaining this language. The flowgen
> language is a simplified version of Z.120 in that the input language is
> simple and direct, and using the flowgen tools one can create the
> corresponding picture (and even the corresponding Z.120 input). [After
> re-reading that, I'm not sure the background helps OTHER than to note
> that it's an old DSL and there is a solid user base not interested in
> moving to another DSL that is overly-complicated for the particular job
> at hand.]
>
> The format of a flowgen input file is simple: The first non-commented
> line is the "title" of the flow, and all subsequent lines represent
> messages in the flow. Newlines separate the constructs.
>
> Here is an example flowgen input file:
>
> 1. % Here is a comment
> 2. Simple flowgen flow
> 3. % Show a message going from A to B to C and back.
> 4. A B Message 1
> 5. # This is the first message in the sequence
> 6. B C Message 2
> 7. # This is the next message
> 8. C B
> 9. % Note how the above message has no message name
> 10. B A End
>
> And this is the output of "classic" flowgen.
>
> Simple flowgen flow                    Page: 1
>
>   A              B              C
>   |              |              |
>   | [1] Message 1|              |
>   |o------------>|              |
>   |              |              |
>   | This is the first message in the sequence
>   |              |              |
>   |              | [2] Message 2|
>   |              |o------------>|
>   |              |              |
>   |              | This is the next message
>   |              |              |
>   |              | [3]          |
>   |              |<------------o|
>   |              |              |
>   | [4] End      |              |
>   |<------------o|              |
>   |              |              |
>
> Some notes:
> Lines 1, 3, and 9 are "comment" lines and are ignored.
>
> In this language, there are several constructs that map well to
> grammar-based solutions.
> * A title is the text associated with the first non-commented line
> * A message is the pair (arrow,comment) where an arrow represents the
> message moving from one place to another and a comment is optional text
> used to describe something about the message.
> * An arrow is the triple (from,to,message_text) where from and to are
> required and represent column names (equivalent to IDs in other
> pedagogic grammars), while message_text is optional and represents the
> "name" of the message.
> * A note is associated with an arrow and is a multi-line construct. Each
> of these lines begins with any number of '#' characters, but only
> the text after the '#'s makes up the note.
> * A comment starts with the % character and continues to the end of the
> line [akin to the C++/Java '//' operator]
> * Blank lines are ignored, independent of context.
>
> ************ END BACKGROUND ************
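
The constructs listed in the background can be sketched as a line classifier. This is an editorial illustration in plain Java, not part of the original grammar and not ANTLR code; the class name FlowgenLineSketch and its output format are hypothetical. It assumes that each line is trimmed before classification, that a '#' line is always a note even if it appears before the title, and it leaves out the optional numbers (num1, num2) and the $ variables that the grammar mentions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical line classifier for the flowgen constructs described above. */
final class FlowgenLineSketch {
    /** Returns one readable entry per line that is not blank or a comment. */
    static List<String> classify(String source) {
        List<String> out = new ArrayList<>();
        boolean titleSeen = false;
        for (String raw : source.split("\\r?\\n|\\r")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("%")) {
                continue; // blank lines and % comments are ignored
            }
            if (line.startsWith("#")) {
                // The note text is whatever follows the leading run of '#'.
                out.add("NOTE(" + line.replaceFirst("^#+", "").trim() + ")");
            } else if (!titleSeen) {
                // The first line that is neither blank, comment, nor note.
                titleSeen = true;
                out.add("TITLE(" + line + ")");
            } else {
                // Arrow: from, to, then optional message text to end of line.
                String[] parts = line.split("[ \\t]+", 3);
                if (parts.length < 2) {
                    out.add("ERROR(" + line + ")");
                    continue;
                }
                String text = parts.length == 3 ? parts[2] : "";
                out.add("ARROW(" + parts[0] + "," + parts[1] + "," + text + ")");
            }
        }
        return out;
    }
}
```

Run on the ten-line example file (without the reference line numbers), this produces a TITLE entry for "Simple flowgen flow", then ARROW(A,B,Message 1), its NOTE, ARROW(B,C,Message 2), its NOTE, ARROW(C,B,) with empty message text, and ARROW(B,A,End).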
>
> Given this understanding, I created the grammar above. I'm not sure (a)
> what to do about the error, and more importantly (b) HOW to convince an
> ANTLR grammar to do what I want it to do. In comparison with the prior
> toolset, the LL vs. LR question doesn't bother me. However, the way
> MetaTool handled restrictions on the lexical space was to take advantage
> of lex's "start states". The flowgen grammar has become so complicated
> [I've only given a snapshot; it is much more substantial] that we've
> broken lex and are about to break flex. We have a similar problem with
> yacc/bison, hence the desire to migrate to something more robust.
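
To make the "start states" idea concrete, here is an editorial sketch in plain Java of a scanner whose tokenization depends on an explicit state. It is only an illustration of the general technique; it is not how lex, flex, or MetaTool implement start states, and it is not ANTLR code. The class name StartStateSketch and the state names FROM, TO, and REST are hypothetical. In the first two states a run of non-blank characters is a column name; in the REST state everything up to the end of the line is the message name.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical state-driven scanner for one flowgen message line. */
final class StartStateSketch {
    private enum State { FROM, TO, REST }

    private static boolean isBlank(char c) {
        return c == ' ' || c == '\t';
    }

    static List<String> scanLine(String line) {
        List<String> tokens = new ArrayList<>();
        State state = State.FROM;
        int pos = 0;
        int n = line.length();
        while (pos < n) {
            // Blanks between tokens are skipped in every state.
            while (pos < n && isBlank(line.charAt(pos))) {
                pos++;
            }
            if (pos >= n) {
                break;
            }
            if (state == State.REST) {
                // In REST the same characters are no longer split on blanks.
                tokens.add("MSG_name:" + line.substring(pos).trim());
                pos = n;
            } else {
                int start = pos;
                while (pos < n && !isBlank(line.charAt(pos))) {
                    pos++;
                }
                tokens.add("COLUMN_name:" + line.substring(start, pos));
                // Advance the state: FROM -> TO -> REST.
                state = (state == State.FROM) ? State.TO : State.REST;
            }
        }
        return tokens;
    }
}
```

For "B C Message 2" this gives COLUMN_name:B, COLUMN_name:C, MSG_name:Message 2, and for "C B" it gives just the two column names. The word "Message" would match the column-name pattern too; it is the state, not the pattern, that decides how it is tokenized.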
>
> Thanks for hearing me out and I look forward to your
> recommendations/suggestions.
>
> Ron Crocker
> Fellow of the Technical Staff
> Motorola