[antlr-interest] first steps with a lexer/parser

Fri Jan 4 06:42:31 PST 2008

Hi -

a) WS and NL should get the action
   { $channel = HIDDEN; }
so that the parser never sees them. I'm quite sure that input without spaces, such as
   {a=1}
(see my first mail), should be allowed too.
Then you can remove all references to WS and NL from the parser, and the grammar should look much more like your language definition.

(Wherever you got that definition from: if you invented the language yourself, still write a natural-language specification for it, and steal as much as possible from well-written existing language specifications. The Java and C# specifications are the best ones around for programming languages; for other kinds of languages, I'd at least steal ideas from their whitespace and comment sections. And while I'm on the subject: if you design a language, always allow some form of comments; you will need them even in your tests.)
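
As a rough, untested sketch of a), assuming the token and rule names from your grammar below, the whitespace rules and the parser rules without WS/NL might look like this:

    WS     :    ' '+          { $channel = HIDDEN; } ;
    NL     :    ('\n'|'\r')+  { $channel = HIDDEN; } ;

    msg    :    '{' nameValuePairExpr* '}' -> ^(MSG nameValuePairExpr*) ;

    nameValuePairExpr
           :    NAME '=' valueExpr -> ^(PAIR NAME valueExpr) ;

Spaces inside a quoted string should not be affected, because the lexer matches the whole STR as one token, spaces included, and the WS rule never gets to see them.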

b) In the rule 
    start  :    msg NL? EOF ;
put a ! after EOF: you don't want it in the AST. (Unfortunately, it becomes a null token, see the "null" at the end of your output, which causes trouble now and then; you also get an artificial null root. Both are ugly.)
Also remove the NL? (see a).
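
With both changes, the start rule would shrink to something like this (a sketch, assuming NL has been moved to the HIDDEN channel as described in a)):

    start  :    msg EOF! ;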

c) You are doing the same job twice in the STR rules:

> STR
[...]
>        :    '"' ANYCHAR* '"'
>        ;
> 
> fragment ANYCHAR
>        :    (~'"')+
>        ;

There is a + in ANYCHAR and a * in STR, so the repetition is nested; one of the two is enough. What you want is simply either

STR
[...]
       :    '"' (~'"')* '"'
       ;

or, if you want to keep this ANYCHAR rule,

STR
[...]
       :    '"' ANYCHAR* '"'
       ;

fragment ANYCHAR
       :    ~'"'         // without +
       ;

d) You might also want to capture tabs ('\t') in your WS rule.
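
For example (a sketch only, combined with the HIDDEN channel from a)):

    WS     :    (' '|'\t')+ { $channel = HIDDEN; } ;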

Regards
Harald

-------- Original Message --------
> Date: Fri, 4 Jan 2008 09:25:28 -0500
> From: body <antlr-list at splitbody.com>
> To: "Harald Mueller" <harald_m_mueller at gmx.de>
> CC: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] first steps with a lexer/parser

> i see!
> 
> thank you for your patience, below is the modified lexer/parser.
> 
> so for the input string
> 
> { a=1 b="2" c="t" d="text" e="one two" f={ g="three four" h={ i=5 j="a ha"
> } } }
> 
> it produces
> 
> (MSG (PAIR a 1) (PAIR b 2) (PAIR c t) (PAIR d text) (PAIR e one two)
> (PAIR f (MSG (PAIR g three four) (PAIR h (MSG (PAIR i 5) (PAIR j a
> ha)))))) null
> 
> so now i just have to write the tree grammar to walk it and take
> appropriate action, correct?
> 
> thanks again for your help.
> 
> ----------------------------
> 
> grammar MsgString;
> 
> options { output = AST; }
> 
> tokens {
> 	PAIR;
> 	MSG;
> }
> 
> start  :    msg NL? EOF ;
> 
> msg    :    '{' WS nameValuePairExpr* WS '}' -> ^(MSG nameValuePairExpr*)
> ;
> 
> nameValuePairExpr
>        :    NAME '=' valueExpr WS? -> ^(PAIR NAME valueExpr) ;
> 
> valueExpr
>        :    STR
>        |    INT
>        |    msg
>        ;
> 
> STR
>        @after{
>             setText(getText().substring(1, getText().length()-1));
>        }
>        :    '"' ANYCHAR* '"'
>        ;
> 
> fragment ANYCHAR
>        :    (~'"')+
>        ;
> 
> INT    :    '0'..'9'+ ;
> 
> NAME   :    ('a'..'z'|'A'..'Z'|'0'..'9')+ ;
> 
> WS     :    ' '+ ;
> 
> NL     :    ('\n'|'\r')+ ;
> 
> ----------------------------
> 
> On 1/4/08, Harald Mueller <harald_m_mueller at gmx.de> wrote:
> > > a). it is indeed simpler if i use tokens instead of rules, but then i
> > > cannot strip the double quotes (! don't work unlike in the case of
> > > rules), and getting rid of them explicitly in code seems to be
> > > terribly hacky.
> >
> > No. The correct way is to normalize the token text in the lexer.
> > Everything else is considered hacky in lexer+parser design.
> > (Yes, there is a bug in ANTLR 3.x, as far as I know, so that ! does not
> > work in the lexer right now. Terence promised to work on this sometime
> > "now" - please complain about this!)
> >
> > >
> > > b). i could not simply skip() WS, because then they get removed from
> > > my strings within the quotes (and i want spaces preserved inside
> > > quotes).
> >
> > If this is the only reason for keeping the WS, it shows even more that
> > the decision to do string assembly on the parser level is wrong. Please
> > don't do this. One simple line in the lexer
> >
> >     $text = $text.substring(1,....);
> >
> > or a repaired ANTLR with two tiny !
> >
> >     STRING : '"'! ~('"')* '"'!
> >
> > as opposed to thinking about WS in the grammar at multiple places, where
> > it is (by language definition - at least I assume this) irrelevant: Please
> > go for the time-proven, text-book decision.
> >
> > > or perhaps some sort of a flag that says that if i am inside a
> > > quoted string i do not throw away spaces.
> >
> > If at all, you can re-create the original text from the HIDDEN channel -
> > there, all the characters are preserved.
> >
> > > d). i guess similar to a). i prefer semantic rather than symbolic...
> > > err.. symbols
> >
> > Yeah - here it is perfectly ok to use a sensible name instead of '='.
> >
> > Regards
> > Harald
> >
