[antlr-interest] first steps with a lexer/parser
Harald Mueller
harald_m_mueller at gmx.de
Fri Jan 4 06:42:31 PST 2008
Hi -
a) WS and NL should get the action
{ $channel = HIDDEN; }
so that the parser never sees them, because I'm quite sure that input without any spaces, such as
{a=1}
etc. (see first mail), should also be allowed.
Then you can remove all references to WS and NL from the parser, and the grammar should look much more like your language definition. (Wherever you got that from: if you invented the language yourself, still write a natural-language specification for it, and steal as much as possible from well-written specifications of other languages. Those for Java and C# are the best ones around for programming languages; for other kinds of languages, I'd at least steal ideas from their whitespace and comment sections. And while I'm on the subject: if you design a language, always allow some sort of comments; you will need them even in your tests.)
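As a rough sketch of what a) amounts to (ANTLR 3 syntax, reusing the rule names from the grammar quoted below; untested, so treat it as an illustration rather than a drop-in replacement):

```antlr
// Parser rules: no WS or NL references left.
msg : '{' nameValuePairExpr* '}' -> ^(MSG nameValuePairExpr*)
    ;

nameValuePairExpr
    : NAME '=' valueExpr -> ^(PAIR NAME valueExpr)
    ;

// Lexer rules: whitespace and newlines are still tokenized,
// but sent to the hidden channel, so the parser skips them.
WS : ' '+ { $channel = HIDDEN; } ;

NL : ('\n'|'\r')+ { $channel = HIDDEN; } ;
```

Spaces inside quoted strings are not affected, because the STR lexer rule consumes everything between the two quotes before WS gets a chance to match.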
b) In the rule
start : msg NL? EOF ;
put an ! behind EOF: you don't want it in the AST. (Unfortunately, it becomes a null token there, see the end of your output, which causes trouble every now and then; you also get an artificial null root. Both are ugly.)
(And remove the NL? - see a).)
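With both changes from b) applied, the rule could look like this sketch (ANTLR 3 syntax with output = AST, assuming NL has gone to the hidden channel as in a)):

```antlr
// EOF is still required at the end of the input,
// but the ! suffix keeps it out of the generated AST.
start : msg EOF! ;
```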
c) You do a "double job" in the STR rules:
> STR
[...]
> : '"' ANYCHAR* '"'
> ;
>
> fragment ANYCHAR
> : (~'"')+
> ;
There is a + in ANYCHAR, and a * in STR. What you want is simply either
STR
[...]
: '"' (~'"')* '"'
;
or, if you want to keep this ANYCHAR rule,
STR
[...]
: '"' ANYCHAR* '"'
;
fragment ANYCHAR
: ~'"' // without +
;
d) You might also want to capture tabs ('\t') in your WS rule.
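A possible WS rule for d), as a sketch in ANTLR 3 syntax; the hidden-channel action assumes you also follow a):

```antlr
// Spaces and tabs are both treated as whitespace
// and kept away from the parser.
WS : (' '|'\t')+ { $channel = HIDDEN; } ;
```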
Regards
Harald
-------- Original Message --------
> Date: Fri, 4 Jan 2008 09:25:28 -0500
> From: body <antlr-list at splitbody.com>
> To: "Harald Mueller" <harald_m_mueller at gmx.de>
> CC: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] first steps with a lexer/parser
> i see!
>
> thank you for your patience, below is the modified lexer/parser.
>
> so for the input string
>
> { a=1 b="2" c="t" d="text" e="one two" f={ g="three four" h={ i=5 j="a ha"
> } } }
>
> it produces
>
> (MSG (PAIR a 1) (PAIR b 2) (PAIR c t) (PAIR d text) (PAIR e one two)
> (PAIR f (MSG (PAIR g three four) (PAIR h (MSG (PAIR i 5) (PAIR j a
> ha)))))) null
>
> so now i just have to write the tree grammar to walk it and take
> appropriate action, correct?
>
> thanks again for your help.
>
> ----------------------------
>
> grammar MsgString;
>
> options { output = AST; }
>
> tokens {
> PAIR;
> MSG;
> }
>
> start : msg NL? EOF ;
>
> msg : '{' WS nameValuePairExpr* WS '}' -> ^(MSG nameValuePairExpr*)
> ;
>
> nameValuePairExpr
> : NAME '=' valueExpr WS? -> ^(PAIR NAME valueExpr) ;
>
> valueExpr
> : STR
> | INT
> | msg
> ;
>
> STR
> @after{
> setText(getText().substring(1, getText().length()-1));
> }
> : '"' ANYCHAR* '"'
> ;
>
> fragment ANYCHAR
> : (~'"')+
> ;
>
> INT : '0'..'9'+ ;
>
> NAME : ('a'..'z'|'A'..'Z'|'0'..'9')+ ;
>
> WS : ' '+ ;
>
> NL : ('\n'|'\r')+ ;
>
> ----------------------------
>
> On 1/4/08, Harald Mueller <harald_m_mueller at gmx.de> wrote:
> > > a). it is indeed simpler if i use tokens instead of rules, but then i
> > > cannot strip the double quotes (! don't work unlike in the case of
> > > rules), and getting rid of them explicitly in code seems to be
> > > terribly hacky.
> >
> > No. The correct way is to normalize the token text in the lexer.
> > Everything else is considered hacky in lexer+parser design.
> > (Yes, there is a bug in ANTLR 3.x, as far as I know, so that ! does not
> > work in the lexer right now. Terence promised to work on this sometime
> > "now" - please complain about this!).
> >
> > >
> > > b). i could not simply skip() WS, because then they get removed from
> > > my strings within the quotes (and i want spaces preserved inside
> > > quotes).
> >
> > If this is the only reason for keeping the WS, it shows even more that
> > the decision to do string assembly on the parser level is wrong. Please
> > don't do this. One simple line in the lexer
> >
> > $text = $text.substring(1,....);
> >
> > or a repaired ANTLR with two tiny !
> >
> > STRING : '"'! ~('"')* '"'!
> >
> > as opposed to thinking about WS in the grammar at multiple places, where
> > it is (by language definition - at least I assume this) irrelevant: Please
> > go for the time-proven, text-book decision.
> >
> > > or perhaps some sort of a flag that says that if i am inside a
> > > quoted string i do not throw away spaces.
> >
> > If at all, you can re-create the original text from the HIDDEN channel -
> > there, all the characters are preserved.
> >
> > > d). i guess similar to a). i prefer semantic rather than symbolic...
> > > err.. symbols
> >
> > Yeah - here it is perfectly ok to use a sensible name instead of '='.
> >
> > Regards
> > Harald
> >