[antlr-interest] first steps with a lexer/parser

Fri Jan 4 02:02:14 PST 2008

> a). it is indeed simpler if i use tokens instead of rules, but then i
> cannot strip the double quotes (! don't work unlike in the case of
> rules), and getting rid of them explicitly in code seems to be
> terribly hacky.

No. The correct way is to normalize the token text in the lexer. Everything else is considered hacky in lexer+parser design.
(Yes, there is a bug in ANTLR 3.x, as far as I know, so that ! does not work in the lexer right now. Terence promised to work on this sometime "now" - please complain about this!).

> 
> b). i could not simply skip() WS, because then they get removed from
> my strings within the quotes (and i want spaces preserved inside
> quotes). 

If this is the only reason for keeping the WS, it shows even more that the decision to do string assembly on the parser level is wrong. Please don't do this. One simple line in the lexer

    $text = $text.substring(1,....);

or a repaired ANTLR with two tiny !

    STRING : '"'! ~('"')* '"'! ;

as opposed to thinking about WS at multiple places in the grammar, where it is irrelevant (by language definition - at least I assume so). Please go for the time-proven, textbook decision.
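
A minimal sketch of that lexer-level approach, assuming ANTLR 3.x with the Java target. setText()/getText() stand in for the $text assignment as long as ! is not available in lexer rules, and the parser rules are the ones from the original grammar with the WS references dropped:

```antlr
// Sketch only, assuming ANTLR 3.x with the Java target.

// Parser rules no longer mention WS anywhere.
msg :   '{' nameValuePairExpr* '}' -> ^(MSG nameValuePairExpr*) ;

nameValuePairExpr
    :   NAME '=' valueExpr -> ^(PAIR NAME valueExpr) ;

// The whole quoted string is one token, so blanks inside the quotes
// are part of STRING and are never matched by WS.
STRING
    :   '"' ~('"')* '"'
        // Normalize the token text in the lexer: drop the first and last character.
        { setText(getText().substring(1, getText().length() - 1)); }
    ;

// Blanks between tokens go to the hidden channel; the parser does not see them.
WS  :   ' '+ { $channel = HIDDEN; } ;
```

With this, valueExpr would refer to STRING directly instead of the quotedString rule.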

> or perhaps some sort of a flag that says that if i am inside a
> quoted string i do not throw away spaces.

If at all, you can re-create the original text from the HIDDEN channel - there, all the characters are preserved.
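
A sketch of that, assuming ANTLR 3.x with the Java target, where $text in a parser rule is built from all tokens between the rule's first and last token, off-channel ones included. Note that a token whose text was changed in the lexer (e.g. via setText()) contributes its changed text, not the original characters:

```antlr
// Sketch only, assuming ANTLR 3.x with the Java target.
// $text spans the tokens matched by this rule so far, including
// the hidden-channel tokens that lie between them.
msg :   '{' nameValuePairExpr* '}'
        { System.out.println($text); }   // text of the matched tokens, hidden blanks included
    ;
```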

> d). i guess similar to a). i prefer semantic rather than symbolic...
> err.. symbols

Yeah - here it is perfectly ok to use a sensible name instead of '='.
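
One way to do this, sketched under the assumption that the tokens section of ANTLR 3.x is used to name the literal. The name ASSIGN is only an illustration and does not come from the original grammar:

```antlr
// Sketch only, assuming ANTLR 3.x with output = AST.
tokens {
    ASSIGN = '=';   // hypothetical name for the '=' literal
}

nameValuePairExpr
    :   NAME ASSIGN^ valueExpr   // the ASSIGN token becomes the subtree root
    ;
```

A tree grammar could then match ^(ASSIGN NAME valueExpr) instead of referring to the '=' literal.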

Regards
Harald

> 
> thanks again for the pointers, i will keep digging.
> 
> -a
> 
> On 1/3/08, Harald Mueller <harald_m_mueller at gmx.de> wrote:
> > Hi -
> >
> > a) A quoted string should be a token, IMO, not a rule (except ... see
> > the thread on parsing BSDL where we quarrel about "structured string parsing"
> > ... but this would not be "first steps").
> > I am constantly unsure whether ! works in lexer rules - so, if you want
> > to strip the " and it does NOT work, first complain to Terence; and then
> > do something like
> >     $text = $text.Trim('\"'); // in C#
> > or
> >     $text = $text.substring(1,$text.length()-1); // in Java
> >
> > b) Are you really sure that whitespace is that significant? According to
> > your grammar,
> >
> > {a=1}
> >
> > is not allowed: You require a WS after { and before } - and WS is at
> > least one blank. Also, { a = 1 } would be wrong: No WS around = ...
> > Almost all languages I know *ignore* whitespace. In ANTLR, you do this
> > by sending the WS tokens to the HIDDEN channel via { $channel = HIDDEN; }.
> >
> > c) There is no good reason to have artificial roots for single tokens -
> > instead of ^(INT_VAL INT), just use the INT; same for STR_VAL.
> >
> > d) Also for the '=', I would not add an artificial symbol, but simply
> > use the '=' as root:
> >
> >      ...: NAME '='^ valueExpr;
> >
> > - but this is a matter of taste, I'd say.
> >
> > Regards
> > Harald
> >
> > -------- Original Message --------
> > > Date: Thu, 3 Jan 2008 08:40:38 -0500
> > > From: body <antlr-list at splitbody.com>
> > > To: antlr-interest at antlr.org
> > > Subject: [antlr-interest] first steps with a lexer/parser
> >
> > > hello,
> > >
> > > i am trying to deal with the messages that look like this:
> > >
> > > { a=1 b="2" c="t" d="stuff" e="one two" f={ g="three four" h={ i=5
> > > j="a ha" } } }
> > >
> > > below is my lexer/parser. it seems to work and emit proper-looking
> > > tree, but i want to run it by you, because it does not feel right.
> > >
> > > it seems like i should be using fragments somewhere, also i cannot
> > > figure out how to build a proper tree grammar out of it.
> > >
> > > any suggestions appreciated.
> > >
> > > thank you.
> > >
> > > -----------------
> > > grammar MsgString;
> > >
> > > options { output = AST; }
> > >
> > > tokens {
> > >       PAIR;
> > >       MSG;
> > >       STR_VAL;
> > >       INT_VAL;
> > > }
> > >
> > > start  :    msg NL? EOF -> ^(MSG msg) ;
> > >
> > > msg    :    '{' WS nameValuePairExpr* WS '}' -> ^(MSG nameValuePairExpr*)
> > > ;
> > >
> > > nameValuePairExpr
> > >        :    NAME '=' valueExpr WS? -> ^(PAIR NAME valueExpr) ;
> > >
> > > valueExpr
> > >        :    quotedString -> ^(STR_VAL quotedString)
> > >        |    INT -> ^(INT_VAL INT)
> > >        |    msg
> > >        ;
> > >
> > > quotedString
> > >        :    '"'! .* '"'!
> > >        ;
> > >
> > > INT    :    '0'..'9'+ ;
> > >
> > > NAME   :    ('a'..'z'|'A'..'Z'|'0'..'9')+ ;
> > >
> > > WS     :    ' '+ ;
> > >
> > > NL     :    ('\n'|'\r')+ ;
> > > -----------------
> >