[antlr-interest] first steps with a lexer/parser

Thu Jan 3 06:59:11 PST 2008

Hi -

a) A quoted string should be a token, IMO, not a rule (except ... see the thread on parsing BSDL where we quarrel about "structured string parsing" ... but this would not be "first steps").
(I am constantly unsure whether ! works in lexer rules - so, if you want to strip the " and it does NOT work, first complain to Terence; and then do something like
    $text = $text.Trim('\"'); // in C#
or
    $text = $text.substring(1, $text.length() - 1); // in Java
in the action of the string token.)
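To make (a) concrete, here is a rough sketch of a quoted string as a lexer token with the quotes stripped in an action. It assumes ANTLR 3 with the Java target; the token name STRING is made up for illustration, and it makes no attempt to handle escaped quotes inside the string.

```antlr
// Sketch only: STRING is a hypothetical token name, Java target assumed.
// Matches an opening quote, any run of non-quote characters, and a closing quote.
STRING
    :   '"' ( ~'"' )* '"'
        {
            // Drop the first and last character (the quotes) from the token text.
            setText(getText().substring(1, getText().length() - 1));
        }
    ;
```

With such a token, valueExpr would refer to STRING directly and the quotedString parser rule would no longer be needed.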

b) Are you really sure that whitespace is that significant? According to your grammar,

{a=1}

is not allowed: you require a WS after { and before } - and WS is at least one blank. Also, { a = 1 } would be rejected, because your grammar allows no WS around = ...
Almost all languages I know *ignore* whitespace. In ANTLR, you do this by sending the WS tokens to the HIDDEN channel via { $channel = HIDDEN; }.
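As a sketch of what that could look like for this grammar (ANTLR 3 assumed; folding the newlines into WS is my assumption, not something the original grammar does), the WS references disappear from the parser rules once the lexer hides the whitespace:

```antlr
// Sketch only: parser rules no longer mention WS at all.
msg :   '{' nameValuePairExpr* '}' -> ^(MSG nameValuePairExpr*) ;

nameValuePairExpr
    :   NAME '=' valueExpr -> ^(PAIR NAME valueExpr) ;

// Whitespace is still tokenized, but sent to the hidden channel,
// so the parser never sees it. Treating '\t', '\r' and '\n' as
// whitespace here is an assumption for illustration.
WS  :   ( ' ' | '\t' | '\r' | '\n' )+ { $channel = HIDDEN; } ;
```

With this, {a=1}, { a=1 } and { a = 1 } should all be accepted in the same way. If newlines go to the hidden channel like this, the separate NL token and the NL? in the start rule become unnecessary.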

c) There is no good reason to have artificial roots for single tokens - instead of ^(INT_VAL INT), just use the INT; same for STR_VAL.
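A sketch of valueExpr without the artificial roots, assuming output = AST as in the original grammar and assuming the quoted string has become a token (called STRING here purely for illustration):

```antlr
// Sketch only: each alternative contributes its own node to the tree,
// so no rewrite is needed. STRING is a hypothetical token name.
valueExpr
    :   STRING
    |   INT
    |   msg
    ;
```

The token type (STRING or INT) already tells a tree grammar which kind of value it is looking at, so STR_VAL and INT_VAL can be dropped from the tokens section.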

d) Also for the '=', I would not add an artificial symbol, but simply use the '=' as root:

     ...: NAME '='^ valueExpr;

- but this is a matter of taste, I'd say.

Regards
Harald

-------- Original Message --------
> Date: Thu, 3 Jan 2008 08:40:38 -0500
> From: body <antlr-list at splitbody.com>
> To: antlr-interest at antlr.org
> Subject: [antlr-interest] first steps with a lexer/parser

> hello,
> 
> i am trying to deal with the messages that look like this:
> 
> { a=1 b="2" c="t" d="stuff" e="one two" f={ g="three four" h={ i=5
> j="a ha" } } }
> 
> below is my lexer/parser. it seems to work and emit proper-looking
> tree, but i want to run it by you, because it does not feel right.
> 
> it seems like i should be using fragments somewhere, also i cannot
> figure out how to build a proper tree grammar out of it.
> 
> any suggestions appreciated.
> 
> thank you.
> 
> -----------------
> grammar MsgString;
> 
> options { output = AST; }
> 
> tokens {
> 	PAIR;
> 	MSG;
> 	STR_VAL;
> 	INT_VAL;
> }
> 
> start  :    msg NL? EOF -> ^(MSG msg) ;
> 
> msg    :    '{' WS nameValuePairExpr* WS '}' -> ^(MSG nameValuePairExpr*)
> ;
> 
> nameValuePairExpr
>        :    NAME '=' valueExpr WS? -> ^(PAIR NAME valueExpr) ;
> 
> valueExpr
>        :    quotedString -> ^(STR_VAL quotedString)
>        |    INT -> ^(INT_VAL INT)
>        |    msg
>        ;
> 
> quotedString
>        :    '"'! .* '"'!
>        ;
> 
> INT    :    '0'..'9'+ ;
> 
> NAME   :    ('a'..'z'|'A'..'Z'|'0'..'9')+ ;
> 
> WS     :    ' '+ ;
> 
> NL     :    ('\n'|'\r')+ ;
> -----------------
