[antlr-interest] first steps with a lexer/parser

Fri Jan 4 06:25:28 PST 2008

i see!

thank you for your patience, below is the modified lexer/parser.

so for the input string

{ a=1 b="2" c="t" d="text" e="one two" f={ g="three four" h={ i=5 j="a ha" } } }

it produces

(MSG (PAIR a 1) (PAIR b 2) (PAIR c t) (PAIR d text) (PAIR e one two)
(PAIR f (MSG (PAIR g three four) (PAIR h (MSG (PAIR i 5) (PAIR j a ha)))))) null

so now i just have to write the tree grammar to walk it and take
appropriate action, correct?

thanks again for your help.

----------------------------

grammar MsgString;

options { output = AST; }

tokens {
	PAIR;
	MSG;
}

start  :    msg NL? EOF ;

msg    :    '{' WS nameValuePairExpr* WS '}' -> ^(MSG nameValuePairExpr*) ;

nameValuePairExpr
       :    NAME '=' valueExpr WS? -> ^(PAIR NAME valueExpr) ;

valueExpr
       :    STR
       |    INT
       |    msg
       ;

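STR
       @after{
            // strip the surrounding quotes so the token text is just the content
            setText(getText().substring(1, getText().length()-1));
       }
       :    '"' ANYCHAR* '"'
       ;

// one non-quote character; the repetition lives in STR, not here
fragment ANYCHAR
       :    ~'"'
       ;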

INT    :    '0'..'9'+ ;

NAME   :    ('a'..'z'|'A'..'Z'|'0'..'9')+ ;

WS     :    ' '+ ;

NL     :    ('\n'|'\r')+ ;

----------------------------
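
For the next step mentioned above (walking the tree and taking action), here is a minimal Java sketch of the traversal logic. It does not use the ANTLR runtime: Node, Kind, leaf, pair, msg and walkMsg are hypothetical names invented for illustration, and the sketch assumes the AST has the ^(MSG ^(PAIR name value)*) shape printed earlier. A real tree grammar would put the equivalent actions in its own rules.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MsgTreeWalk {
    // Hypothetical node kinds mirroring the imaginary tokens MSG and PAIR.
    public enum Kind { MSG, PAIR, LEAF }

    // Hypothetical stand-in for an AST node (not the ANTLR runtime class).
    public static final class Node {
        final Kind kind;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(Kind kind, String text, Node... kids) {
            this.kind = kind;
            this.text = text;
            for (Node k : kids) children.add(k);
        }
    }

    public static Node leaf(String text) { return new Node(Kind.LEAF, text); }

    public static Node pair(String name, Node value) {
        return new Node(Kind.PAIR, "PAIR", leaf(name), value);
    }

    public static Node msg(Node... pairs) { return new Node(Kind.MSG, "MSG", pairs); }

    // Converts a MSG node into an ordered map from name to value;
    // a nested MSG value becomes a nested map, anything else stays a string.
    public static Map<String, Object> walkMsg(Node msg) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Node p : msg.children) {
            String name = p.children.get(0).text;
            Node value = p.children.get(1);
            out.put(name, value.kind == Kind.MSG ? walkMsg(value) : value.text);
        }
        return out;
    }

    public static void main(String[] args) {
        Node tree = msg(
            pair("a", leaf("1")),
            pair("e", leaf("one two")),
            pair("f", msg(pair("g", leaf("three four")))));
        System.out.println(walkMsg(tree)); // prints {a=1, e=one two, f={g=three four}}
    }
}
```

Keeping the node kind separate from the node text means a string value that happens to read "MSG" is not mistaken for a nested message.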

On 1/4/08, Harald Mueller <harald_m_mueller at gmx.de> wrote:
> > a). it is indeed simpler if i use tokens instead of rules, but then i
> > cannot strip the double quotes (! don't work unlike in the case of
> > rules), and getting rid of them explicitly in code seems to be
> > terribly hacky.
>
> No. The correct way is to normalize the token text in the lexer. Everything else is considered hacky in lexer+parser design.
> (Yes, there is a bug in ANTLR 3.x, as far as I know, so that ! does not work in the lexer right now. Terence promised to work on this sometime "now" - please complain about this!).
>
> >
> > b). i could not simply skip() WS, because then they get removed from
> > my strings within the quotes (and i want spaces preserved inside
> > quotes).
>
> If this is the only reason for keeping the WS, it shows even more that the decision to do string assembly on the parser level is wrong. Please don't do this. One simple line in the lexer:
>
>     $text = $text.substring(1,....);
>
> or a repaired ANTLR with two tiny !
>
>     STRING : '"'! ~('"')* '"'! ;
>
> as opposed to thinking about WS in the grammar at multiple places, where it is (by language definition - at least I assume this) irrelevant: Please go for the time-proven, text-book decision.
>
> > or perhaps some sort of a flag that says that if i am inside a
> > quoted string i do not throw away spaces.
>
> If at all, you can re-create the original text from the HIDDEN channel - there, all the characters are preserved.
>
> > d). i guess similar to a). i prefer semantic rather than symbolic...
> > err.. symbols
>
> Yeah - here it is perfectly ok to use a sensible name instead of '='.
>
> Regards
> Harald
>
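
To illustrate the point in the reply quoted above (whitespace can be dropped in the lexer without touching the spaces inside quoted strings, because a string is consumed as a single token), here is a small hand-written Java sketch. It is not ANTLR-generated code: MsgLexerSketch, tokenize and isNameChar are hypothetical names, and the token set is assumed from the grammar in this post.

```java
import java.util.ArrayList;
import java.util.List;

public class MsgLexerSketch {
    // Same character set as the NAME/INT rules in the grammar: ASCII letters and digits.
    static boolean isNameChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    // Splits the input into token texts. Whitespace between tokens is dropped,
    // but a quoted string is read in one go, so the spaces inside it survive.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == ' ' || c == '\n' || c == '\r') {
                i++; // plays the role of skip() on a WS token
            } else if (c == '"') {
                int close = input.indexOf('"', i + 1);
                if (close < 0) {
                    throw new IllegalArgumentException("unterminated string at " + i);
                }
                // Normalize the token text here: keep the content, drop the quotes.
                tokens.add(input.substring(i + 1, close));
                i = close + 1;
            } else if (isNameChar(c)) {
                int start = i;
                while (i < input.length() && isNameChar(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            } else if (c == '{' || c == '}' || c == '=') {
                tokens.add(String.valueOf(c));
                i++;
            } else {
                throw new IllegalArgumentException("unexpected character '" + c + "' at " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [{, e, =, one two, f, =, {, i, =, 5, }, }]
        System.out.println(tokenize("{ e=\"one two\" f={ i=5 } }"));
    }
}
```

With the quotes stripped and whitespace discarded at this level, the parser rules would no longer need to mention WS at all.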