[antlr-interest] first steps with a lexer/parser

body antlr-list at splitbody.com
Fri Jan 4 07:03:12 PST 2008


thanks again for the quick response, it really, really helps.

> a) WS and NL should get a marker
>    { $channel = HIDDEN; }
> so that the parser does not even see them - because I'm quite sure that also
>    {a=1}
> etc. (see first mail) should be allowed.
> And then you can remove all references to WS and NL from the parser! - the language should then look much more like your language definition ---skipped ---

all great points! luckily the input is an incoming data file, so i
just replicated its format. but you are right, so i put WS on the
hidden channel, and it made the grammar much simpler without losing
the spaces inside my strings.

here's a question - what would i have to change if i had escaped
quotes inside of the string (\")? then surely i would have to use .*
to match the string, and then do something different inside of it.
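
for what it's worth, a .* may not be needed: a pattern i have seen for
this is to let the string body be either a backslash followed by any
character, or any character that is neither a quote nor a backslash,
roughly '"' ( '\\' . | ~('\\'|'"') )* '"' in lexer terms (untested
against this grammar). below is a rough java sketch of the same idea
using java.util.regex instead of ANTLR; the class and method names are
made up for illustration, and it assumes a backslash simply escapes
the next character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedStringSketch {
    // opening quote, then any number of either
    //   - a backslash followed by any one character (an escape), or
    //   - one character that is neither a quote nor a backslash,
    // then the closing quote
    static final Pattern STR = Pattern.compile("\"((?:\\\\.|[^\"\\\\])*)\"");

    // returns the text between the quotes with escapes resolved,
    // or null if the input does not start with a string literal
    static String readString(String input) {
        Matcher m = STR.matcher(input);
        if (!m.lookingAt()) {
            return null;
        }
        // drop the backslash in front of every escaped character
        return m.group(1).replaceAll("\\\\(.)", "$1");
    }

    public static void main(String[] args) {
        // the java literal below is the input text: "say \"hi\" now" rest
        System.out.println(readString("\"say \\\"hi\\\" now\" rest"));
    }
}
```

the point is that the escaped quote is consumed as part of an escape
pair, so it never gets a chance to close the string.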

>
> b) In the rule
>     start  :    msg NL? EOF ;
> put a ! after EOF: You don't want this in the AST (unfortunately, it becomes a null Token, see the end of your output, which causes trouble now and then; you also get an artificial null root. Both are ugly).
> (and remove the NL? - see a)).

ah! i was wondering about that null! and i forgot about hidden channel
for NL - good point.

>
> c) You do a "double job" in the STR rules:
>
> > STR
> [...]
> >        :    '"' ANYCHAR* '"'
> >        ;
> >
> > fragment ANYCHAR
> >        :    (~'"')+
> >        ;
>
> There is a + in ANYCHAR, and a * in STR. What you want is simply either
>
>  STR
> [...]
>         :    '"' (~'"')* '"'
>         ;
>
> or, if you want to keep this ANYCHAR rule,
>
> STR
> [...]
>        :    '"' ANYCHAR* '"'
>        ;
>
> fragment ANYCHAR
>        :    ~'"'         // without +
>        ;

yes, you are right, both the former and the latter seem to work.

>
> d) You might also want to capture tabs ('\t') in your WS rule.

done, thank you.

------------------------

grammar MsgString;

options { output = AST; }

tokens {
	PAIR;
	MSG;
}

start  :    msg EOF! ;

msg    :    '{' nameValuePairExpr* '}' -> ^(MSG nameValuePairExpr*) ;

nameValuePairExpr
       :    NAME '=' valueExpr -> ^(PAIR NAME valueExpr) ;

valueExpr
       :    STR
       |    INT
       |    msg
       ;

STR
       @after{
            setText(getText().substring(1, getText().length()-1));
       }
       :    '"' ~'"'* '"'
       ;

INT    :    '0'..'9'+ ;

NAME   :    ('a'..'z'|'A'..'Z'|'0'..'9')+ ;

WS     :    (' '|'\t')+ { $channel = HIDDEN; } ;

NL     :    ('\n'|'\r')+ { $channel = HIDDEN; } ;

------------------------

and input/output:

{ a=1 b="2" c="t" d="text" e="one two" f={ g="three four" h={ i=5 j="a ha" } } }

(MSG (PAIR a 1) (PAIR b 2) (PAIR c t) (PAIR d text) (PAIR e one two) (PAIR f (MSG (PAIR g three four) (PAIR h (MSG (PAIR i 5) (PAIR j a ha))))))
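
to make it clearer what the grammar does with that input, here is a
small hand-written java sketch of the same rules. it is not the parser
ANTLR generates and does not use the ANTLR runtime; the class name and
error messages are made up, and it only mimics the printed tree format
shown above.

```java
class MsgStringSketch {
    private final String src;
    private int pos = 0;

    private MsgStringSketch(String src) { this.src = src; }

    // start : msg EOF!
    static String parse(String input) {
        MsgStringSketch p = new MsgStringSketch(input);
        String tree = p.msg();
        p.skipHidden();
        if (p.pos != p.src.length()) throw p.error("end of input");
        return tree;
    }

    private IllegalArgumentException error(String expected) {
        return new IllegalArgumentException("expected " + expected + " at offset " + pos);
    }

    // WS and NL sit on the hidden channel, so they are skipped between tokens
    private void skipHidden() {
        while (pos < src.length() && " \t\r\n".indexOf(src.charAt(pos)) >= 0) pos++;
    }

    private boolean at(char c) { return pos < src.length() && src.charAt(pos) == c; }

    private void expect(char c) {
        skipHidden();
        if (!at(c)) throw error("'" + c + "'");
        pos++;
    }

    private static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    private static boolean isNameChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || isDigit(c);
    }

    // msg : '{' nameValuePairExpr* '}' -> ^(MSG nameValuePairExpr*)
    private String msg() {
        expect('{');
        StringBuilder kids = new StringBuilder();
        skipHidden();
        while (pos < src.length() && !at('}')) {
            kids.append(' ').append(pair());
            skipHidden();
        }
        expect('}');
        // in this sketch a childless node prints without parentheses
        return kids.length() == 0 ? "MSG" : "(MSG" + kids + ")";
    }

    // nameValuePairExpr : NAME '=' valueExpr -> ^(PAIR NAME valueExpr)
    private String pair() {
        skipHidden();
        int start = pos;
        while (pos < src.length() && isNameChar(src.charAt(pos))) pos++;
        if (pos == start) throw error("NAME");
        String name = src.substring(start, pos);
        expect('=');
        return "(PAIR " + name + " " + value() + ")";
    }

    // valueExpr : STR | INT | msg
    private String value() {
        skipHidden();
        if (at('{')) return msg();
        if (at('"')) {
            int end = src.indexOf('"', pos + 1);
            if (end < 0) throw error("closing quote");
            // same effect as the STR @after action: keep the text, drop the quotes
            String text = src.substring(pos + 1, end);
            pos = end + 1;
            return text;
        }
        int start = pos;
        while (pos < src.length() && isDigit(src.charAt(pos))) pos++;
        // digits followed by a letter would lex as NAME, which is not a value
        if (pos == start || (pos < src.length() && isNameChar(src.charAt(pos)))) {
            throw error("STR, INT or msg");
        }
        return src.substring(start, pos);
    }

    public static void main(String[] args) {
        System.out.println(parse("{ a=1 b=\"2\" f={ g=\"three four\" } }"));
    }
}
```

since whitespace is skipped between tokens but kept inside the quoted
text, {a=1} and { a=1 } give the same tree while "one two" keeps its
space.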

