[antlr-interest] how to parse a number and a unit
Gavin Lambert
antlr at mirality.co.nz
Fri Sep 26 16:35:46 PDT 2008
At 06:56 27/09/2008, Sven Prevrhal wrote:
>I want to parse a number and a unit - with the code below 'ingline'
>can parse 5.2sm But not 5.2 sm - why?
>
>If I add a rule
>space : (' ' | '\t')+;
>and modify ingline to
>ingline : float space? unit;
>then both lines, with or without space. However, the 'space' now
>appears in the parse tree. Now if I parse a line with multiple
>spaces like so
>5.2    sm
>this works too BUT the 'space' no longer appears in the parse tree
>(I use ANTLRWorks)!!!! I AM A BIG QUESTION MARK. And
>ingline : float WS? unit won't work at all.
[...]
>word : ~('\r' | '\n' | ' ' | '\t')+ ;
>unit : 'x ' | 'sm' | 'md' | 'lg' | 'cn' | 'pk' | 'pn' | 'dr'
> | 'ds' | 'ct' | 'bn' | 'sl' | 'ea' | 't ' | 'ts' | 'T '
> | 'tb' | 'fl' | 'c ' | 'pt' | 'qt' | 'ga' | 'oz' | 'lb'
> | 'ml' | 'cb' | 'cl' | 'dl' | 'l ' | 'mg' | 'cg' | 'dg'
> | 'g ' | 'kg' | ' ';
[...]
>WS : (' ' | '\t')+ {$channel = HIDDEN;} ;
I know that using character literals like this in the parser is
tempting, but when you're starting out using ANTLR it can lead to
just such confusing results. I think it's best to avoid them
entirely and just write explicit lexer rules. Not only does this
lead to less confusion, but you get to name the token types better
:)
First off, the reason why using "WS" in a parser rule doesn't work
is that the WS rule emits its token on a hidden channel, so it's
completely invisible to the parser -- it will simply never see a
WS, so can't possibly match it.
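To make the hidden-channel point concrete, here is a minimal Python sketch (a toy model, not ANTLR code). The Token class, the channel constants and visible_to_parser are names invented for this illustration; the only thing it models is that the parser reads tokens from the default channel and nothing else.

```python
from dataclasses import dataclass

# Channel numbers chosen for this sketch (assumption, not read from any grammar).
DEFAULT, HIDDEN = 0, 99


@dataclass
class Token:
    type: str
    text: str
    channel: int = DEFAULT


def visible_to_parser(tokens):
    """Return only the tokens on the default channel, i.e. what a parser would see."""
    return [t for t in tokens if t.channel == DEFAULT]


# What the lexer produces for "5.2 sm" when WS is sent to the hidden channel.
stream = [
    Token("FLOAT", "5.2"),
    Token("WS", " ", HIDDEN),
    Token("UNIT", "sm"),
]

parser_view = [t.type for t in visible_to_parser(stream)]
```

Since WS never shows up in parser_view, a parser rule that asks for WS has nothing to match against.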
Next, when you defined the 'space' rule: as this is a parser rule,
and you're using quoted literals, *each* quoted string becomes a
token in its own right. In other words, this:
space : (' ' | '\t')+;
is basically equivalent to this:
T52 : ' ';
T53 : '\t';
space : (T52 | T53)+;
Now, if you combine this with the lexer rule WS, what you end up
with is a lexer that says "ok, if there's a single space, then
it's a T52 and the parser can see it. If there's more than one
space, it's a WS and the parser can't see any of them." Hopefully
it's now obvious why you're getting the behaviour you're
describing. (The "word" and "unit" rules will cause similar
problems.)
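The behaviour described above can be mimicked with a short Python sketch. This is a toy model of the outcome only, not of how ANTLR actually chooses between rules; the regular expression, the FLOAT and UNIT patterns and the parser_view function are assumptions made up for the illustration.

```python
import re

# Toy model: exactly one blank becomes a visible literal token
# (T52 for ' ', T53 for '\t'); a run of two or more blanks becomes
# a WS token on the hidden channel and is dropped.
TOKEN_RE = re.compile(r"(?P<FLOAT>\d+\.\d+)|(?P<BLANKS>[ \t]+)|(?P<UNIT>[a-z]+)")


def parser_view(text):
    """Return the token types the parser would get to see for this input."""
    seen = []
    for m in TOKEN_RE.finditer(text):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "BLANKS":
            if len(lexeme) > 1:
                continue  # WS: hidden, the parser never sees it
            kind = "T52" if lexeme == " " else "T53"
        seen.append(kind)
    return seen
```

With one space the parser sees FLOAT, T52, UNIT, so the optional 'space' rule matches and shows up in the tree; with several spaces it sees only FLOAT, UNIT, so the tree has no 'space' node at all.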
The way I usually prefer to approach parsing tasks is to first
write a complete standalone lexer (and a set of unit tests to
ensure it's producing the set of tokens I'm expecting). Only when
that is done do I write the parser rules that then interpret the
resulting tokens.
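As a rough illustration of that lexer-first workflow, here is a small Python sketch rather than an ANTLR grammar. The token names FLOAT, UNIT and WS, the shortened unit list, and the choice to accept integers as FLOAT are all assumptions for the example; the point is only that the tokenizer can be tested on its own before any parsing logic exists.

```python
import re

# Hypothetical explicit lexer rules; the unit list is a shortened sample.
LEXER_RULES = [
    ("FLOAT", r"\d+(?:\.\d+)?"),
    ("UNIT", r"(?:sm|md|lg|kg|mg|oz|lb|ml)\b"),
    ("WS", r"[ \t]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in LEXER_RULES))


def tokenize(text):
    """Return (type, text) pairs, dropping WS; raise on unrecognised input."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":  # whitespace is consumed but not passed on
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def parse_ingline(text):
    """Parser step, written only after the lexer is trusted: FLOAT then UNIT."""
    tokens = tokenize(text)
    if [kind for kind, _ in tokens] != ["FLOAT", "UNIT"]:
        raise ValueError(f"expected FLOAT UNIT, got {tokens}")
    return float(tokens[0][1]), tokens[1][1]
```

Once the tokenizer passes its tests for input with no space, one space and many spaces, the parser rule no longer has to mention whitespace at all.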
Another potentially useful rule of thumb: for some inputs,
whitespace truly is significant, and you *shouldn't* hide it. For
others, it's not, and you should. In the latter case, sometimes
you will have cases (e.g. single-line comments) where the
whitespace needs to be treated as significant in one specific
area, but not globally; in these cases, you need to handle it all
in the lexer, since the lexer can still see the whitespace but the
parser cannot.
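The single-line comment case can be sketched in Python as follows. This is a hypothetical example: the '#' comment syntax, the rule names and parser_tokens are invented for the illustration. The newline matters only as the end of a comment, and that decision is made entirely inside the lexer.

```python
import re

# Hypothetical rules: '#' starts a comment that runs to the end of the line.
RULES = re.compile(
    r"(?P<COMMENT>#[^\r\n]*)"  # the newline ends the comment: decided in the lexer
    r"|(?P<WS>[ \t\r\n]+)"     # everywhere else whitespace, newlines included, is hidden
    r"|(?P<WORD>[^\s#]+)"      # anything else up to the next blank or '#'
)


def parser_tokens(text):
    """Return the (type, text) pairs that would reach the parser."""
    return [
        (m.lastgroup, m.group())
        for m in RULES.finditer(text)
        if m.lastgroup == "WORD"  # COMMENT and WS stay out of the parser's view
    ]
```

The parser receives the same tokens whether or not a line carries a trailing comment, and it never has to know where lines end.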