[antlr-interest] how to parse a number and a unit

Fri Sep 26 16:35:46 PDT 2008

At 06:56 27/09/2008, Sven Prevrhal wrote:
 >I want to parse a number and a unit - with the code below 'ingline'
 >can parse 5.2sm But not 5.2 sm - why?
 >
 >If I add a rule
 >space	:	(' ' | '\t')+;
 >and modify ingline to
 >ingline	:	float space? unit;
 >then both lines, with or without space. However, the 'space' now
 >appears in the parse tree. Now if I parse a  line with multiple
 >spaces like so
 >5.2     sm
 >this works too BUT the 'space' no longer appears in the parse tree
 >(I use ANTLRWorks)!!!! I AM A BIG QUESTION MARK. And
 >ingline	:	float WS? unit won't work at all.
[...]
 >word	:	~('\r' | '\n' | ' ' | '\t')+ ;
 >unit	:	'x ' | 'sm' | 'md' | 'lg' | 'cn' | 'pk' | 'pn' | 'dr'
 >      | 'ds' | 'ct' | 'bn' | 'sl' | 'ea' | 't ' | 'ts' | 'T '
 >      | 'tb' | 'fl' | 'c ' | 'pt' | 'qt' | 'ga' | 'oz' | 'lb'
 >      | 'ml' | 'cb' | 'cl' | 'dl' | 'l ' | 'mg' | 'cg' | 'dg'
 >      | 'g ' | 'kg' | ' ';
[...]
 >WS 	:	(' ' | '\t')+ {$channel = HIDDEN;} ;

I know that using character literals like this in the parser is 
tempting, but when you're starting out with ANTLR it can lead to 
exactly this kind of confusing result.  I think it's best to avoid 
them entirely and just write explicit lexer rules.  Not only does 
this lead to less confusion, but you also get to give the token 
types better names :)

First off, the reason why using "WS" in a parser rule doesn't work 
is that the WS rule emits its token on a hidden channel, so it's 
completely invisible to the parser -- it will simply never see a 
WS, so it can't possibly match one.

Next, when you defined the 'space' rule: as this is a parser rule, 
and you're using quoted literals, *each* quoted string becomes a 
token in its own right.  In other words, this:

space : (' ' | '\t')+;

is basically equivalent to this:

T52 : ' ';
T53 : '\t';
space : (T52 | T53)+;

Now, if you combine this with the lexer rule WS, what you end up 
with is a lexer that says "ok, if there's a single space, then 
it's a T52 and the parser can see it.  If there's more than one 
space, it's a WS and the parser can't see any of them."  Hopefully 
it's now obvious why you're getting the behaviour you're 
describing.  (The "word" and "unit" rules will cause similar 
problems.)
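
As a rough sketch of what a version with explicit lexer rules might 
look like (the grammar name, the FLOAT and DIGIT rules and the 
shortened unit list are my assumptions rather than anything taken 
from your grammar, and the units written with a trailing space, 
such as 'x ', are left out):

```antlr
grammar Ingline;   // hypothetical grammar name

// Parser rules refer only to named tokens; no quoted literals here,
// so ANTLR does not generate any implicit tokens for spaces.
ingline : FLOAT UNIT ;

// Lexer rules. Only a handful of the original unit codes are shown.
UNIT  : 'sm' | 'md' | 'lg' | 'kg' | 'oz' | 'lb' ;
FLOAT : DIGIT+ ('.' DIGIT+)? ;

// The only rule that matches spaces and tabs, so every run of
// whitespace becomes a hidden WS token, whatever its length.
WS    : (' ' | '\t')+ { $channel = HIDDEN; } ;

// Helper rule; a fragment never produces a token by itself.
fragment DIGIT : '0'..'9' ;
```

With WS as the only rule that can match spaces and tabs, "5.2sm", 
"5.2 sm" and "5.2     sm" should all reach the parser as the same 
two tokens, FLOAT then UNIT.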

The way I usually prefer to approach parsing tasks is to first 
write a complete standalone lexer (and a set of unit tests to 
ensure it's producing the set of tokens I'm expecting).  Only when 
that is done do I write the parser rules that then interpret the 
resulting tokens.

Another potentially useful rule of thumb: for some inputs, 
whitespace truly is significant, and you *shouldn't* hide it.  For 
others, it's not, and you should.  In the latter case, there are 
sometimes constructs (e.g. single-line comments) where the 
whitespace needs to be treated as significant in one specific 
area, but not globally; in these cases, you need to handle it all 
in the lexer, since the lexer can still see the whitespace but the 
parser cannot.
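
For the single-line comment case, one common ANTLR 3 idiom looks 
something like the following.  The '//' comment syntax and the 
rule name are assumptions for illustration only:

```antlr
// Everything from '//' up to and including the end of the line,
// with any spaces and tabs in between, is consumed as one token
// and then hidden from the parser.
LINE_COMMENT
    : '//' ~('\n' | '\r')* '\r'? '\n' { $channel = HIDDEN; }
    ;

// Everywhere else, runs of spaces and tabs are hidden as before.
WS  : (' ' | '\t')+ { $channel = HIDDEN; } ;
```

Here the newline that ends the comment matters to the lexer rule, 
even though the parser never sees the comment or the whitespace 
inside it.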