[antlr-interest] Re: Differentiating keywords in parser and identifiers in lexers

Thu Jan 30 03:45:20 PST 2003

> yup i did. my codes are as follows:

OK.

[Once again, I have just looked at the code so I might have missed 
something.]

I looked at the code and applied it to the only sample you've given 
of the data you wish to parse (from Msg# 7128):

angle focus : 0.0005
color : blue
line width : 12

Looking at your grammar, this data generates errors because:
1. "angle focus" is not a keyword (keyphrase?), but "angle factor" is.
2. "color" is not a keyword, but "initial color" is.
3. "line width" is not a keyword, but "initial line width" is.
4. "0.0005" will be matched as two NUMERIC tokens.

The problem in [4] above lies within the following rules (I assume 
that [1], [2] and [3] are obvious):

protected SYMBOL : ('-'|'+'|'%'|'/'|'.'|',')+;
protected DIGIT   : ( '0'..'9' ) ; 
NUMERIC :(SYMBOL)? (DIGIT)+ (SYMBOL DIGIT)* 

This says that a NUMERIC has:
a) an optional leading SYMBOL followed by
b) one or more DIGITs followed by
c) zero or more SYMBOL-DIGIT sequences

E.g. (in the examples, SY means SYMBOL and DG means DIGIT):

Hence the following are VALID numerics:
1) %0000.9+8/%..+7 ==> %=[SY] 0000=[(DG)+] .9+8/%..+7=[(SY DG)*]
                       { since .9=[SY DG] +8=[SY DG] /%..+7=[SY DG] }

2) 10 ==> no-leading-SY 10=[(DG)+]

3) -10 ==> -=[SY] 10=[(DG)+]

4) /+,,-%..10 ==> /+,,-%..=[SY] 10=[(DG)+]

And the following are INVALID numerics (each is split into two 
NUMERIC tokens):
1) 0.0005 ==> no-leading-SY 0=[(DG)+] .0=[(SY DG)*] <NUMERIC-MATCHED>
              no-leading-SY 005=[(DG)+]             <NUMERIC-MATCHED>

2) -10.10 ==> -=[SY] 10=[(DG)+] .1=[(SY DG)*] <NUMERIC-MATCHED>
              no-leading-SY 0=[(DG)+]         <NUMERIC-MATCHED>
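
The traces above can be checked without running ANTLR by approximating 
the original rule with a regular expression. This is only a sketch: 
ORIGINAL_NUMERIC and tokenize are names invented for the illustration, 
and a regex engine is not an ANTLR lexer, although for these inputs 
both should consume the same characters.

```python
import re

# Regex approximation of the original rules (an assumption, not ANTLR output):
#   SYMBOL  : ('-'|'+'|'%'|'/'|'.'|',')+
#   NUMERIC : (SYMBOL)? (DIGIT)+ (SYMBOL DIGIT)*
SYMBOL = r"[-+%/.,]+"
ORIGINAL_NUMERIC = re.compile(rf"(?:{SYMBOL})?[0-9]+(?:{SYMBOL}[0-9])*")


def tokenize(text):
    """Match NUMERIC repeatedly from the current position, as a lexer would."""
    tokens, pos = [], 0
    while pos < len(text):
        m = ORIGINAL_NUMERIC.match(text, pos)
        if m is None:
            raise ValueError(f"no NUMERIC at position {pos}: {text[pos:]!r}")
        tokens.append(m.group())
        pos = m.end()
    return tokens


print(tokenize("0.0005"))           # ['0.0', '005']
print(tokenize("-10.10"))           # ['-10.1', '0']
print(tokenize("%0000.9+8/%..+7"))  # ['%0000.9+8/%..+7']
```

The split happens because each SYMBOL in the trailing loop may be 
followed by exactly one DIGIT, so the loop stops as soon as a second 
digit appears.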

So the keywords are being correctly identified as keywords and not 
identifiers, but some of your rules need more work to better reflect 
the true syntax/semantics of the language (i.e. the data) you are 
trying to parse.

I suspect for instance that NUMERIC would be better written as:

protected SIGN   : ('-'|'+');
protected SYMBOL : (SIGN|'%'|'/'|'.');
protected DIGIT  : ( '0'..'9' ) ; 
NUMERIC          : (SIGN)? (DIGIT)+ (SYMBOL (DIGIT)+)* ;

This says that a NUMERIC has:
a) an optional leading SIGN followed by
b) one or more DIGITs followed by
c) zero or more sequences of: a SYMBOL followed by one or more DIGITs
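
As a rough check of the suggested rule, here is the same idea written 
as a regular expression. REVISED_NUMERIC and is_single_numeric are 
made-up names, and the regex is only an approximation of what the 
ANTLR lexer would do.

```python
import re

# Regex approximation of the suggested rules (illustrative only):
#   SIGN    : ('-'|'+')
#   SYMBOL  : (SIGN|'%'|'/'|'.')
#   NUMERIC : (SIGN)? (DIGIT)+ (SYMBOL (DIGIT)+)*
REVISED_NUMERIC = re.compile(r"[-+]?[0-9]+(?:[-+%/.][0-9]+)*")


def is_single_numeric(text):
    """True if the whole text would be matched as one NUMERIC."""
    return REVISED_NUMERIC.fullmatch(text) is not None


for sample in ["0.0005", "-10.10", "10", "/+,,-%..10", "1.2.3", "10+5"]:
    print(sample, is_single_numeric(sample))
```

Note that the rule is still permissive: inputs such as 1.2.3 and 10+5 
are accepted as a single NUMERIC, so it may need tightening once the 
real meaning of each symbol is known.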

I removed ',' from SYMBOL, because I didn't know what it meant. You 
would know that and should put it back in the right place. If it 
means "comma-separated list of NUMERIC", perhaps it should be handled 
in the parser?
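
If ',' does turn out to be a list separator, the split of work between 
lexer and parser could look roughly like this. Everything here is 
hypothetical: COMMA as a separate token, the numeric_list rule, and 
the function names are assumptions made for the sketch, not part of 
your grammar.

```python
import re

# Hypothetical lexer: NUMERIC (as in the suggested rule) and a separate COMMA.
TOKEN = re.compile(
    r"\s*(?:(?P<NUMERIC>[-+]?[0-9]+(?:[-+%/.][0-9]+)*)|(?P<COMMA>,))"
)


def lex(text):
    """Return a list of (token_type, text) pairs."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected input at position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


def parse_numeric_list(tokens):
    """Parser-level rule:  numeric_list : NUMERIC (COMMA NUMERIC)* ;"""
    def expect(i, kind):
        if i >= len(tokens) or tokens[i][0] != kind:
            raise SyntaxError(f"expected {kind} at token {i}")
        return tokens[i][1]

    values = [expect(0, "NUMERIC")]
    i = 1
    while i < len(tokens):
        expect(i, "COMMA")
        values.append(expect(i + 1, "NUMERIC"))
        i += 2
    return values


print(parse_numeric_list(lex("1.5, -2, 12")))  # ['1.5', '-2', '12']
```

Keeping the comma out of NUMERIC means a malformed list such as a 
trailing comma is reported by the parser instead of being silently 
absorbed into a token.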

Once again, have a look at Ter's Getting Started guide and the 
tutorials it links to. Also look at the examples directory for ideas. 
The Java grammar, for instance, has a good, well-tested rule for 
C/C++-style multiline comments. Your version would report incorrect 
line numbers for everything after a multiline comment.
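
To illustrate the line-number point: an ANTLR 2 lexer only advances 
its line counter when a rule action calls newline(), so a comment rule 
that swallows line breaks without doing so leaves every later token 
with a line number that is too small. I am assuming that is what is 
missing from your rule. The Python below is just a toy model of that 
effect; token_lines and its flag are invented names.

```python
def token_lines(source, count_newlines_in_comments):
    """Return (word, line) pairs, skipping /* ... */ comments."""
    line, pos, out = 1, 0, []
    while pos < len(source):
        if source.startswith("/*", pos):
            end = source.find("*/", pos + 2)
            if end == -1:
                raise SyntaxError("unterminated comment")
            if count_newlines_in_comments:
                # Analogue of calling newline() for each line break in the comment.
                line += source.count("\n", pos, end)
            pos = end + 2
        elif source[pos] == "\n":
            line += 1
            pos += 1
        elif source[pos].isspace():
            pos += 1
        else:
            start = pos
            while (pos < len(source)
                   and not source[pos].isspace()
                   and not source.startswith("/*", pos)):
                pos += 1
            out.append((source[start:pos], line))
    return out


SAMPLE = "color\n/* a\n   two-line comment */\nwidth\n"
print(token_lines(SAMPLE, True))   # [('color', 1), ('width', 4)]
print(token_lines(SAMPLE, False))  # [('color', 1), ('width', 3)]
```

With the flag off, "width" is reported on line 3 although it is on 
line 4 of the input, and the error grows with every further multiline 
comment.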

Micheal
