[antlr-interest] Re: Differentiating keywords in parser and identifiers in lexers
micheal_jor <open.zone at virgin.net>
Thu Jan 30 03:45:20 PST 2003
> yup i did. my codes are as follows:
OK.
[Once again, I have just looked at the code so I might have missed
something.]
I looked at the code and applied it to the only sample you've given
of the data you wish to parse (from Msg# 7128):
angle focus : 0.0005
color : blue
line width : 12
Looking at your grammar, this data generates errors because:
1. "angle focus" is not a keyword (keyphrase?), but "angle factor" is.
2. "color" is not a keyword, but "initial color" is.
3. "line width" is not a keyword, but "initial line width" is.
4. "0.0005" will be matched as two NUMERIC tokens.
The problem in [4] above lies within the following rules (I assume
that [1], [2] and [3] are obvious):
protected SYMBOL : ('-'|'+'|'%'|'/'|'.'|',')+;
protected DIGIT : ( '0'..'9' ) ;
NUMERIC : (SYMBOL)? (DIGIT)+ (SYMBOL DIGIT)* ;
This says that a NUMERIC has:
a) an optional leading SYMBOL followed by
b) one or more DIGITs followed by
c) zero or more SYMBOL-DIGIT sequences
Hence the following are VALID numerics (in the examples, SY means
SYMBOL and DG means DIGIT):
1) %0000.9+8/%..+7 ==> %=[SY] 0000=[(DG)+] .9+8/%..+7=[(SY DG)*]
{ since .9=[SY DG] +8=[SY DG] /%..+7=[SY DG] }
2) 10 ==> no-leading-SY 10=[(DG)+]
3) -10 ==> -=[SY] 10=[(DG)+]
4) /+,,-%..10 ==> /+,,-%..=[SY] 10=[(DG)+]
And the following are INVALID numerics (each one is split into two
NUMERIC tokens):
1) 0.0005 ==> no-leading-SY 0=[(DG)+] .0=[(SY DG)*] <NUMERIC-MATCHED>
no-leading-SY 005=[(DG)+] <NUMERIC-MATCHED>
2) -10.10 ==> -=[SY] 10=[(DG)+] .1=[(SY DG)*] <NUMERIC-MATCHED>
no-leading-SY 0=[(DG)+] <NUMERIC-MATCHED>
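If you want to check these by hand, here is a rough sketch that transcribes the original rules into a Python regular expression. This is only an approximation of what the ANTLR lexer does (a regex can backtrack where the generated lexer would not), and the helper name numeric_tokens is made up for illustration, but for the inputs above it splits the text the same way:

```python
import re

# Regex transcription of the ORIGINAL lexer rules. This approximates ANTLR's
# greedy matching; it is not ANTLR itself.
SYMBOL = r"[-+%/.,]+"   # protected SYMBOL : ('-'|'+'|'%'|'/'|'.'|',')+ ;
DIGIT = r"[0-9]"        # protected DIGIT  : ( '0'..'9' ) ;
# NUMERIC : (SYMBOL)? (DIGIT)+ (SYMBOL DIGIT)* ;
NUMERIC = re.compile(rf"(?:{SYMBOL})?(?:{DIGIT})+(?:{SYMBOL}{DIGIT})*")


def numeric_tokens(text):
    """Return the successive NUMERIC matches found in text (hypothetical helper)."""
    return [m.group(0) for m in NUMERIC.finditer(text)]


print(numeric_tokens("0.0005"))           # ['0.0', '005']
print(numeric_tokens("-10.10"))           # ['-10.1', '0']
print(numeric_tokens("%0000.9+8/%..+7"))  # ['%0000.9+8/%..+7']
```

The trailing (SYMBOL DIGIT)* consumes exactly one digit after each SYMBOL, so the remaining digits start a fresh NUMERIC.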
So, the keywords are being correctly identified as keywords and not
identifiers, but some of your rules need more work to better reflect
the true syntax/semantics of the language (i.e. the data) you are
trying to parse.
I suspect for instance that NUMERIC would be better written as:
protected SIGN : ('-'|'+');
protected SYMBOL : (SIGN|'%'|'/'|'.');
protected DIGIT : ( '0'..'9' ) ;
NUMERIC : (SIGN)? (DIGIT)+ (SYMBOL (DIGIT)+)* ;
This says that a NUMERIC has:
a) an optional leading SIGN followed by
b) one or more DIGITs followed by
c) zero or more sequences of: a SYMBOL followed by one or more DIGITs
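As a quick sanity check of the suggested rule, here is the same kind of sketch in Python. Again this is a regex approximation and not ANTLR output, and is_single_numeric is a made-up helper name:

```python
import re

# Regex transcription of the SUGGESTED lexer rules (an approximation only).
SIGN = r"[-+]"      # protected SIGN   : ('-'|'+') ;
SYMBOL = r"[-+%/.]" # protected SYMBOL : (SIGN|'%'|'/'|'.') ;
DIGIT = r"[0-9]"    # protected DIGIT  : ( '0'..'9' ) ;
# NUMERIC : (SIGN)? (DIGIT)+ (SYMBOL (DIGIT)+)* ;
NUMERIC = re.compile(rf"(?:{SIGN})?(?:{DIGIT})+(?:{SYMBOL}(?:{DIGIT})+)*")


def is_single_numeric(text):
    """True if the whole text is one NUMERIC under the suggested rule."""
    return NUMERIC.fullmatch(text) is not None


print(is_single_numeric("0.0005"))      # True
print(is_single_numeric("-10.10"))      # True
print(is_single_numeric("/+,,-%..10"))  # False
```

Note that the rule is still fairly permissive: something like 1.2.3 or 1/2 is also a single NUMERIC. Whether that is what you want depends on your data.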
I removed ',' from SYMBOL, because I didn't know what it meant. You
would know that and should put it back in the right place. If it
means "comma-separated list of NUMERIC", perhaps it should be handled
in the parser?
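To illustrate what "handled in the parser" could look like, here is a small Python sketch. It assumes that ',' really does separate NUMERIC values in your data, which I can't tell from the sample. The lexer produces separate NUMERIC and COMMA tokens, and a parser-level function (mimicking a rule like NUMERIC (COMMA NUMERIC)*) assembles the list. The names tokenize and parse_numeric_list are invented for this example:

```python
import re

# Lexer level: NUMERIC (suggested rule), COMMA and whitespace are separate tokens.
TOKEN = re.compile(
    r"(?P<NUMERIC>[-+]?[0-9]+(?:[-+%/.][0-9]+)*)|(?P<COMMA>,)|(?P<WS>\s+)"
)


def tokenize(text):
    """Split text into (kind, text) pairs, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group(0)))
        pos = m.end()
    return tokens


def parse_numeric_list(tokens):
    """Parser level, mimicking: numericList : NUMERIC (COMMA NUMERIC)* ;"""
    values, expect_numeric = [], True
    for kind, text in tokens:
        if expect_numeric and kind == "NUMERIC":
            values.append(text)
        elif not expect_numeric and kind == "COMMA":
            pass  # separator only, carries no value
        else:
            raise ValueError(f"unexpected {kind} token {text!r}")
        expect_numeric = not expect_numeric
    if expect_numeric:
        raise ValueError("list must end with a NUMERIC")
    return values


print(parse_numeric_list(tokenize("0.0005, 12, -10.10")))
# ['0.0005', '12', '-10.10']
```

Keeping the comma out of the lexer rule means each NUMERIC token stays a single value, and the list structure is visible in the grammar.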
Once again, have a look at Ter's Getting Started guide and the
tutorials it links to. Also look at the examples directory for ideas.
The Java grammar, for instance, has a good, well-tested rule for
C/C++-style multiline comments. Your version would report incorrect
line numbers for all rules after a multiline comment.
Micheal