[antlr-interest] Simple (should be) lexer Q

Gerald B. Rosenberg gbr at newtechlaw.com
Fri Jun 10 09:47:20 PDT 2005


Should be simple, but I cannot see the problem.

Input is:

name is John zero.
name is John A. Smith one.
name is John Smith two.
name is John three.
name is John A. Smith, Inc. four.
name is John Smith, Inc. five.
name is John, Inc. six.
the Acme Corp. Universal is seven.
the Acme Corp. is eight.
the Acme Corporation is nine.
the Acme System, Ltd. is ten.

The goal is for each obvious person or company name to come out as a
single NAME token.  This works for names that contain a comma; otherwise,
each capitalized word comes out as a separate NAME token.  Obviously (I
think) the middle alternative of the NAME rule is not working, but why?
Or, better, how do I fix it?

(Making the COMMA WS CAPWORD phrase optional and removing the middle 
element produces the same results.)
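
The intended result can be sketched outside ANTLR.  The Python below is
only an approximation of the NAME rule as a regular expression, not the
lexer itself: it assumes WS means spaces or tabs (the WS rule is not shown
in this post), and the helper name names() is made up for illustration.

```python
import re

# Assumed regex equivalent of CAPWORD: UPPERLETTER (LETTER)* (PERIOD)?
CAPWORD = r"[A-Z][A-Za-z]*\.?"

# Approximation of NAME: CAPWORD (WS CAPWORD)* with an optional COMMA WS CAPWORD tail.
# WS is assumed to be one or more spaces or tabs.
NAME = re.compile(rf"\b{CAPWORD}(?:[ \t]+{CAPWORD})*(?:,[ \t]+{CAPWORD})?")

def names(line):
    """Return every NAME-like span found in one input line (hypothetical helper)."""
    return NAME.findall(line)
```

Under those assumptions, names("name is John A. Smith, Inc. four.") gives
a single span, "John A. Smith, Inc.", which is what the NAME token should
contain.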

Thanks,
Gerald


NAME:
       ( CAPWORD (WS CAPWORD)* COMMA WS CAPWORD ) =>
         CAPWORD (WS CAPWORD)* COMMA WS CAPWORD
     | ( CAPWORD (WS CAPWORD)+ ) => CAPWORD (WS CAPWORD)+
     |   CAPWORD
;


protected
CAPWORD:
     UPPERLETTER (LETTER)* (PERIOD)?
;

protected
LETTER:
     UPPERLETTER | LOWERLETTER
;

protected
UPPERLETTER:
     'A'..'Z'
;

protected
LOWERLETTER:
     'a'..'z'
;

protected
PERIOD:
     '.'
;
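
One possible explanation, offered as an unverified assumption: if the
lexer runs with one character of lookahead, the (WS CAPWORD)+ loop inside
the predicate decides to iterate as soon as it sees whitespace, and the
whole predicate then fails when the next word is lowercase.  The Python
below hand-simulates that behavior for the middle alternative only.  It is
not ANTLR-generated code; WS is assumed to be a single space, and the
function names are made up.

```python
import string

def capword(s, i):
    """Match UPPERLETTER (LETTER)* (PERIOD)? at s[i:]; return the end index or None."""
    if i >= len(s) or s[i] not in string.ascii_uppercase:
        return None
    i += 1
    while i < len(s) and s[i] in string.ascii_letters:
        i += 1
    if i < len(s) and s[i] == ".":
        i += 1
    return i

def alt2_predicate(s, i=0):
    """Simulate ( CAPWORD (WS CAPWORD)+ ) => with a one-character loop decision."""
    i = capword(s, i)
    if i is None:
        return False
    iterations = 0
    # One character of lookahead: seeing WS is taken as "enter the loop".
    while i < len(s) and s[i] == " ":
        j = capword(s, i + 1)
        if j is None:
            # Already committed to this iteration, so the predicate fails outright.
            return False
        i = j
        iterations += 1
    return iterations >= 1
```

In this simulation, alt2_predicate("John Smith two.") fails because the
loop commits to the space before "two", while alt2_predicate("John Smith,
Inc.") succeeds because the comma ends the loop.  That would match the
symptom above, where only names containing a comma come out whole.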

----
Gerald B. Rosenberg, Esq.
NewTechLaw
285 Hamilton Avenue, Suite 520
Palo Alto, CA  94301-2576

650.325.2100  (office)  /  650.703.1724  (cell)
650.325.2107  (fax)

www.newtechlaw.com