[antlr-interest] Q re use of Semantic Predicates

Mon Jun 6 10:32:18 PDT 2005

At 08:54 AM 6/6/2005, John D. Mitchell wrote:

> >>>>> "Gerald" == Gerald B Rosenberg <gbr at newtechlaw.com> writes:
>[...]
>
> > XMLTAG: '<' { this.inXmlTag = true; } WORD WS ATTR ('/')? '>'
> >         { this.inXmlTag = false; }
> > ;
>
> > WORD: ( {this.inXmlTag}? ( LETTERS | PUNCT1 ) )
> >     | ( LETTERS | NUMBERS | PUNCT2 )
> > ;
>
>Is there some reason why you can't just have two different rules?  I.e.,
>WORD and TAGWORD, or somesuch?
>
>Have fun,
>         John

The problem is that the lexer cannot tell whether a string of characters 
is a WORD or a TAGWORD; some character sequences validly fit both 
definitions.  WORD is not, however, a true superset of TAGWORD.  The 
result is that the parser receives a mix of WORD and TAGWORD tokens, and 
accepting both as alternatives in the parser is not correct.

For example, both include the upper- and lowercase letters.  A double back 
quote character is valid in WORD but not in TAGWORD.  A colon is valid in 
TAGWORD but not in WORD.  However, TAGWORDs only ever occur within XMLTAGs.

I need to distinguish WORD from TAGWORD in a character stream such as:

<Paragraph num="3">Turning now to paragraph 3 of the text ...
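
To make the overlap concrete, here is a minimal sketch in plain Java of a 
context-sensitive scanner. It is hand-written, not ANTLR output, and only 
illustrates the idea behind the inXmlTag flag. The class name ContextScanner 
and both character classes are assumptions: letters are shared, ':' is 
treated as TAGWORD-only, and a single '`' stands in for the WORD-only back 
quote, since the real PUNCT1 and PUNCT2 sets are not given here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner for illustration; not generated by ANTLR.
final class ContextScanner {

    // Assumed TAGWORD alphabet: letters plus ':' (stand-in for PUNCT1).
    static boolean isTagWordChar(char c) {
        return Character.isLetter(c) || c == ':';
    }

    // Assumed WORD alphabet: letters, digits, and '`' (stand-in for PUNCT2).
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '`';
    }

    // Classifies runs of characters as WORD or TAGWORD depending on whether
    // the scanner is currently between '<' and '>'.
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        boolean inXmlTag = false; // plays the role of this.inXmlTag
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '<') { inXmlTag = true; i++; continue; }
            if (c == '>') { inXmlTag = false; i++; continue; }
            boolean startsToken = inXmlTag ? isTagWordChar(c) : isWordChar(c);
            if (!startsToken) {
                // Whitespace, '=', quotes, and attribute values are skipped
                // here for brevity.
                i++;
                continue;
            }
            int begin = i;
            while (i < input.length()
                    && (inXmlTag ? isTagWordChar(input.charAt(i))
                                 : isWordChar(input.charAt(i)))) {
                i++;
            }
            String kind = inXmlTag ? "TAGWORD" : "WORD";
            tokens.add(kind + "(" + input.substring(begin, i) + ")");
        }
        return tokens;
    }
}
```

With these assumed character classes, ContextScanner.scan("<a:b>") yields 
[TAGWORD(a:b)], while ContextScanner.scan("a:b") yields [WORD(a), WORD(b)]: 
the same letters are classified differently purely because of the flag.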

Ideally, I need the lexer to recognize the XMLTAG tokens and absorb them 
completely, so that the parser never sees them.  I just need to capture the 
attribute values, akin to line and column numbers, for subsequent use in 
the tree-walker.
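
As a rough illustration of that goal, the following plain-Java sketch 
swallows each tag and attaches its attribute values to the words that 
follow, much as a token carries its line and column. Everything named here 
(TagAbsorbingScanner, Word, tagAttrs, the simplified regular expressions) is 
hypothetical, and attaching the most recent tag's attributes to every later 
word is just one possible policy; none of it is ANTLR's API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: tags never reach the consumer, only their attributes do.
final class TagAbsorbingScanner {

    // A word plus the attributes of the most recently seen tag.
    static final class Word {
        final String text;
        final Map<String, String> tagAttrs;

        Word(String text, Map<String, String> tagAttrs) {
            this.text = text;
            this.tagAttrs = tagAttrs;
        }

        @Override
        public String toString() {
            return text + tagAttrs;
        }
    }

    // Assumed, simplified shapes: a tag is '<' ... '>', a word is any run of
    // characters that contains neither whitespace nor '<'.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^\\s<]+");
    private static final Pattern ATTR =
            Pattern.compile("([A-Za-z:]+)=\"([^\"]*)\"");

    static List<Word> scan(String input) {
        List<Word> words = new ArrayList<>();
        Map<String, String> current = new LinkedHashMap<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String t = m.group();
            if (t.startsWith("<")) {
                // Absorb the tag; remember only its attribute values.
                current = new LinkedHashMap<>();
                Matcher a = ATTR.matcher(t);
                while (a.find()) {
                    current.put(a.group(1), a.group(2));
                }
            } else {
                words.add(new Word(t, current));
            }
        }
        return words;
    }
}
```

For the sample stream, TagAbsorbingScanner.scan("<Paragraph num=\"3\">Turning now") 
returns two words, each carrying num=3, and no token for the tag itself.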

Have I missed something in how to set this up using two different rules?  
What is the best way to do what I am trying to do?

Appreciate the help,
Gerald
