[antlr-interest] Q re use of Semantic Predicates
Gerald B. Rosenberg
gbr at newtechlaw.com
Mon Jun 6 10:32:18 PDT 2005
At 08:54 AM 6/6/2005, John D. Mitchell wrote:
> >>>>> "Gerald" == Gerald B Rosenberg <gbr at newtechlaw.com> writes:
>[...]
>
> > XMLTAG: '<' { this.inXmlTag = true; } WORD WS ATTR ('/')? '>'
> >     { this.inXmlTag = false; }
> > ;
>
> > WORD: ( {this.inXmlTag}? ( LETTERS | PUNCT1 ) )
> >     | ( LETTERS | NUMBERS | PUNCT2 )
> > ;
>
>Is there some reason why you can't just have two different rules? I.e.,
>WORD and TAGWORD, or somesuch?
>
>Have fun,
> John

The problem is that the lexer cannot tell whether a string of characters is
a WORD or a TAGWORD: some character sequences validly fit both definitions,
yet WORD is not a true superset of TAGWORD. As a result, the parser receives
both WORD and TAGWORD tokens, and accepting the two as alternatives in the
parser is not correct.

For example, both rules include the upper- and lower-case letters. A double
back quote character is valid in WORD but not in TAGWORD. A colon is valid in
TAGWORD but not in WORD. However, TAGWORDs occur only within XMLTAGs.

I need to distinguish WORD from TAGWORD in a character stream such as:

<Paragraph num="3">Turning now to paragraph 3 of the text ...
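
To make the context dependence concrete, here is a minimal sketch in plain Python (not ANTLR) of a hand-written scanner that picks the word character set from an "inside a tag" flag, much like the inXmlTag flag in the quoted rules. The character sets are assumptions: only the shared letters, the colon (TAGWORD only) and the back quote (WORD only) come from the description above. The names tokenize, LT, GT and OTHER are hypothetical.

```python
import string

# Assumed character classes, for illustration only.
LETTERS = set(string.ascii_letters)
TAGWORD_CHARS = LETTERS | {":"}                      # colon: TAGWORD only
WORD_CHARS = LETTERS | set(string.digits) | {"`"}    # back quote: WORD only


def tokenize(text):
    """Return (type, text) pairs, choosing the word class by context."""
    tokens = []
    in_tag = False  # plays the role of the inXmlTag flag
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "<":
            in_tag = True
            tokens.append(("LT", ch))
            i += 1
        elif ch == ">":
            in_tag = False
            tokens.append(("GT", ch))
            i += 1
        else:
            allowed = TAGWORD_CHARS if in_tag else WORD_CHARS
            kind = "TAGWORD" if in_tag else "WORD"
            j = i
            while j < len(text) and text[j] in allowed:
                j += 1
            if j == i:
                # Not a word character in this context (space, quote, ...).
                tokens.append(("OTHER", ch))
                j = i + 1
            else:
                tokens.append((kind, text[i:j]))
            i = j
    return tokens
```

With these assumed sets, the same letters come out as TAGWORD inside angle brackets and as WORD outside them, so the two token types never compete for the same input.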

Ideally, I need the lexer to recognize the XMLTAG tokens and absorb them
completely, so that the parser never sees them. I just need to capture the
attribute values, akin to line and column numbers, for subsequent use in the
tree-walker.
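
As an illustration of that goal, the following is a rough Python sketch (not ANTLR) in which tags are consumed by the scanner and never emitted, while the attribute values of the most recent tag are attached to each following word token. The regular expressions, the Token class, and the lex function are all hypothetical. The sketch also assumes, as a simplification, that every tag, including a closing tag, replaces the current attribute set.

```python
import re
from dataclasses import dataclass, field

# Assumed tag and attribute shapes; real input may need stricter patterns.
TAG = re.compile(r'<[^<>]*>')
ATTR = re.compile(r'([A-Za-z:]+)="([^"]*)"')


@dataclass
class Token:
    text: str
    attrs: dict = field(default_factory=dict)  # carried like line/column info


def lex(text):
    """Emit word tokens only; tags are absorbed and update the attributes."""
    tokens = []
    current = {}
    pos = 0
    for match in TAG.finditer(text):
        # Words before this tag carry the attributes seen so far.
        for word in text[pos:match.start()].split():
            tokens.append(Token(word, dict(current)))
        # The tag itself produces no token; only its attributes are kept.
        current = dict(ATTR.findall(match.group()))
        pos = match.end()
    for word in text[pos:].split():
        tokens.append(Token(word, dict(current)))
    return tokens
```

For example, lexing `<Paragraph num="3">Turning now` under these assumptions yields two word tokens, each carrying `{"num": "3"}`, and no token for the tag itself.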

Have I missed something in how to set it up using two different rules? How
best to do what I am trying to do?

Appreciate the help,
Gerald
----
Gerald B. Rosenberg, Esq.
NewTechLaw
285 Hamilton Avenue, Suite 520
Palo Alto, CA 94301-2576
650.325.2100 (office) / 650.703.1724 (cell)
650.325.2107 (fax)
www.newtechlaw.com