[antlr-interest] Non-disjoint tokens
Harald Mueller
harald_m_mueller at gmx.de
Sun Nov 25 04:02:28 PST 2007
Hi -
I think this is one of the real FAQs with ANTLR ... ANTLR computes a minimal k to disambiguate the tokens AS PRESENT IN THE GRAMMAR. '<' and '<HTML>' can be distinguished by looking just one more character ahead - so k=2 is enough. Even with syntactic predicates, this behavior does not change (why? - probably Terence has explained that somewhere - it's a feature):
HTML: ('<HTML>') => '<HTML>';
LT: '<';
does not help either: the syntactic predicate is reduced to the same 'starts with "<H"' condition, so the lexer still commits to HTML after seeing only "<H".
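To make the failure mode concrete, here is a toy model in plain Java of a decision that looks only two characters ahead. It is a sketch of the behavior described above, not ANTLR's generated lexer code; the class and method names (FixedLookaheadDemo, predictWithK2, matchesHtml) are made up for illustration.

```java
// Toy model only: NOT ANTLR's generated code. All names are hypothetical.
final class FixedLookaheadDemo {
    static final String HTML_LITERAL = "<HTML>";

    // Character i positions ahead of pos (1-based), or -1 past the end of input.
    static int la(String input, int pos, int i) {
        int idx = pos + i - 1;
        return idx < input.length() ? input.charAt(idx) : -1;
    }

    // Choose a token type using only two characters of lookahead.
    static String predictWithK2(String input, int pos) {
        if (la(input, pos, 1) == '<' && la(input, pos, 2) == 'H') {
            return "HTML";
        }
        return la(input, pos, 1) == '<' ? "LT" : "OTHER";
    }

    // After HTML has been predicted, the whole literal still has to match.
    static boolean matchesHtml(String input, int pos) {
        return input.startsWith(HTML_LITERAL, pos);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"<HTML>", "<HTMNOTL>", "<NOTHING"}) {
            String predicted = predictWithK2(s, 0);
            boolean ok = !predicted.equals("HTML") || matchesHtml(s, 0);
            System.out.println(s + " -> " + predicted + (ok ? "" : " (match fails)"));
        }
    }
}
```

In this model "<HTMNOTL>" is predicted as HTML because it starts with "<H", and the literal match then fails partway through, which corresponds to the swallowed "<HT" reported in the original question.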
What does help are semantic predicates (essentially, "arbitrary conditions"):
HTML: {input.LA(1)=='<' &&
input.LA(2)=='H' &&
input.LA(3)=='T' &&
input.LA(4)=='M' &&
input.LA(5)=='L' &&
input.LA(6)=='>'
}? => '<HTML>';
LT: '<';
If there is any other way to do this, I'd also like to know it!!
(and I'm not sure what happens at the end of the input with the above condition: say the end of the file is ...<HT> - would the access to input.LA(6) crash? - I think not, but I did not try it...).
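On the end-of-input question, the following standalone Java sketch shows the check the predicate performs, written as a loop over the literal. It assumes a lookahead function that returns -1 once it runs past the end of the input instead of throwing; as far as I know that is the convention ANTLR 3's CharStream follows (CharStream.EOF), but that is not verified here. The names LookaheadPredicate, la and lookingAt are hypothetical, and a String stands in for the real input stream.

```java
// Standalone sketch; a String stands in for the lexer's input stream.
// Assumption: lookahead past the end yields EOF (-1) rather than an exception.
final class LookaheadPredicate {
    static final int EOF = -1;

    // Character i positions ahead of pos (1-based), or EOF past the end.
    static int la(String input, int pos, int i) {
        int idx = pos + i - 1;
        return idx < input.length() ? input.charAt(idx) : EOF;
    }

    // Same test as the chained LA(1)..LA(6) comparisons, for any literal.
    static boolean lookingAt(String input, int pos, String literal) {
        for (int i = 1; i <= literal.length(); i++) {
            if (la(input, pos, i) != literal.charAt(i - 1)) {
                return false; // also reached when la(...) returns EOF
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(lookingAt("blah <HTML>", 5, "<HTML>"));
        System.out.println(lookingAt("...<HT>", 3, "<HTML>"));
    }
}
```

Under that assumption an input ending in ...<HT> simply fails the comparison at the fourth character, so the later lookahead positions are never even read.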
Regards
Harald M.
-------- Original Message --------
> Date: Sun, 25 Nov 2007 22:41:30 +1100
> From: "Steve Bennett" <stevagewp at gmail.com>
> To: "antlr-interest Interest" <antlr-interest at antlr.org>
> Subject: [antlr-interest] Non-disjoint tokens
> Given an input stream like this:
> blah <HTML> <NOTHING blah <HTMNOTL> ...
>
> I need to parse <HTML> as one token, and otherwise < as its own token.
> <HTMNOTL> would ideally be 3 tokens (<, HTMNOTL, >) but it can be one
> token.
>
> I'm finding that with rules like this:
>
> HTML: '<HTML>';
> LT: '<';
>
> The lexer falls into a hole when it hits a sequence of characters with
> the same first two characters as '<HTML>'. In the debugger, the input
> pane shows that the <HT have been completely swallowed.
>
> Is there a way to avoid this? A work around? It seems like a pretty
> common thing to want - would this not mean that you couldn't have a
> token which matches 'private' yet an identifier 'privort'?
>
> I've tried refactoring along the lines of:
>
> LT: '<';
> HTML: LT 'HTML>';
> LT_LITERAL: LT;
>
> But this still doesn't seem to work.
>
> Any suggestions? This is a complete show stopper if I can't find a way
> around this. It's really crucial that the lexer can recognise this tag
> so it can tokenise the following input differently...
>
> Steve