[antlr-interest] Non-disjoint tokens

Harald Mueller harald_m_mueller at gmx.de
Sun Nov 25 04:02:28 PST 2007


Hi -
I think this is one of the real FAQs with ANTLR ... ANTLR computes a minimal k to disambiguate the tokens AS PRESENT IN THE GRAMMAR. '<' and '<HTML>' can be distinguished by looking just one more character ahead, so k=2 is enough. Even with syntactic predicates, this behavior does not change (why? - probably Terence has explained that somewhere - it's a feature):

HTML: ('<HTML>')=> '<HTML>';
LT: '<';

does not help: the syntactic predicate is reduced to the same 'starts with "<H"' condition.
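
To make the failure mode concrete, here is a small Java sketch of a lexer that chooses a rule from a fixed two-character lookahead and then commits to it. It is a simplified model for illustration only, not the code ANTLR generates; the class and method names (FixedLookaheadSketch, predict, lex) are made up for this example.

```java
// Simplified model of a lexer that predicts a token rule from k=2 lookahead.
// Hypothetical names; this is NOT ANTLR's generated code.
final class FixedLookaheadSketch {

    /** Pick a rule by inspecting at most two characters starting at pos. */
    static String predict(String input, int pos) {
        if (input.charAt(pos) != '<') {
            return "OTHER";
        }
        boolean nextIsH = pos + 1 < input.length() && input.charAt(pos + 1) == 'H';
        return nextIsH ? "HTML" : "LT";
    }

    /** Once a rule is predicted it must match in full; there is no fallback. */
    static String lex(String input, int pos) {
        String rule = predict(input, pos);
        if (rule.equals("HTML") && !input.startsWith("<HTML>", pos)) {
            return "ERROR"; // committed to HTML on "<H", LT is never retried
        }
        return rule;
    }

    public static void main(String[] args) {
        System.out.println(lex("<HTML>", 0));    // HTML
        System.out.println(lex("<NOTHING", 0));  // LT
        System.out.println(lex("<HTMNOTL>", 0)); // ERROR
    }
}
```

In this model '<HTMNOTL>' is routed to the HTML rule because of its first two characters and then fails partway through, which is consistent with the "<HT" being swallowed in the debugger.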

What does help are semantic predicates (essentially, "arbitrary conditions"):

HTML: {input.LA(1)=='<' && 
       input.LA(2)=='H' && 
       input.LA(3)=='T' && 
       input.LA(4)=='M' && 
       input.LA(5)=='L' && 
       input.LA(6)=='>'
      }? => '<HTML>';
LT: '<';

If there is any other way to do this, I'd also like to know it!!

(and I'm not sure what happens at the end of the input with the above condition: say the end of the file is ...<HT> - would the access to input.LA(6) crash? - I think not, but I did not try it...).
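
The character-by-character condition can also be written as a loop, and the end-of-input question can be checked in isolation. The Java sketch below uses a stand-in la() method over a plain String and assumes that lookahead past the end yields an EOF marker of -1 rather than throwing, which is how the ANTLR 3 Java runtime's CharStream is understood to behave (not verified here). The names LookaheadHelperSketch, la and ahead are hypothetical.

```java
// Stand-in for lexer lookahead over a String; hypothetical names throughout.
// Assumption: reading past the end returns EOF (-1) instead of throwing.
final class LookaheadHelperSketch {
    static final int EOF = -1;

    /** 1-based lookahead from pos, mirroring the shape of input.LA(i). */
    static int la(String input, int pos, int i) {
        int index = pos + i - 1;
        return index < input.length() ? input.charAt(index) : EOF;
    }

    /** True if the characters starting at pos spell out literal exactly. */
    static boolean ahead(String input, int pos, String literal) {
        for (int i = 1; i <= literal.length(); i++) {
            if (la(input, pos, i) != literal.charAt(i - 1)) {
                return false; // also taken when la() returns EOF
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(ahead("<HTML> blah", 0, "<HTML>")); // true
        System.out.println(ahead("<HTMNOTL>", 0, "<HTML>"));   // false
        System.out.println(ahead("<HT", 0, "<HTML>"));         // false
        System.out.println(la("<HT", 0, 6));                   // -1
    }
}
```

Under that assumption a truncated tag at the end of the file simply makes the condition false, so the '<' rule gets its chance. In a grammar, a helper of this kind would presumably live in the lexer's members section and be called from the predicate in place of the six explicit comparisons.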

Regards
Harald M.

-------- Original Message --------
> Date: Sun, 25 Nov 2007 22:41:30 +1100
> From: "Steve Bennett" <stevagewp at gmail.com>
> To: "antlr-interest Interest" <antlr-interest at antlr.org>
> Subject: [antlr-interest] Non-disjoint tokens

> Given an input stream like this:
> blah <HTML> <NOTHING blah <HTMNOTL> ...
> 
> I need to parse <HTML> as one token, and otherwise < as its own token.
> <HTMNOTL> would ideally be 3 tokens (<, HTMNOTL, >) but it can be one
> token.
> 
> I'm finding that with rules like this:
> 
> HTML: '<HTML>';
> LT: '<';
> 
> The lexer falls into a hole when it hits a sequence of characters with
> the same first two characters as '<HTML>'. In the debugger, the input
> pane shows that the <HT have been completely swallowed.
> 
> Is there a way to avoid this? A work around? It seems like a pretty
> common thing to want - would this not mean that you couldn't have a
> token which matches 'private' yet an identifier 'privort'?
> 
> I've tried refactoring along the lines of:
> 
> LT: '<';
> HTML: LT 'HTML>';
> LT_LITERAL: LT;
> 
> But this still doesn't seem to work.
> 
> Any suggestions? This is a complete show stopper if I can't find a way
> around this. It's really crucial that the lexer can recognise this tag
> so it can tokenise the following input differently...
> 
> Steve
