[antlr-interest] Non-disjoint tokens

Sun Nov 25 05:14:19 PST 2007

This 'design feature' has been discussed before, in the context of 
recognizing a range of numbers. The resulting faq entry, added September 
09, will probably help:

http://www.antlr.org/wiki/pages/viewpage.action?pageId=3604497

I have a similar problem in my code. The grammar has two different uses 
of '{'. One is to delimit a list, the other is to delimit an opaque 
block of code in some other language. I've included some excerpts below 
to show a "real world" example. What I did was to use a toggle at the 
lexer level. The various tokens that look inside code blocks know who 
they are, and they toggle a switch in the lexer.

This lets me use

    -> { /* opaque code block with code in it */ }

as well as

    -> TOKENS { t1, t2, t3 } // transparent block with stuff I have to 
parse in it.

I think my handling of code blocks corresponds roughly with your desired 
handling of <HTML> versus '<'.  One thing to keep in mind is that the 
lexer generator wants to optimize performance, so
it's hard to write a token rule that uses the lexer's internal 
variables. If you use a simple pattern in the description of the token, 
the lexer's mTokens() function is going to have too much information in 
it. If you look in my NestedCodeBlock rule, you'll see there's a .* '}' 
at the end to scare the lexer generator into not pretending it knows how 
long my token will be. Of course, I return before I get there, but I had 
to do that in a "creative" way, too, since otherwise the 
never-to-be-sufficiently-accursed java compiler will whine about 
unreachable code. (Naturally it considers that to be an error, not a 
warning. Stupid gee-whiz language wankers!)

Anyway, enough chatter, here's some code:

// This is just to get strings back as useable literals. Unlike everyone 
else on the list, I like using literals in my grammars.
tokens
{
    LCB = '{';
    TOKENS = 'TOKENS';
}

@lexer::members
{
    /**
    Tells lexer how to handle {...} blocks. If true, the block
    is treated as a big, opaque, possibly recursive string. If false, the
    '{' token type is returned and lexing continues inside the block.
    */
    private
    Boolean    matchCodeBlocks = true;
}

// Later in the grammar, here's an example of matching a code block 
(remember, blocks are opaque so there isn't much Ican do with them):
atAction
    : action=('@header' | '@members') CODE_BLOCK
      {
    addAction($action.text, $CODE_BLOCK.text);
      }
    ;

// Later in the grammar, here's an example of matching the 
open-curly-bracket only (no code block).
assertTokensMatch returns [ List<Object> token_info]
    @init
    {
    $token_info = new ArrayList<Object>();
    }
    : 'TOKENS' '{'
      ( tokenSpec { $token_info.add($tokenSpec.info); }
      )+
      '}'
    ;

// In the lexer, here's the details:
CODE_BLOCK
    : {matchCodeBlocks}? => NestedCodeBlock
      { setText(getText().substring(1, getText().length() - 1)); }
    | '{'
      {
    $type = LCB;
    matchCodeBlocks = true;
      }
    ;
// This has to be below code_block so it can be unreachable.
// Otherwise the definition in the tokens{} block takes
// precedence, and CODE_BLOCK can't be seen.
LCB: '{' ;

// Finally, here's the "good stuff."
// This fragment recognizes an opaque block of source code
// surrounded by curlies, possibly nested, with Java style
// comments (either kind) and quoted literals embedded,
fragment
NestedCodeBlock
    : '{'
    {
    loopNCB:
    do
    {
        int next_char = input.LA(1);

        switch (next_char)
        {
        case '}':
        break loopNCB;

        case '"': mQlit_Double(); break;
        case '\'': mQlit_Single(); break;
        case '\u00AB': mQlit_Willies(); break;
        case '<':
        if (input.LA(2) == '<')
            mQlit_Angles();
        else
            matchAny();
        break;

        case '/':
        switch (input.LA(2))
        {
        case '/': mSingleLineComment(); break;
        case '*': mMultiLineComment(); break;
        default: matchAny(); break;
        }
        break;

        case '{': mNestedCodeBlock(); break;
        default: matchAny(); break;
        }

        if (failed) return;
    }
    while (true);
    match('}');
    if (!failed) return;
    }
    .* '}'
    ;

// Here's a rule that toggles the code-block-matching:
OPTIONS
    : 'options' { matchCodeBlocks = false; }
    ;

TOKENS
    : 'TOKENS' { matchCodeBlocks = false; }
    ;

Harald Mueller wrote:
> Hi -
> I think this is one of the real FAQs with ANTLR ... Antlr computes a minimal k to disambiguate the tokens AS PRESENT IN THE GRAMMAR. < and <HTML can be distinguished by just looking on more character ahead - so k=2 is enough. Even with syntactic predicates, this behavior does not change (why? - probably Terence has explained that somewhere - it's a feature):
>
> HTML: ('<HTML>')? => '<HTML>';
> LT: '<';
>
> does not help, therefore: The syntactic predicate is abbreviated to the same 'starts with "<H"' condition.
>
> What does help are semantic predicates (essentially, "arbitrary conditions"):
>
> HTML: {input.LA(1)=='<' && 
>        input.LA(2)=='H' && 
>        input.LA(3)=='T' && 
>        input.LA(4)=='M' && 
>        input.LA(5)=='L' && 
>        input.LA(6)=='>'
>       }? => '<HTML>';
> LT: '<';
>
> If there is any other way to do this, I'd also like to know it!!
>
> (and I'm not sure what happends at the end of the input with the above condition: say the end of the file is ...<HT> - would the access to input.LA(6) crash? - I think not, but I did not try it...).
>
> Regards
> Harald M.
>
> -------- Original-Nachricht --------
>   
>> Datum: Sun, 25 Nov 2007 22:41:30 +1100
>> Von: "Steve Bennett" <stevagewp at gmail.com>
>> An: "antlr-interest Interest" <antlr-interest at antlr.org>
>> Betreff: [antlr-interest] Non-disjoint tokens
>>     
>
>   
>> Given an input stream like this:
>> blah <HTML> <NOTHING blah <HTMNOTL> ...
>>
>> I need to parse <HTML> as one token, and otherwise < as its own token.
>> <HTMNOTL> would ideally be 3 tokens (<, HTMNOTL, >) but it can be one
>> token.
>>
>> I'm finding that the with rules like this:
>>
>> HTML: '<HTML>';
>> LT: '<';
>>
>> The lexer falls into a hole when it hits a sequence of characters with
>> the same first two characters as '<HTML>'. In the debugger, the input
>> pane shows that the <HT have been completely swallowed.
>>
>> Is there a way to avoid this? A work around? It seems like a pretty
>> common thing to want - would this not mean that you couldn't have a
>> token which matches 'private' yet an identifier 'privort'?
>>
>> I've tried refactoring along the lines of:
>>
>> LT: '<';
>> HTML: LT 'HTML>;
>> LT_LITERAL: LT;
>>
>> But this still doesn't seem to work.
>>
>> Any suggestions? This is a complete show stopper if i can't find a way
>> around this. It's really crucial that the lexer can recognise this tag
>> so it can tokenise the following input differently...
>>
>> Steve
>>     
>
>