[antlr-interest] Multiple lexer tokens per rule

Ken Williams ken.williams at thomsonreuters.com
Fri Jun 4 09:02:14 PDT 2010




On 6/3/10 5:36 PM, "Junkman" <j at junkwallah.org> wrote:

> Try this to get you started: [...]

Thanks, that's a good start.  There's still some bookkeeping I'm not
getting, though.  I seem to have to queue them in the reverse order that I
want them out - in the Lexer I do 'queueUp(tok1); emit(tok2);' and then in
nextToken() I return the queued token first.  But then for some reason I get
the tokens in the sequence 'tok2 tok1'.

It seems like maybe somewhere in the generated code, something's accessing
tokens directly in the 'state' member, or something's getting confused by
using 'index', or something like that.
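One possible explanation, sketched as a toy model with no ANTLR dependency.
It assumes that the base-class nextToken() simply returns whatever the last
emit() call stored, so the emitted token leaves before anything sitting in the
queue. ToyLexer, AsPosted and QueueEverything are hypothetical names for
illustration only; the second variant is a possible rearrangement and has not
been tried against the real generated lexer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class QueueOrderSketch {
    /** Stand-in for the lexer base class (an assumption, not ANTLR code). */
    static class ToyLexer {
        protected String current;        // plays the role of the 'state' token
        private boolean matched = false; // the toy input holds one match: "23,"

        protected void emit(String tok) { current = tok; }

        /** Rule action; subclasses split "23," into DIGITS and PUNC. */
        protected void action() { emit("NUMERIC"); }

        public String nextToken() {
            if (matched) return "EOF";
            matched = true;
            current = null;
            action();
            return current; // only the most recently emitted token is returned
        }
    }

    /** Mirrors the approach in this post: queue tok1, emit tok2. */
    static class AsPosted extends ToyLexer {
        private final Queue<String> queue = new ArrayDeque<>();
        @Override protected void action() { queue.add("DIGITS"); emit("PUNC"); }
        @Override public String nextToken() {
            return queue.isEmpty() ? super.nextToken() : queue.poll();
        }
    }

    /** Alternative: every emit goes through the queue; match first, then poll. */
    static class QueueEverything extends ToyLexer {
        private final Queue<String> queue = new ArrayDeque<>();
        @Override protected void emit(String tok) { current = tok; queue.add(tok); }
        @Override protected void action() { emit("DIGITS"); emit("PUNC"); }
        @Override public String nextToken() {
            if (queue.isEmpty()) {
                String t = super.nextToken();
                if (queue.isEmpty()) return t; // nothing was emitted (EOF)
            }
            return queue.poll();
        }
    }

    /** Collects token names until the toy lexer reports EOF. */
    static List<String> drain(ToyLexer lexer) {
        List<String> out = new ArrayList<>();
        for (String t = lexer.nextToken(); !t.equals("EOF"); t = lexer.nextToken()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drain(new AsPosted()));        // [PUNC, DIGITS]
        System.out.println(drain(new QueueEverything())); // [DIGITS, PUNC]
    }
}
```

In the first variant the queue is still empty when nextToken() is first called,
so the match runs, the emitted PUNC is returned immediately, and the queued
DIGITS only comes out on the following call.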

My complete [toy] grammar is below.  When I use it, I get the following
results:

23      -> DIGITS        *good*
23,     -> PUNC DIGITS   *bad*
23,450  -> NUMERIC       *good*
23,450, -> PUNC NUMERIC  *bad*
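For comparison, here is a minimal standalone sketch of the order I am after.
It reuses the regular expressions from fixNum() in the grammar below but
returns token type names as strings instead of emitting ANTLR tokens; the
class name IntendedSplit and the split() helper are hypothetical and exist
only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IntendedSplit {
    private static final Pattern TRAILING_PUNC = Pattern.compile("[^0-9]+$");

    /** Returns token type names in the order the parser is meant to see them. */
    static List<String> split(String text) {
        List<String> types = new ArrayList<>();
        // String.matches() anchors the whole string, so no ^ or $ is needed.
        if (text.matches("[0-9]+")) {
            types.add("DIGITS");
            return types;
        }
        if (text.matches(".*[0-9]")) {
            types.add("NUMERIC");
            return types;
        }
        Matcher m = TRAILING_PUNC.matcher(text);
        if (!m.find()) {
            throw new IllegalArgumentException(
                "Can't figure out numeric token '" + text + "'");
        }
        // Everything before the trailing punctuation is the numeric part.
        String prefix = text.substring(0, m.start());
        types.add(prefix.matches("[0-9]+") ? "DIGITS" : "NUMERIC");
        types.add("PUNC");
        return types;
    }

    public static void main(String[] args) {
        System.out.println(split("23"));      // [DIGITS]
        System.out.println(split("23,"));     // [DIGITS, PUNC]
        System.out.println(split("23,450"));  // [NUMERIC]
        System.out.println(split("23,450,")); // [NUMERIC, PUNC]
    }
}
```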


----------------------------------------------------
grammar testg;

options { backtrack=true; memoize=true; output=AST; }

tokens { PUNC; DIGITS; }

@lexer::header{ 
    package com.tr.research.cites;
    import java.util.regex.Pattern;
    import java.util.regex.Matcher;
} 
@parser::header{ package com.tr.research.cites; }

@lexer::members {
    protected Pattern trailingPunc = Pattern.compile("[^0-9]+$");
    protected void fixNum(String text) {
        if (text.matches("^[0-9]+$")) {
            emit(new CommonToken(DIGITS, text));
            return;
        }
        if (text.matches("^.*[0-9]+$")) {
            emit(new CommonToken(NUMERIC, text));
            return;
        }
        
        Matcher m = trailingPunc.matcher(text);
        if (!m.find())
            throw new RuntimeException(
                "Can't figure out numeric token '" + text + "'");
        
        String prefix = text.substring(0, m.start());
        String suffix = text.substring(m.start());
        
        queueUp(new CommonToken(
            prefix.matches("^[0-9]+$") ? DIGITS : NUMERIC, prefix));
        emit(new CommonToken( PUNC, suffix ));
    }

    // Queue to hold additional tokens
    private java.util.Queue<Token> tokenQueue =
        new java.util.LinkedList<Token>();

    // Include queue in reset().
    public void reset() {
        super.reset();
        tokenQueue.clear();
    }

    // Queued tokens are returned before matching a new token.
    public Token nextToken() {
        return tokenQueue.isEmpty() ? super.nextToken() : tokenQueue.poll();
    }
    
    public void queueUp(Token t) {
        tokenQueue.add(t);
    }
}

cite    :    token+ EOF ;
token    : DIGITS | NUMERIC | PUNC ;
WS    :    ( ' ' | '\t'| '\f' | '\n' | '\r' ) {skip();} ;

fragment DIGIT    : '0'..'9' ;
NUMERIC    :    DIGIT (DIGIT | '-' | ',' | '.')*  {fixNum($text);} ;
----------------------------------------------------


-- 
Ken Williams
Sr. Research Scientist
Thomson Reuters
Phone: 651-848-7712
ken.williams at thomsonreuters.com



