[antlr-interest] Multiple lexer tokens per rule

Fri Jun 4 12:53:12 PDT 2010

You need a collection that hands back entries in the order they were added (first in, first out), such as a java.util.Queue:

http://java.sun.com/docs/books/tutorial/collections/interfaces/queue.html
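
A minimal sketch of the first-in, first-out idea, written without any ANTLR classes so it runs on its own. The class and method names (FifoTokenDemo, matchOne, drain) and the "NUM:"/"PUNC:" labels are made up for illustration and are not part of your grammar. It shows that if every token a rule produces goes into one queue, and nextToken() only ever hands out the head of that queue, the tokens leave in the order they were added. In a real lexer this would presumably mean routing the emitted token through the same queue instead of returning it separately, but that is an assumption I have not tested against your grammar.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Standalone illustration only: no ANTLR classes, all names are hypothetical.
public class FifoTokenDemo {
    // FIFO buffer: poll() returns entries in the order add() received them.
    private final Queue<String> pending = new ArrayDeque<>();
    private final String[] inputs;
    private int pos = 0;

    FifoTokenDemo(String... inputs) {
        this.inputs = inputs;
    }

    // Stand-in for a lexer rule action: splits trailing non-digits from the
    // text and buffers both pieces in the order they should come out.
    private void matchOne(String text) {
        int end = text.length();
        while (end > 0 && !Character.isDigit(text.charAt(end - 1))) {
            end--;
        }
        if (end > 0) {
            pending.add("NUM:" + text.substring(0, end));
        }
        if (end < text.length()) {
            pending.add("PUNC:" + text.substring(end));
        }
    }

    // Every token leaves through the queue; returns null when input is exhausted.
    String nextToken() {
        while (pending.isEmpty() && pos < inputs.length) {
            matchOne(inputs[pos++]);
        }
        return pending.poll();
    }

    // Convenience helper: collect all tokens for the given chunks of input.
    static List<String> drain(String... inputs) {
        FifoTokenDemo lexer = new FifoTokenDemo(inputs);
        List<String> out = new ArrayList<>();
        for (String t = lexer.nextToken(); t != null; t = lexer.nextToken()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drain("23,", "450"));
    }
}
```

The design choice here is that there is only one path out of the lexer: nothing is returned directly from the rule, so there is no second place where ordering can get swapped.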

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Ken Williams
> Sent: Friday, June 04, 2010 9:02 AM
> To: Junkman
> Cc: ANTLR list
> Subject: Re: [antlr-interest] Multiple lexer tokens per rule
> 
> 
> 
> 
> On 6/3/10 5:36 PM, "Junkman" <j at junkwallah.org> wrote:
> 
> > Try this to get you started: [...]
> 
> Thanks, that's a good start.  There's still some bookkeeping I'm not
> getting, though.  I seem to have to queue them in the reverse order that I
> want them out - in the Lexer I do 'queueUp(tok1); emit(tok2);' and then in
> nextToken() I return the queued token first.  But then for some reason I get
> the tokens in the sequence 'tok2 tok1'.
> 
> It seems like maybe somewhere in the generated code, something's accessing
> tokens directly in the 'state' member, or something's getting confused by
> using 'index', or something like that.
> 
> My complete [toy] grammar is below.  When I use it, I get the following
> results:
> 
> 23      -> DIGITS        *good*
> 23,     -> PUNC DIGITS   *bad*
> 23,450  -> NUMERIC       *good*
> 23,450, -> PUNC NUMERIC  *bad*
> 
> 
> ----------------------------------------------------
> grammar testg;
> 
> options { backtrack=true; memoize=true; output=AST; }
> 
> tokens { PUNC; DIGITS; }
> 
> @lexer::header{
>     package com.tr.research.cites;
>     import java.util.regex.Pattern;
>     import java.util.regex.Matcher;
> }
> @parser::header{ package com.tr.research.cites; }
> 
> @lexer::members {
>     protected Pattern trailingPunc = Pattern.compile("[^0-9]+$");
>     protected void fixNum(String text) {
>         if (text.matches("^[0-9]+$")) {
>             emit(new CommonToken(DIGITS, text));
>             return;
>         }
>         if (text.matches("^.*[0-9]+$")) {
>             emit(new CommonToken(NUMERIC, text));
>             return;
>         }
> 
>         Matcher m = trailingPunc.matcher(text);
>         if (!m.find())
>             throw new RuntimeException(
>                 "Can't figure out numeric token '" + text + "'");
> 
>         String prefix = text.substring(0, m.start());
>         String suffix = text.substring(m.start());
> 
>         queueUp(new CommonToken(
>             prefix.matches("^[0-9]+$") ? DIGITS : NUMERIC, prefix));
>         emit(new CommonToken( PUNC, suffix ));
>     }
> 
>     // Queue to hold additional tokens
>     private java.util.Queue<Token> tokenQueue =
>         new java.util.LinkedList<Token>();
> 
>     // Include queue in reset().
>     public void reset() {
>         super.reset();
>         tokenQueue.clear();
>     }
> 
>     // Queued tokens are returned before matching a new token.
>     public Token nextToken() {
>         return tokenQueue.isEmpty() ? super.nextToken()
>                                     : tokenQueue.poll();
>     }
> 
>     public void queueUp(Token t) {
>         tokenQueue.add(t);
>     }
> }
> 
> cite    :    token+ EOF ;
> token    : DIGITS | NUMERIC | PUNC ;
> WS    :    ( ' ' | '\t'| '\f' | '\n' | '\r' ) {skip();} ;
> 
> fragment DIGIT    : '0'..'9' ;
> NUMERIC    :    DIGIT (DIGIT | '-' | ',' | '.')*  {fixNum($text);} ;
> ----------------------------------------------------
> 
> 
> --
> Ken Williams
> Sr. Research Scientist
> Thomson Reuters
> Phone: 651-848-7712
> ken.williams at thomsonreuters.com
> 
> 
> 