[antlr-interest] Multiple lexer tokens per rule
Jim Idle
jimi at temporal-wave.com
Fri Jun 4 12:53:12 PDT 2010
You need to use a collection that gives out the entries in the order they were added:
http://java.sun.com/docs/books/tutorial/collections/interfaces/queue.html
Jim
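
A possible reading of the 'tok2 tok1' ordering in the quoted grammar: the queue is still empty when nextToken() is called, so super.nextToken() runs the rule, the rule queues tok1 and emits tok2, and tok2 is returned first. One way around that is to send every token through the same FIFO queue and to poll it only after the rule has matched. Below is a minimal sketch of that idea in plain Java, with no ANTLR dependency. MiniLexer, matchRule and tokenize are hypothetical names, and whitespace splitting stands in for real lexing. In an ANTLR 3 lexer the equivalent would presumably be to override emit(Token) so it adds to the queue, and to have nextToken() call super.nextToken() before polling; that mapping is an assumption and is not tested against generated code here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class TokenQueueSketch {

    /** Hypothetical stand-in for a lexer whose single rule may yield two tokens. */
    static class MiniLexer {
        private final String[] chunks;
        private int pos = 0;
        private final Queue<String> pending = new ArrayDeque<>();

        MiniLexer(String input) {
            String trimmed = input.trim();
            chunks = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        /** Every token goes through the queue, including the last one of a rule. */
        void emit(String token) {
            pending.add(token);
        }

        /** Plays the role of the NUMERIC rule action: split off trailing punctuation. */
        private void matchRule() {
            String text = chunks[pos++];
            int end = text.length();
            while (end > 0 && !Character.isDigit(text.charAt(end - 1))) {
                end--;
            }
            String prefix = text.substring(0, end);
            String suffix = text.substring(end);
            if (!prefix.isEmpty()) {
                String type = prefix.matches("[0-9]+") ? "DIGITS" : "NUMERIC";
                emit(type + "(" + prefix + ")");
            }
            if (!suffix.isEmpty()) {
                emit("PUNC(" + suffix + ")");
            }
        }

        /** Match first, then drain the queue in FIFO order. */
        String nextToken() {
            while (pending.isEmpty() && pos < chunks.length) {
                matchRule();
            }
            return pending.isEmpty() ? "EOF" : pending.poll();
        }
    }

    /** Collects all tokens up to, but not including, EOF. */
    static List<String> tokenize(String input) {
        MiniLexer lexer = new MiniLexer(input);
        List<String> out = new ArrayList<>();
        for (String t = lexer.nextToken(); !t.equals("EOF"); t = lexer.nextToken()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("23,"));
        System.out.println(tokenize("23,450,"));
    }
}
```

Because emit() only appends and nextToken() only polls after matching, tokens come out in the order the rule produced them, so the prefix token precedes the trailing PUNC.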
> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Ken Williams
> Sent: Friday, June 04, 2010 9:02 AM
> To: Junkman
> Cc: ANTLR list
> Subject: Re: [antlr-interest] Multiple lexer tokens per rule
>
>
>
>
> On 6/3/10 5:36 PM, "Junkman" <j at junkwallah.org> wrote:
>
> > Try this to get you started: [...]
>
> Thanks, that's a good start. There's still some bookkeeping I'm not
> getting, though. I seem to have to queue them in the reverse order that
> I want them out - in the Lexer I do 'queueUp(tok1); emit(tok2);' and
> then in nextToken() I return the queued token first. But then for some
> reason I get the tokens in the sequence 'tok2 tok1'.
>
> It seems like maybe somewhere in the generated code, something's
> accessing tokens directly in the 'state' member, or something's getting
> confused by using 'index', or something like that.
>
> My complete [toy] grammar is below. When I use it, I get the following
> results:
>
> 23 -> DIGITS *good*
> 23, -> PUNC DIGITS *bad*
> 23,450 -> NUMERIC *good*
> 23,450, -> PUNC NUMERIC *bad*
>
>
> ----------------------------------------------------
> grammar testg;
>
> options { backtrack=true; memoize=true; output=AST; }
>
> tokens { PUNC; DIGITS; }
>
> @lexer::header{
> package com.tr.research.cites;
> import java.util.regex.Pattern;
> import java.util.regex.Matcher;
> }
> @parser::header{ package com.tr.research.cites; }
>
> @lexer::members {
>     protected Pattern trailingPunc = Pattern.compile("[^0-9]+$");
>
>     protected void fixNum(String text) {
>         if (text.matches("^[0-9]+$")) {
>             emit(new CommonToken(DIGITS, text));
>             return;
>         }
>         if (text.matches("^.*[0-9]+$")) {
>             emit(new CommonToken(NUMERIC, text));
>             return;
>         }
>
>         Matcher m = trailingPunc.matcher(text);
>         if (!m.find())
>             throw new RuntimeException("Can't figure out numeric token '" + text + "'");
>
>         String prefix = text.substring(0, m.start());
>         String suffix = text.substring(m.start());
>
>         queueUp(new CommonToken(prefix.matches("^[0-9]+$") ? DIGITS : NUMERIC, prefix));
>         emit(new CommonToken(PUNC, suffix));
>     }
>
>     // Queue to hold additional tokens
>     private java.util.Queue<Token> tokenQueue = new java.util.LinkedList<Token>();
>
>     // Include queue in reset().
>     public void reset() {
>         super.reset();
>         tokenQueue.clear();
>     }
>
>     // Queued tokens are returned before matching a new token.
>     public Token nextToken() {
>         return tokenQueue.isEmpty() ? super.nextToken() : tokenQueue.poll();
>     }
>
>     public void queueUp(Token t) {
>         tokenQueue.add(t);
>     }
> }
>
> cite : token+ EOF ;
> token : DIGITS | NUMERIC | PUNC ;
> WS : ( ' ' | '\t'| '\f' | '\n' | '\r' ) {skip();} ;
>
> fragment DIGIT : '0'..'9' ;
> NUMERIC : DIGIT (DIGIT | '-' | ',' | '.')* {fixNum($text);} ;
> ----------------------------------------------------
>
>
> --
> Ken Williams
> Sr. Research Scientist
> Thomson Reuters
> Phone: 651-848-7712
> ken.williams at thomsonreuters.com
>
>
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address