[antlr-interest] Multiple lexer tokens per rule
Ken Williams
ken.williams at thomsonreuters.com
Fri Jun 4 09:02:14 PDT 2010
On 6/3/10 5:36 PM, "Junkman" <j at junkwallah.org> wrote:
> Try this to get you started: [...]
Thanks, that's a good start. There's still some bookkeeping I'm not
getting, though. I seem to have to queue the tokens in the reverse of the
order I want them to come out: in the lexer I do 'queueUp(tok1); emit(tok2);'
and then in nextToken() I return the queued token first, yet for some reason
I get the tokens in the sequence 'tok2 tok1'.
It seems like maybe somewhere in the generated code, something's accessing
tokens directly in the 'state' member, or something's getting confused by
using 'index', or something like that.
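One way to reason about the ordering described above is a small standalone
model that does not use ANTLR at all. It is only a sketch under a stated
assumption: that the base nextToken() runs the rule action and then returns
whatever was last passed to emit(). The names QueueOrderSketch, BaseLexer,
ModelLexer and matchRule are hypothetical, and tokens are modelled as plain
strings. Under that assumption, queueing tok1 and emitting tok2 yields
'tok2 tok1', because the queue is still empty when the first nextToken() call
begins.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

// Standalone model of the emit/queue interaction; it does not use ANTLR.
class QueueOrderSketch {

    // Stand-in for the base lexer. ASSUMPTION: nextToken() returns the token
    // most recently passed to emit() while the rule action was running.
    static class BaseLexer {
        protected String emitted;   // models a single "current token" slot
        private boolean matched;    // the toy input holds one lexeme, "23,"

        void emit(String tok) { emitted = tok; }

        // Models the rule action; subclasses decide what to emit or queue.
        void matchRule() { }

        String nextToken() {
            if (matched) return "EOF";
            matched = true;
            emitted = null;
            matchRule();
            return emitted;
        }
    }

    // Mirrors the queue-based members: queued tokens are returned before
    // a new token is matched.
    static class ModelLexer extends BaseLexer {
        private final Queue<String> tokenQueue = new LinkedList<>();
        private final boolean queuePrefix;

        ModelLexer(boolean queuePrefix) { this.queuePrefix = queuePrefix; }

        @Override void matchRule() {
            if (queuePrefix) {
                tokenQueue.add("DIGITS"); // queueUp(tok1)
                emit("PUNC");             // emit(tok2)
            } else {
                emit("DIGITS");           // emit(tok1)
                tokenQueue.add("PUNC");   // queueUp(tok2)
            }
        }

        @Override String nextToken() {
            return tokenQueue.isEmpty() ? super.nextToken() : tokenQueue.poll();
        }
    }

    // Pulls tokens until the model reports EOF.
    static List<String> drain(BaseLexer lexer) {
        List<String> out = new ArrayList<>();
        for (String t = lexer.nextToken(); !t.equals("EOF"); t = lexer.nextToken()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drain(new ModelLexer(true)));  // prints [PUNC, DIGITS]
        System.out.println(drain(new ModelLexer(false))); // prints [DIGITS, PUNC]
    }
}
```

In this model, emitting the first token and queueing the second gives the
source order. Whether the real generated lexer behaves the same way depends
on the assumption above holding for the ANTLR runtime in use.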
My complete [toy] grammar is below. When I use it, I get the following
results:
23       -> DIGITS        *good*
23,      -> PUNC DIGITS   *bad*
23,450   -> NUMERIC       *good*
23,450,  -> PUNC NUMERIC  *bad*
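For comparison, the token sequences intended for these four inputs can be
sketched as a standalone Java method. It follows the same regular-expression
checks as fixNum in the grammar below but returns token type names as strings
instead of emitting ANTLR tokens. The class SplitSketch and the method
expectedTypes are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone sketch of the intended splitting; it does not use ANTLR.
class SplitSketch {
    private static final Pattern TRAILING_PUNC = Pattern.compile("[^0-9]+$");

    // Returns the intended token type names, in source order.
    static List<String> expectedTypes(String text) {
        List<String> types = new ArrayList<>();
        if (text.matches("[0-9]+")) {       // digits only
            types.add("DIGITS");
            return types;
        }
        if (text.matches(".*[0-9]")) {      // ends in a digit: one NUMERIC
            types.add("NUMERIC");
            return types;
        }
        Matcher m = TRAILING_PUNC.matcher(text);
        if (!m.find()) {
            throw new IllegalArgumentException("Can't figure out numeric token '" + text + "'");
        }
        // Split off the trailing punctuation as its own token.
        String prefix = text.substring(0, m.start());
        types.add(prefix.matches("[0-9]+") ? "DIGITS" : "NUMERIC");
        types.add("PUNC");
        return types;
    }

    public static void main(String[] args) {
        System.out.println(expectedTypes("23"));      // prints [DIGITS]
        System.out.println(expectedTypes("23,"));     // prints [DIGITS, PUNC]
        System.out.println(expectedTypes("23,450"));  // prints [NUMERIC]
        System.out.println(expectedTypes("23,450,")); // prints [NUMERIC, PUNC]
    }
}
```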
----------------------------------------------------
grammar testg;
options { backtrack=true; memoize=true; output=AST; }
tokens { PUNC; DIGITS; }
@lexer::header{
  package com.tr.research.cites;
  import java.util.regex.Pattern;
  import java.util.regex.Matcher;
}
@parser::header{ package com.tr.research.cites; }
@lexer::members {
  protected Pattern trailingPunc = Pattern.compile("[^0-9]+$");

  protected void fixNum(String text) {
    if (text.matches("^[0-9]+$")) {
      emit(new CommonToken(DIGITS, text));
      return;
    }
    if (text.matches("^.*[0-9]+$")) {
      emit(new CommonToken(NUMERIC, text));
      return;
    }
    Matcher m = trailingPunc.matcher(text);
    if (!m.find())
      throw new RuntimeException("Can't figure out numeric token '" + text + "'");
    String prefix = text.substring(0, m.start());
    String suffix = text.substring(m.start());
    queueUp(new CommonToken(prefix.matches("^[0-9]+$") ? DIGITS : NUMERIC, prefix));
    emit(new CommonToken(PUNC, suffix));
  }

  // Queue to hold additional tokens
  private java.util.Queue<Token> tokenQueue = new java.util.LinkedList<Token>();

  // Include queue in reset().
  public void reset() {
    super.reset();
    tokenQueue.clear();
  }

  // Queued tokens are returned before matching a new token.
  public Token nextToken() {
    return tokenQueue.isEmpty() ? super.nextToken() : tokenQueue.poll();
  }

  public void queueUp(Token t) {
    tokenQueue.add(t);
  }
}
cite : token+ EOF ;
token : DIGITS | NUMERIC | PUNC ;
WS : ( ' ' | '\t'| '\f' | '\n' | '\r' ) {skip();} ;
fragment DIGIT : '0'..'9' ;
NUMERIC : DIGIT (DIGIT | '-' | ',' | '.')* {fixNum($text);} ;
----------------------------------------------------
--
Ken Williams
Sr. Research Scientist
Thomson Reuters
Phone: 651-848-7712
ken.williams at thomsonreuters.com