[antlr-interest] setting, altering text in lexer rules
Terence Parr
parrt at cs.usfca.edu
Mon Jun 12 11:34:49 PDT 2006
Hi,
I'm finalizing how lexers emit tokens and possibly alter the text for
tokens. I thought I'd run the ideas past you to make sure I'm not
crazy (about this anyway!) :)
There are a few main things I want to be able to do:
1. emit more than one token per lexer.nextToken() invocation (needed
for python etc...)
2. allow you to modify the text of a text either with setText() or
text=foo;
3. allow ! on lexical rule elements to delete text elements from
token easily
4. I need the common case to be fast and I don't want to clutter up
the simple Lexer object
Re #1: Currently, there is an emit(Token) method that can be
overridden to have a queue; nextToken would then be overridden to
pull from the queue. When the user emits manually, the token rule
must avoid doing so. Sounds easy but what happens when one rule
calls another?
FLOAT : INT '.' INT ;
INT : '0'..'9'+ ;
How does INT know not to emit a token? Currently I check "if token!
=null emit" but that emits two tokens. So I think I need to add a
lexical rule nesting level var and a check "if token!=null &&
level==0 emit". That adds extra work and each rule would need to
have inc/dec code for the level. This infrastructure will still work
for multiple token emits as emit can just set token to the last token
emitted to prevent auto emitting.
Re #2: you need to be able to set the text of a token. There is the
notion of a single token being matched and so it's ok to have the
idea of "current text"; i.e., an instance var. A setText method can
alter a String and auto token emit can use this instead of indexes
into the charbuf if nonnull. Easy until you realize you might want
to modify it as a buffer of char. Then it must be a StringBuffer not
just a string...may be hard to avoid creating a buf for each
nextToken invocation. Still it's pretty straightforward.
Re #3. Now, we need automatic modification of a text buffer to build
up a string. Users should be careful doing manual alteration of a
buffer and using ! on elements. The first ! that is encountered even
if deeply nested in multiple lexer rules must create a buffer and
copy all text from start of token (in the char buffer) to the new
StringBuffer. At level==0, we need to add whatever other text
remains and we need to emit a token using either the modified buffer
or just pointing into the char buffer (for unmodified text).
Re #4. We end up with a complicated chunk of code at the end of each
lexer rule, which costs time/space:
// token rule postamble
level--;
if ( token==null && level==0 ) {
if ( _buf!=null ) {
_buf.append(input.substring(_start, getCharIndex()-1));
}
emit(type,line,charPosition,channel,start,getCharIndex
()-1);
}
To avoid creating a new StringBuffer each time, I could reuse and set
the length to 0 at the start of each nextToken. Otherwise I need to do
if ( _buf==null ) { _buf = new StringBuffer(); }
at each char reference or other element with a bang. So, i guess I
need "if ( _buf.length()>0 )" check not _buf==null check. I hate to
slow down the token matching speed for every token to set the length
(the StringBuffer.setLength() method actually requires a fair bit of
code).
Anybody wanna comment on the implementation?
Ter
More information about the antlr-interest
mailing list