[antlr-interest] Matching ellipsis

Wed Nov 29 06:05:03 PST 2006

Hi Ter,

    Thanks very much for your answer. I did manage to solve it. Let me 
tell you what I did and also pose an additional problem I have:

The rule I used in the end (I managed to find a previous post that I 
could shamelessly copy and modify) is:

T_FLOAT_LITERAL:
        ( i=T_INTEGER_LITERAL d=T_DOT
            ( ( T_INTEGER_LITERAL? )
            | ( r=T_DOT
                    ( ( /*empty*/
                            { i.setType(T_INTEGER_LITERAL);
                              emit(i);
                              d.setType(T_ELLIPSIS);
                              d.setText("..");
                              emit(d); } )
                    ) )
            | ( r=T_ELLIPSIS
                    ( ( /*empty*/
                            { i.setType(T_INTEGER_LITERAL);
                              emit(i);
                              d.setType(T_ELLIPSIS);
                              d.setText("."+r.getText());
                              emit(d); } )
                    ) )
            ) )
        |   ( T_DOT T_INTEGER_LITERAL )
    ;
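
The splitting this rule is meant to perform can be sketched outside ANTLR 
as a small hand-written scanner in plain Java. This is only an 
illustration of the intended tokenization, not generated lexer code: the 
class name EllipsisSketch, the method names and the printed token labels 
are all made up for the example, and whitespace is simply skipped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from ANTLR.
class EllipsisSketch {
    // Splits the input into INT, FLOAT, DOT and ELLIPSIS labels.
    static List<String> scan(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isDigit(c)) {
                int start = i;
                i = skipRun(s, i, true);              // leading digits
                int dotsEnd = skipRun(s, i, false);   // look ahead over dots
                if (dotsEnd - i == 1) {
                    // Exactly one dot: a float, with optional fraction digits.
                    i = skipRun(s, dotsEnd, true);
                    out.add("FLOAT(" + s.substring(start, i) + ")");
                } else {
                    // No dot, or two or more dots: emit the integer alone and
                    // let the next iteration pick up the run of dots.
                    out.add("INT(" + s.substring(start, i) + ")");
                }
            } else if (c == '.') {
                int start = i;
                int dotsEnd = skipRun(s, i, false);
                if (dotsEnd - i >= 2) {
                    i = dotsEnd;
                    out.add("ELLIPSIS(" + s.substring(start, i) + ")");
                } else if (dotsEnd < s.length() && Character.isDigit(s.charAt(dotsEnd))) {
                    i = skipRun(s, dotsEnd, true);    // ".6" style float
                    out.add("FLOAT(" + s.substring(start, i) + ")");
                } else {
                    i = dotsEnd;
                    out.add("DOT(.)");
                }
            } else {
                i++; // whitespace and anything else is ignored in this sketch
            }
        }
        return out;
    }

    // Advances past a run of digits (digits == true) or a run of dots.
    private static int skipRun(String s, int i, boolean digits) {
        while (i < s.length()
                && (digits ? Character.isDigit(s.charAt(i)) : s.charAt(i) == '.')) {
            i++;
        }
        return i;
    }
}
```

With this sketch, EllipsisSketch.scan("1..2") yields INT(1), ELLIPSIS(..), 
INT(2), which is the split the lexer rule above is trying to achieve by 
emitting two tokens from one match.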

However this needs some additional code to be added:

@lexer::members {
    // maximum number of emit() calls inside any rule action
    private static final int MAX_EMIT_COUNT = 2;

    // buffer (queue) to hold the emit()'d tokens
    private Token [] myToken = new Token[MAX_EMIT_COUNT];
    private int add_idx = 0; // deposit emit token here
    private int next_idx = 0; // next token to be delivered to parser

    public void emit(Token t) {
        token = t; // set flag to avoid automatic emit() at end of rule.
        myToken[add_idx++] = t;
    }

    public Token nextToken() {
        while (true) {
            if ( add_idx == next_idx ) {
                token = null;
                add_idx = 0;
                next_idx = 0;
                tokenStartCharIndex = getCharIndex();
                if ( input.LA(1)==CharStream.EOF ) {
                    return Token.EOF_TOKEN;
                }
                try {
                    mTokens();
                }
                catch (RecognitionException re) {
                    reportError(re);
                    recover(re);
                }
            } else {
                Token result = myToken[next_idx++];
                if ( result != Token.SKIP_TOKEN ) { // discard SKIP tokens
                    return result;
                }
            }
        }
    }
}

At least I believe that if I don't add this code, only the last token is 
actually emitted, which matches what was said in the post I referred to 
(the code above was in fact copied from that post).
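
The reason a buffer is needed can be shown with a toy model in plain 
Java. It assumes, as suspected above, that a lexer with a single token 
slot would keep only the last emit() of a rule; the sketch replaces that 
slot with a queue that nextToken() drains before matching again. 
EmitQueueSketch and its members are hypothetical names, and the "rule" is 
a stand-in that just splits strings such as "1..".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for a lexer; none of these names come from ANTLR.
class EmitQueueSketch {
    private final Queue<String> pending = new ArrayDeque<>();
    private final String[] inputs;
    private int pos = 0;

    EmitQueueSketch(String... inputs) {
        this.inputs = inputs;
    }

    // Appends to the queue instead of overwriting a single token slot.
    void emit(String token) {
        pending.add(token);
    }

    // Stand-in for one lexer rule: input like "1.." emits two tokens.
    private void matchOneRule() {
        String text = inputs[pos++];
        int dots = text.indexOf("..");
        if (dots > 0) {
            emit("INT(" + text.substring(0, dots) + ")");
            emit("ELLIPSIS(" + text.substring(dots) + ")");
        } else {
            emit("INT(" + text + ")");
        }
    }

    // Drains queued tokens first; runs a rule only when the queue is empty.
    String nextToken() {
        while (pending.isEmpty()) {
            if (pos >= inputs.length) {
                return "EOF";
            }
            matchOneRule();
        }
        return pending.remove();
    }

    // Convenience helper: collects every token up to EOF.
    static List<String> allTokens(String... inputs) {
        EmitQueueSketch lexer = new EmitQueueSketch(inputs);
        List<String> out = new ArrayList<>();
        for (String t = lexer.nextToken(); !t.equals("EOF"); t = lexer.nextToken()) {
            out.add(t);
        }
        return out;
    }
}
```

Unlike the fixed-size array above, a queue does not need a 
MAX_EMIT_COUNT, at the cost of a little allocation per token.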

Is this still necessary, or can I just happily call emit()? Am I doing 
too many unneeded things in the rule?

I have another problem in this language (it is an existing language 
whose parser is written in yacc and lex, and I'm trying to create a 
scanner for it, btw).  The language defines an identifier as being able 
to contain '!' characters, but not as the first or the last character of 
the identifier. That is, abc!xyz should produce a T_IDENT(abc!xyz) token, 
while abc!=xyz should produce T_IDENT(abc); T_BANG_EQUAL(!=); T_IDENT(xyz).
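
Stated as a scanning rule, a '!' belongs to the identifier only when an 
alphanumeric character follows it. A minimal plain-Java sketch of that 
one-character lookahead follows. It is not ANTLR code: BangIdentSketch is 
a made-up name, ALPHANUMERIC is assumed to mean letters and digits, and 
keywords are deliberately not handled here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not generated by or tied to ANTLR.
class BangIdentSketch {
    static List<String> scan(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                int start = i;
                i = skipAlnum(s, i);
                // Absorb '!' only when an alphanumeric character follows it.
                while (i + 1 < s.length() && s.charAt(i) == '!'
                        && Character.isLetterOrDigit(s.charAt(i + 1))) {
                    i = skipAlnum(s, i + 1);
                }
                out.add("T_IDENT(" + s.substring(start, i) + ")");
            } else if (s.startsWith("!=", i)) {
                out.add("T_BANG_EQUAL(!=)");
                i += 2;
            } else if (c == '!') {
                out.add("T_BANG(!)");
                i++;
            } else {
                i++; // every other character is ignored in this sketch
            }
        }
        return out;
    }

    // Advances past a run of letters and digits.
    private static int skipAlnum(String s, int i) {
        while (i < s.length() && Character.isLetterOrDigit(s.charAt(i))) {
            i++;
        }
        return i;
    }
}
```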

After several attempts I decided to use an approach similar to the 
ellipsis case I showed above. However, the problem was the interaction 
with keywords, as in this case: if!(a && B) should produce 
KW_IF(if);T_BANG(!);T_LPAREN(();T_IDENT(a);T_AND(&&);T_IDENT(B);T_RPAREN()).

So far my solution has been to explicitly test for the "if" string to 
emit token KW_IF or T_IDENT, but I wonder if there's some magic I could 
perform to implement this in a more uniform manner:

T_IDENT: 
	ALPHANUMERIC+
	;

T_IDENT_BANG:
	i=T_IDENT (T_BANG T_IDENT)* b=T_BANG (
		(T_IDENT)=> s=T_IDENT {
			i.setType(T_IDENT);
			i.setText(getText());
			emit(i);
		  }
		| (T_EQUAL)=> s=T_EQUAL {
			if(i.getText().equals("if")) {
				i.setType(KW_IF);
			} else if (i.getText().equals("elif")) {
				i.setType(KW_ELIF);
			} else {
				i.setType(T_IDENT);
			}
			emit(i);
			b.setType(T_BANG_EQUAL);
			b.setText(b.getText()+s.getText());
			emit(b);
		}
		| { 
			if(i.getText().equals("if")) {
				i.setType(KW_IF);
			} else if (i.getText().equals("elif")) {
				i.setType(KW_ELIF);
			} else {
				i.setType(T_IDENT);
			}
			emit(i);
			b.setType(T_BANG);
			emit(b);
		  }
	) 
	;
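
One way the repeated if/else chains might be made more uniform is a 
keyword table consulted once per identifier. The following is only a 
sketch in plain Java under that assumption: KeywordTableSketch and 
typeFor are hypothetical names, and the token type numbers are 
placeholders rather than the values a generated lexer would assign.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the constants below are placeholders, not real
// token type values from any generated lexer.
class KeywordTableSketch {
    static final int T_IDENT = 1;
    static final int KW_IF = 2;
    static final int KW_ELIF = 3;

    private static final Map<String, Integer> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("if", KW_IF);
        KEYWORDS.put("elif", KW_ELIF);
    }

    // Returns the keyword type for reserved words, T_IDENT otherwise.
    static int typeFor(String text) {
        Integer type = KEYWORDS.get(text);
        return type != null ? type : T_IDENT;
    }
}
```

Each action could then, in principle, reduce to a single lookup on the 
text of i instead of a chain of equals() tests, and adding a keyword 
would mean adding one table entry. Whether this fits cleanly into the 
lexer actions has not been tested here.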

Thanks very much in advance and sorry for the long e-mail.
Best regards

    Julian

Terence Parr wrote:
> Hi Julian,
>
> just call emit() multiple times within a lexer rule :)
>
> Ter
> On Nov 24, 2006, at 2:12 AM, Julian Santander wrote:
>
>> Folks,
>>
>>     I'd be very grateful if someone could provide me some guidance on 
>> a problem I'm having. This is actually my first attempt at a parser 
>> using ANTLR. I'm using ANTLR beta 5 with Java generation (on a 
>> Windows XP machine and Java 1.5 if that matters).
>>
>>     At the lexical level I need to match tokens like '.' (dot) '..'  
>> (ellipsis), integers and floating point numbers. (Actually I don't 
>> need sign nor exponential formats)....
>>
>>     So far one of my many attempts has been:
>> T_INTEGER_LITERAL: DIGIT+;
>> DOT: ('.' (('.')=>{false}? | ))=> '.' ;
>>     // This one is copied from
>>     // http://www.antlr.org/blog/antlr3/lookahead.tml May 2006 post.
>> T_ELLIPSIS: '.' '.'+ ;
>> T_FLOAT_LITERAL: DIGIT+ DOT DIGIT* | DOT DIGIT+ ;
>>
>> But so far I'm unable to parse "1..2" into T_INTEGER_LITERAL, 
>> T_ELLIPSIS, T_INTEGER_LITERAL.
>>
>> for example: "... .. 1..2 3...4 5.0 .6 7." renders:
>> TOKEN: T_ELLIPSIS[@-1,0:2='...',<180>,1:0]
>> TOKEN: WS[@-1,3:3=' ',<168>,channel=99,1:3]
>> TOKEN: T_ELLIPSIS[@-1,4:5='..',<180>,1:4]
>> TOKEN: WS[@-1,6:6=' ',<168>,channel=99,1:6]
>> TOKEN: T_FLOAT_LITERAL[@-1,7:8='1.',<181>,1:7]
>> TOKEN: T_FLOAT_LITERAL[@-1,9:10='.2',<181>,1:9]
>> TOKEN: WS[@-1,11:11=' ',<168>,channel=99,1:11]
>> TOKEN: T_FLOAT_LITERAL[@-1,12:13='3.',<181>,1:12]
>> TOKEN: T_ELLIPSIS[@-1,14:15='..',<180>,1:14]
>> TOKEN: T_INTEGER_LITERAL[@-1,16:16='4',<175>,1:16]
>> TOKEN: WS[@-1,17:17=' ',<168>,channel=99,1:17]
>> TOKEN: T_FLOAT_LITERAL[@-1,18:20='5.0',<181>,1:18]
>> TOKEN: WS[@-1,21:21=' ',<168>,channel=99,1:21]
>> TOKEN: T_FLOAT_LITERAL[@-1,22:23='.6',<181>,1:22]
>> TOKEN: WS[@-1,24:24=' ',<168>,channel=99,1:24]
>> TOKEN: T_FLOAT_LITERAL[@-1,25:26='7.',<181>,1:25]
>> TOKEN: WS[@-1,27:27='\n',<168>,channel=99,1:27]
>>
>> I've tried other things (I've seen a post on emitting multiple tokens 
>> for the same rule, but was apparently not yet supported in v3, I've 
>> seen also the pascal examples for v2, but somehow I couldn't get them 
>> to work??)
>>
>> Thanks very much in advance and best regards
>>
>>     Julian
>>
>>
>>
>>
>
>

-- 
Julian Santander 
IN Application Development		   Tel: +34 91714 9145
Lucent Technologies Spain	<mailto:jsantander at lucent.com>
Avenida De Bruselas 8, Alcobendas,  		  28108  Spain