[antlr-interest] Changing/affecting the Lexer from the Parser?

Bernard Kaiflin bkaiflin.ruby at gmail.com
Sat Nov 10 13:05:41 PST 2012


Juancarlo,

Let's go further. If the lexer produces a FLOAT token, that is because you
have written a rule

FLOAT : INT '.' INT  ... maybe with exponent

If you remove that rule, the lexer can only produce INT DOT INT for an
input such as 1.2, which is what you need as far as ARRAY is concerned.
For real floating-point numbers, you can group the tokens yourself in a
parser rule

float : INT '.' INT etc
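
To make the idea concrete, here is a minimal sketch in plain Python. It is
not the ANTLR runtime: the token names and the helpers tokenize and
parse_float are made up for illustration. The lexer has no FLOAT pattern, so
1.2 always comes out as INT DOT INT, and the float is rebuilt at the parser
level. One caveat: without a lexer rule, 1 . 2 (with spaces) would also be
accepted unless the parser checks that the three tokens are adjacent, so the
sketch does that check.

```python
import re

# Hypothetical token set for illustration; there is deliberately no FLOAT.
TOKEN_SPEC = [
    ("INT", r"\d+"),
    ("DOT", r"\."),
    ("COLON", r":"),
    ("LPAR", r"\("),
    ("RPAR", r"\)"),
    ("ID", r"[A-Za-z_]\w*"),
    ("WS", r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Return (type, text, offset) triples, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group(), m.start()))
        pos = m.end()
    return tokens


def parse_float(tokens, i):
    """Parser-level 'float : INT DOT INT'.

    Returns (value, next_index), or None if no float starts at index i.
    """
    if i + 2 >= len(tokens):
        return None
    a, dot, b = tokens[i:i + 3]
    if (a[0], dot[0], b[0]) != ("INT", "DOT", "INT"):
        return None
    # The three tokens must touch each other, otherwise "1 . 2" would pass.
    if dot[2] != a[2] + len(a[1]) or b[2] != dot[2] + len(dot[1]):
        return None
    return float(a[1] + "." + b[1]), i + 3
```

With this, tokenize("ARR(1.2:4)") yields ID LPAR INT DOT INT COLON INT RPAR,
which is the stream the array rule expects, and parse_float is only called
where the parser actually wants a float.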

On the other hand, you could change the rule

 array : 'ARR' '(' index '.' dimension ')' ;

to

 array : 'ARR' '(' FLOAT COLON INT ')' ;

to match the pre-generated token stream, and programmatically split
$FLOAT.text into two parts: index and start (the first half of dimension).
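
The split itself could look like the sketch below. The helper name
split_float_text is hypothetical, and I assume it would be called with
$FLOAT.text from an action in the array rule. It works on the token text
rather than on a converted float value, so that 1.20 gives start 20 and not
2, and it rejects anything that is not plain digits, a dot, and digits (for
example a float with an exponent).

```python
import re

# Only the plain form digits '.' digits makes sense as index.start.
_INDEX_START = re.compile(r"([0-9]+)\.([0-9]+)")


def split_float_text(float_text):
    """Split the text of a FLOAT token such as '1.2' into (index, start)."""
    m = _INDEX_START.fullmatch(float_text)
    if m is None:
        raise ValueError(f"not of the form index.start: {float_text!r}")
    return int(m.group(1)), int(m.group(2))
```

For example, split_float_text("1.2") returns (1, 2), while "1.2e3" raises
ValueError instead of being silently misread.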

I don't like to give advice without testing it first, but in this case
I don't have the infrastructure. Could you check this?

Bernard


2012/11/10 Juancarlo Añez <apalala at gmail.com>

> Bernard,
>
> That is correct.
>
> I know that the solution is what you describe for Ruby. I wanted to know if
> someone had done the likes with ANTLR.
>
> -- Juancarlo
>
>
> On Sat, Nov 10, 2012 at 12:19 PM, Bernard Kaiflin
> <bkaiflin.ruby at gmail.com>wrote:
>
> > Yes, only the Ruby parser (the one I wrote by hand) knows if it is in the
> > middle of an expression and if the / is a division. If it is expecting an
> > atom, it knows that the / starts a regexp and can ask the lexer to rewind
> > and recreate a token with the whole regexp.
> >
> > If I understand correctly, you have a grammar
> >
> > array     : 'ARR' '(' index '.' dimension ')' ;
> > dimension : start ':' stop ;
> >
> > (index, start and stop are probably replaced by INT, but I give them names
> > for clarity). As the file is tokenized in advance, the lexer has created
> >
> > ARR or ID
> > LPAR
> > FLOAT
> > COLON
> > INT
> > RPAR
> >
> > instead of
> >
> > ARR or ID
> > LPAR
> > INT
> > DOT
> > INT
> > COLON
> > INT
> > RPAR
> >
> > And now the token stream mismatches the grammar. Before going further,
> > please tell me if it's correct.
> >
> >
> > 2012/11/10 Juancarlo Añez <apalala at gmail.com>
> >
> >> Bernard,
> >>
> >> On Sat, Nov 10, 2012 at 10:48 AM, Bernard Kaiflin
> >> <bkaiflin.ruby at gmail.com>wrote:
> >>
> >> > I still don't see the relationship between 2 ARR(1:5) ARR(1.2:4)
> >> > ARR(1.#I:#J) and a Python CommonTokenStream. Is it a special version
> >> > of Natural? Do you have the specifications for this language?
> >> >
> >>
> >> With the existing CommonTokenStream, the 1.2 in ARR(1.2:4) has been lexed
> >> as a float before the parser started, which is way before the parser gets
> >> to the expression. The Python CommonTokenStream bootstraps itself by
> >> tokenizing all input on the first call to any of the methods that return a
> >> token.
> >>
> >> I built the grammar for Natural from the reference material, which
> >> includes sort-of grammar descriptions.
> >>
> >> I think that a language like Ruby requires a parser-guided lexer. I've
> >> built some of those by hand before, and they are very efficient. But
> >> Natural's grammar was too big (~3000 lines) to try to approach it by hand.
> >>
> >> Cheers,
> >>
> >> --
> >>
> >> Juancarlo *Añez*
> >>
> >> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> >> Unsubscribe:
> >> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
> >>
> >
> >
>
>
> --
> Juancarlo *Añez*

