[antlr-interest] lexer problem (BUG?)

Jim Idle jimi at temporal-wave.com
Fri Jul 27 08:09:17 PDT 2007


Not a bug; it is just the way ANTLR generates lexers unless you tell it otherwise. It may be easier to view it this way: your original spec said that '<' alone was enough to predict the rule, so the generated code sets off down that rule as soon as it sees '<'. So you need to say "when you see '<', then if you see x it is a y, and if you see a it is a b". That is just the way Ter decided lexer generation should work, and in general it gives a smaller, faster lexer. Try:

script : SSTART ANY* SSTOP ;

JAVASCRIPT	: '<'
			(
				  ('script>')=>	'script>'		{ $type = SSTART; }
				| ('/script>')=>	'/script>'		{ $type = SSTOP;  }
				| 						{ $type = LT;	}
			)
			;

ANY : . ;

// These are just here to define a token type for $type,
// as declarations in tokens {} will result in undefined token warnings
// at the moment.
//
fragment LT 	: '<' 		;
fragment SSTART	: '<script>'	;
fragment SSTOP	: '</script>'	;
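
For anyone who finds the predicates hard to read, here is a rough hand-written Java sketch of the decision the JAVASCRIPT rule is meant to encode: match '<', then look ahead to decide which token type to report. It is not the code ANTLR generates; the class name PredicateSketch and the tokenize helper are made up for illustration, and token types are plain strings.

```java
import java.util.ArrayList;
import java.util.List;

public class PredicateSketch {
    // Hypothetical helper: returns the token type chosen for each token in the input.
    static List<String> tokenize(String in) {
        List<String> types = new ArrayList<>();
        int i = 0;
        while (i < in.length()) {
            if (in.charAt(i) == '<') {
                // '<' predicts the rule; the lookahead then picks the alternative.
                if (in.startsWith("script>", i + 1)) {
                    types.add("SSTART");
                    i += 1 + "script>".length();
                } else if (in.startsWith("/script>", i + 1)) {
                    types.add("SSTOP");
                    i += 1 + "/script>".length();
                } else {
                    // Empty alternative: only the '<' itself is consumed.
                    types.add("LT");
                    i += 1;
                }
            } else {
                // Everything else is a single-character ANY token.
                types.add("ANY");
                i += 1;
            }
        }
        return types;
    }

    public static void main(String[] args) {
        // Prints [SSTART, ANY, SSTOP, LT, ANY, ANY]
        System.out.println(tokenize("<script>x</script><s>"));
    }
}
```

The point is the last branch: when neither lookahead matches, the '<' is still returned as its own token instead of the lexer committing to '<script>' and failing partway through.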


You will soon find, though, that you really need to keep state in the lexer and only return these tokens when certain conditions are satisfied, such as a start token (SSTART) having been seen. But if you only need to deal with <script> </script>, then that should be good enough for you.
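
To give an idea of what keeping state could look like, here is a minimal Java sketch, again hand-written rather than generated. The inScript flag, the class name StatefulLexerSketch and the rule that a stray </script> falls back to LT plus ANY characters are all assumptions made for the example. In an ANTLR 3 grammar the same idea would presumably be a field declared in @lexer::members and tested with gated semantic predicates.

```java
import java.util.ArrayList;
import java.util.List;

public class StatefulLexerSketch {
    private static final String OPEN = "<script>";
    private static final String CLOSE = "</script>";

    // Hypothetical helper: returns the token type chosen for each token in the input.
    static List<String> tokenize(String in) {
        List<String> types = new ArrayList<>();
        boolean inScript = false; // assumed lexer state: has an unclosed SSTART been seen?
        int i = 0;
        while (i < in.length()) {
            if (!inScript && in.startsWith(OPEN, i)) {
                types.add("SSTART");
                inScript = true;
                i += OPEN.length();
            } else if (inScript && in.startsWith(CLOSE, i)) {
                // SSTOP is only returned when a matching SSTART came first.
                types.add("SSTOP");
                inScript = false;
                i += CLOSE.length();
            } else if (in.charAt(i) == '<') {
                types.add("LT");
                i += 1;
            } else {
                types.add("ANY");
                i += 1;
            }
        }
        return types;
    }

    public static void main(String[] args) {
        // Prints [SSTART, ANY, SSTOP]
        System.out.println(tokenize("<script>a</script>"));
    }
}
```

With this gating, a </script> that appears before any <script> is not reported as SSTOP; it comes out as LT followed by ANY tokens.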

Jim 

PS: This is straight from fingers to email, so you may find syntax errors ;-)

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Ruth Karl
> Sent: Friday, July 27, 2007 7:19 AM
> To: ANTR Interest
> Subject: Re: [antlr-interest] lexer problem (BUG?)
> 
> 
> 
> Ruth Karl schrieb:
> > Hi Andrew,
> >
> > thanks a lot for finding a smaller example to illustrate the problem.


> >>
> >> grammar jsp;
> >>
> >> JAVASCRIPT    :    '<script>' ( options {greedy=false;} : . )* '</script>' {System.out.print("J");};
> >>
> >> ANY    :    . {System.out.print("A");};
> >>
> >> jsp        :    (ANY | JAVASCRIPT)* EOF;
> >>
> >> with input:
> >>
> >> <script>foo</script>
> >> <s>bar</s>
> >>
> >>
> >> Produces a token stream of:
> >> "<script>foo</script>", "a", "r", "<", "/", "s", ">"
> >>
> >> aka
> >>
> >> JAVASCRIPT, ANY, ANY, ANY, ANY, ANY, ANY
> >>
> >> Something vacuums up the "<s>b"

