[antlr-interest] Lexer too quick to grab a token?

Sun May 1 16:19:02 PDT 2011

I have created an obnoxious grammar and need help lexing it.
Basically, a left-bracket plus a string represents an open tag, and
there's a matching close tag with a right bracket. If you want a
literal bracket, you type the bracket twice.

To be concrete,

[/ this is text in a tag /]
should lex as
L_TAG(text="[/") ... tokens representing "this is text in a tag" ...
R_TAG(text="/]")

The problem comes when I want to explain this grammar using the grammar.

To put stuff in a tag, type [[/ stuff /]]
should lex as
... lots of tokens ... L_BRACKET(text="[[") ... tokens representing "/
stuff /" ... R_BRACKET(text="]]")

Unfortunately, I can't figure out how to keep the lexer from matching
"/]" as an R_TAG and then having the extra "]" left over.

Conceptually, what I'd like to do is say that R_TAG matches a
character of the appropriate type followed by ']', as long as there's
no ']' immediately after. If there are two right brackets after the
character, the lexer should make those an R_BRACKET token and make the
first character a simple text token.
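
To make the rule concrete, here is a minimal hand-written sketch in
Python. It is not ANTLR and not a proposed solution, only an
illustration of the lookahead I have in mind. The function name lex and
the single TEXT token type are made up for the example; a real lexer
would split the text into finer tokens, and '/' stands in for "a
character of the appropriate type".

```python
def lex(source):
    """Split source into (token_type, text) pairs (illustrative sketch only)."""
    tokens = []
    text = []  # pending plain-text characters, emitted as one TEXT token

    def flush():
        if text:
            tokens.append(("TEXT", "".join(text)))
            text.clear()

    i = 0
    while i < len(source):
        pair = source[i:i + 2]
        after = source[i + 2:i + 3]  # character following the pair, or ""
        if pair == "[[":
            kind = "L_BRACKET"
        elif pair == "]]":
            kind = "R_BRACKET"
        elif pair == "[/":
            kind = "L_TAG"
        elif pair == "/]" and after != "]":
            # Only a close tag when no third ']' follows.
            kind = "R_TAG"
        else:
            # Anything else (including the '/' in "/]]") is plain text.
            text.append(source[i])
            i += 1
            continue
        flush()
        tokens.append((kind, pair))
        i += 2
    flush()
    return tokens


print(lex("[/ this is text in a tag /]"))
print(lex("type [[/ stuff /]]"))
```

With this rule the '/' in "/]]" falls through to plain text and the
"]]" that follows becomes R_BRACKET. The sketch makes no attempt to
resolve a close tag immediately followed by an escaped bracket
("/]]]"); it simply treats the '/' as text there too.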

Does this make any sense? Is there some way to deal with it?

Thanks,
Todd