[antlr-interest] greedy vs nongreedy lexer rules

Sun Apr 18 16:28:17 PDT 2010

On Apr 18, 2010, at 4:18 PM, Cliff Hudson wrote:

> Hmm, it seems to me that there should be a way to record the set of
> recursive actions and appropriate pointers into the lexed string such that
> you can replicate the logical state of the DFA when you execute the actions
> once an alternative is definitely selected.  Is there more to the state of
> the system at any given possible action point than a pointer to the start of
> the current substring, its length, and maybe a pointer to the already
> matched token stream?  I am possibly out of my depth here on understanding
> how the lexing system really works.  Or I have not adequately explained my
> idea :)

hiya. :)  Well, imagine that you are modifying "global" state as you match characters in identifier; this is something done in actions that ANTLR can analyze. There is no way to "roll this back".  ANTLR can only provide the information you specified about where it is and what character it was parsing--it can't deal with other member variables that you have updated.  Plus tracking all that information for input character could get expensive even if it was limited to that.

I'm leaning towards a trivial NFA implementation (Thompson's algorithm from the 60s as described by Russ Cox)

http://swtch.com/~rsc/regexp/regexp2.html

because it allows us to save partial matches like id=ID very easily, unlike a DFA. Unfortunately, we really need identifiers and keywords to be efficient because that's what most input streams consist of. I'm not sure I could get the NFA to go that fast compared to a DFA. On the other hand, the new NFA-based mechanism would likely be faster than the current v3 mechanism which is guaranteed to recognize the first few characters of any token twice. so, in the end I should go for the smallest implementation and the simplest that gives us the capabilities. That would spell Thompson's NFA-based algorithm. There's even a chance that I could make it go faster using two or three threads instead of just one to do the NFA simulation. 

choices choices

Ter