[antlr-interest] greedy vs nongreedy lexer rules

Sun Apr 18 16:50:41 PDT 2010

You wrote:

Well, imagine that you are modifying "global" state as you match characters
in identifier; this is something done in actions that ANTLR can analyze.
There is no way to "roll this back".

The 'global' state you are talking about is state the DFA is modifying but
which an action in a rule could examine?  Is the amount of state visible to
actions too much to store while alternatives are being evaulated?

On Sun, Apr 18, 2010 at 4:28 PM, Terence Parr <parrt at cs.usfca.edu> wrote:

>
> On Apr 18, 2010, at 4:18 PM, Cliff Hudson wrote:
>
> > Hmm, it seems to me that there should be a way to record the set of
> > recursive actions and appropriate pointers into the lexed string such
> that
> > you can replicate the logical state of the DFA when you execute the
> actions
> > once an alternative is definitely selected.  Is there more to the state
> of
> > the system at any given possible action point than a pointer to the start
> of
> > the current substring, its length, and maybe a pointer to the already
> > matched token stream?  I am possibly out of my depth here on
> understanding
> > how the lexing system really works.  Or I have not adequately explained
> my
> > idea :)
>
> hiya. :)  Well, imagine that you are modifying "global" state as you match
> characters in identifier; this is something done in actions that ANTLR can
> analyze. There is no way to "roll this back".  ANTLR can only provide the
> information you specified about where it is and what character it was
> parsing--it can't deal with other member variables that you have updated.
>  Plus tracking all that information for input character could get expensive
> even if it was limited to that.
>
> I'm leaning towards a trivial NFA implementation (Thompson's algorithm from
> the 60s as described by Russ Cox)
>
> http://swtch.com/~rsc/regexp/regexp2.html
>
> because it allows us to save partial matches like id=ID very easily, unlike
> a DFA. Unfortunately, we really need identifiers and keywords to be
> efficient because that's what most input streams consist of. I'm not sure I
> could get the NFA to go that fast compared to a DFA. On the other hand, the
> new NFA-based mechanism would likely be faster than the current v3 mechanism
> which is guaranteed to recognize the first few characters of any token
> twice. so, in the end I should go for the smallest implementation and the
> simplest that gives us the capabilities. That would spell Thompson's
> NFA-based algorithm. There's even a chance that I could make it go faster
> using two or three threads instead of just one to do the NFA simulation.
>
> choices choices
>
> Ter
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe:
> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>