[antlr-interest] greedy vs nongreedy lexer rules

Marcin Rzeźnicki marcin.rzeznicki at gmail.com
Tue Apr 20 17:51:28 PDT 2010


On Tue, Apr 20, 2010 at 1:04 AM, Ron Burk <ronburk at gmail.com> wrote:
>> Out of curiosity, may I ask why a "function call per token" is
>> something bad for multi-core performance?
>
> That aspect is independent of multiple cores.
> If you implement an efficient lexer (e.g., a minimized
> DFA), then the overhead of a function call tends to be
> a significant percentage of the per-token work of the
> lexer (based on my experience processing C/C++
> source -- YMMV wrt language, though I doubt by
> much).
>

Speaking of Java, I suppose this method is small enough to be
conveniently inlined by the JIT. In most cases, I believe, the call
site can also be proven monomorphic, so there are no virtual dispatch
costs. Of course, that depends on the situation and cannot be taken
for granted.
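
Roughly the kind of call site I mean, as a minimal sketch (Lexer,
Token and nextToken() are made-up names here, not ANTLR's generated
API):

// Hypothetical per-token call loop, not ANTLR's generated code.
interface Lexer {
    Token nextToken(); // one function call per token
}

final class Token {
    static final int EOF = -1;
    final int type;
    Token(int type) { this.type = type; }
}

final class Driver {
    // If only one concrete Lexer class is ever loaded, HotSpot can
    // prove this call site monomorphic, devirtualize it, and inline
    // the small nextToken() body straight into the loop.
    static long countTokens(Lexer lexer) {
        long n = 0;
        Token t = lexer.nextToken();
        while (t.type != Token.EOF) {
            n++;
            t = lexer.nextToken();
        }
        return n;
    }
}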

>> I am asking because the way
>> I see this issue is that pursuing two different paths in an NFA has
>> no inherent coupling. It is simply a matter of state replication and
>> merging back to a single outcome state upon reaching some barrier
>> marking an "end of work" state. Am I missing something?
>
> I don't think so. Once you're using an NFA for lexing, you'll
> likely be insanely less efficient than a minimized DFA,
> so the overhead of a function call per token should be
> unnoticed amidst the general slowness :-).

Are you thinking of the costs of backtracking here?

> If you were
> trying to use multiple cores to follow paths in an NFA,
> then "merging it back" sure sounds like a place to stall
> on attempting to update memory that is shared between
> cores. I wouldn't predict multiple cores would be faster for
> that without measuring it first on the target CPU.

Actually, I don't think that this merging is going to be needed.
Pursuing every possible path until it collapses into a final state or
an error may be sufficient. This is, of course, seriously undermined
by arbitrary user code that may be intertwined with the lexer :-))
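
A single-threaded sketch of what I mean, against a hypothetical Nfa
interface (not ANTLR's internals): all live paths advance in
lock-step until the live set either reaches final states or empties
out, so nothing ever has to be merged back.

import java.util.HashSet;
import java.util.Set;

// Hypothetical NFA interface for illustration only.
final class NfaRunner {
    interface Nfa {
        Set<Integer> startStates();           // initial states, epsilon-closed
        Set<Integer> step(int state, char c); // successors on c, epsilon-closed
        boolean isFinal(int state);
    }

    static boolean matches(Nfa nfa, CharSequence input) {
        Set<Integer> live = new HashSet<>(nfa.startStates());
        for (int i = 0; i < input.length() && !live.isEmpty(); i++) {
            Set<Integer> next = new HashSet<>();
            for (int s : live) {
                // each path advances independently
                next.addAll(nfa.step(s, input.charAt(i)));
            }
            live = next; // the surviving paths; nothing is merged back
        }
        for (int s : live) {
            if (nfa.isFinal(s)) return true; // some path reached a final state
        }
        return false; // all paths died: error
    }
}

Embedded actions are exactly what breaks this: a speculative path
cannot safely run side effects before we know it survives.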


> If you
> haven't looked at how ugly things have gotten inside
> in the last few years, I recommend:
>
> http://www.infoq.com/presentations/click-crash-course-modern-hardware
>

Thanks, I will take a look.

> The root problem for getting anywhere close to the theoretically
> possible speeds of putting lexing on a different core from
> parsing is the constant stalling over the shared data (the
> tokens).

They actually do not "share" data. The lexer simply produces the
data, which means, in my opinion, that the input can be fully
tokenized up front and then handed to the parser.
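
A sketch of that two-phase arrangement (again with hypothetical
Lexer/Token types, not ANTLR's generated classes): the lexer runs to
completion first, so the lexer and the parser never touch the token
buffer at the same time.

import java.util.ArrayList;
import java.util.List;

final class TwoPhasePipeline {
    interface Lexer { Token nextToken(); }

    static final class Token {
        static final int EOF = -1;
        final int type;
        Token(int type) { this.type = type; }
    }

    static List<Token> tokenizeAll(Lexer lexer) {
        List<Token> tokens = new ArrayList<>();
        Token t;
        do {
            t = lexer.nextToken();
            tokens.add(t);
        } while (t.type != Token.EOF);
        // The parser can now iterate over this finished list without
        // any per-token synchronization with the lexer.
        return tokens;
    }
}

If I remember correctly, ANTLR's CommonTokenStream already works
roughly this way: it buffers the whole token stream on first access.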



-- 
Greetings
Marcin Rzeźnicki

