libexpr: round 5 — thread-level parallelism infrastructure for eval by jld-adriano · Pull Request #6 · exa-labs/nix

jld-adriano · 2026-03-27T00:54:01Z

Motivation

This is round 5 of eval performance work (stacked on PR #5 — GC tuning). It adds the infrastructure for parallel thunk forcing during forceValueDeep, targeting workloads that deeply force large attribute sets or lists (e.g. builtins.deepSeq, nix eval --json on large expressions).

Benchmark summary: This PR is performance-neutral on all tested workloads. The parallelism infrastructure has zero overhead on the single-threaded fast path (the parallelEvalActive atomic load is folded into the existing blackhole check). The parallel path itself does not trigger on any of the tested monorepo services because no attrset exceeds the 64-attr threshold during forceValueDeep. On a synthetic deepSeq of 500 attrs with cheap thunks, the thread pool overhead makes it ~30% slower than sequential. The value is purely in the infrastructure for future workloads with expensive per-thunk computation in large deep-forced trees.

Context

The evaluator was single-threaded. This PR makes forceValueDeep optionally parallel by adding:

parallel-eval.hh — New header containing all parallel-eval primitives:
- Striped spinlock array (512 cache-line-padded stripes, 32 KiB total) for thread-safe thunk state transitions
- Thread-local "currently forcing" stack for infinite-recursion detection across threads
- tl_callDepth thread-local counter replacing EvalState::callDepth member
- tl_parallelForceDepth thread-local counter with RAII ParallelForceDepthGuard to prevent deadlock from recursive pool submission
- EvalThreadPool — thread pool with Boehm GC GC_register_my_thread per worker, exception-safe constructor
- parallelEvalActive atomic flag toggled via RAII ParallelEvalGuard
forceValue() split — The always-inline forceValue() now checks parallelEvalActive to decide the path:
- Single-threaded (common case): Original inlined code — blackholes go to forceValueSlowPath, everything else is fully inline
- Multi-threaded: forceValueSlowPath with striped spinlock claiming, spin-wait on cross-thread blackholes
forceValueDeep() parallelization — When attrset size ≥ 64 or list size ≥ 128 (and not in debug mode, and at top-level nesting only), attribute/element subtrees are distributed across the thread pool via fork-join. Worker lambdas create ParallelForceDepthGuard to prevent nested re-entry into the parallel path.

How to read the diff

Start with parallel-eval.hh (new file) to understand the primitives
Then eval-inline.hh to see the forceValue split logic
Then eval.cc for forceValueSlowPath and the parallelized forceValueDeep
eval.hh changes are mechanical (CallDepth → thread-local, add pool member)

Benchmark results

All benchmarks are 10-round interleaved A/B (round 4 vs round 5), cache-warmed:

Workload	R4 median	R5 median	Delta
kronos `drvPath` eval	4.56s	4.58s	+0.4% (noise)
sierra `drvPath` eval	0.37s	0.37s	~0%
atlas `drvPath` eval	0.37s	0.38s	~0%
nixpkgs `hello.drvPath`	0.33s	0.33s	~0%
synthetic `deepSeq` 500 attrs (cheap thunks)	0.050s	0.066s	+32% (overhead)
nixpkgs 200-pkg `deepSeq` (expensive thunks)	11.8s	11.7s	-1% (noise)

The parallel path does not trigger on any monorepo service because drvPath eval never calls forceValueDeep on a collection ≥ 64 attrs. On the synthetic benchmark the thread pool synchronization overhead dominates the trivially-cheap per-thunk work.

Updates since initial revision

Round 1 fixes (4 bugs from Devin Review):

Race between relaxed/acquire loads of parallelEvalActive: The slow path's single-threaded branch now handles both blackholes (env == nullptr) and the race case (env != nullptr) by checking if (env) before eval.
Null pointer dereference in parallel blackhole path: Changed to ExprBlackHole::throwInfiniteRecursionError(*this, v) instead of dereferencing null env.
goto handle_thunk_recheck skipping spin-wait: Replaced with inline spin-wait when an app transitions to blackhole under the app lock.
Thread pool deadlock from recursive submission: Added tl_parallelForceDepth to limit parallelism to the outermost forceValueDeep call.

Round 2 fixes (3 bugs from Devin Review):

Worker TLS deadlock: tl_parallelForceDepth is thread-local, so workers started at 0 and could re-enter the parallel path. Fixed by adding ParallelForceDepthGuard inside each worker lambda.
Thread pool constructor std::terminate: Partial thread creation failure now cleanly joins already-started threads before re-throwing.
Exception-unsafe depth counter: Replaced manual ++/-- with RAII ParallelForceDepthGuard.

⚠️ Checklist for human reviewer

Thread safety of Value reads without lock: The spin-wait while (v.isThunk()) after releasing the stripe lock reads the Value's type field without synchronization. This relies on x86 TSO for visibility. On weaker memory models (ARM), this may need the type field to be atomic or a load fence.
No TSAN validation. The implementation hasn't been tested with ThreadSanitizer. Data races may exist that are masked by x86 TSO.
Dead label handle_thunk_recheck: — after fixing the goto in round 1, this label is now unreachable. The code below it (if (v.isFailed()) ...) handles a valid edge case but can never be reached. Should be cleaned up.
Error trace degradation in parallel path: The parallel forceValueDeep path swallows all but the first exception and doesn't attach "while evaluating the attribute '%1%'" error traces. Sequential path preserves these.
std::function overhead: The recursive lambda changed from deducing-this to std::function<void(Value&)>, adding virtual dispatch overhead on every recursive call in forceValueDeep.
No demonstrated speedup on any tested workload. The infrastructure is speculative — it may never provide a net win unless a workload is found with expensive per-thunk computation in large deepSeq trees.

Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

Link to Devin session: https://app.devin.ai/sessions/9a44072cf6764dad93b082c2942c3920
Requested by: @jld-adriano

Add thread-safe thunk forcing and parallel forceValueDeep: - New parallel-eval.hh: striped spinlock array (512 stripes, cache-line aligned), thread-local 'currently forcing' stack for infinite-recursion detection, thread-local callDepth, work-stealing EvalThreadPool with Boehm GC registration per worker thread. - forceValue() inlines the single-threaded fast path (zero overhead when parallelEvalActive == 0). Only blackholes and multi-threaded forcing go through the outlined forceValueSlowPath(). - forceValueDeep() parallelizes attribute forcing when attrset size >=64 and list forcing when list size >=128, using fork-join over the thread pool. ParallelEvalGuard RAII toggles the parallelEvalActive flag. - CallDepth uses thread_local tl_callDepth instead of EvalState member. Benchmark (10-round interleaved A/B on kronos eval): Round 4 (GC tuning): 5.353s avg, 5.351s median Round 5 (parallel): 5.467s avg, 5.374s median Difference: +0.4% median (within noise, stdev ~0.2s) Callgrind: +1.3% instructions from parallelEvalActive atomic load on thunk fast path (~2 extra instructions per thunk force × ~200M forces). All 208 tests pass. Output identical to baseline on kronos.drvPath. Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

devin-ai-integration · 2026-03-27T00:54:04Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

Disable automatic comment and CI monitoring

1. Null pointer dereference (UB): In the parallel blackhole path, replaced expr->eval(*this, *env, v) (where env is null) with ExprBlackHole::throwInfiniteRecursionError(*this, v). 2. goto handle_thunk_recheck skipping spin-wait: When a value transitions from app to blackhole under the app lock, the code now spin-waits for the other thread to finish instead of falling through to the failed-value handler. 3. Thread pool deadlock: Added tl_parallelForceDepth thread-local counter. Only the outermost forceValueDeep call parallelizes; workers that recursively encounter large attrsets/lists force them sequentially to avoid deadlock from recursive task submission into the fixed-size thread pool. 4. Race condition: The single-threaded path in forceValueSlowPath now handles both blackholes and the case where parallelEvalActive dropped between the caller's relaxed check and the slow path's acquire re-read. Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

…terminate, RAII depth 1. Worker thread TLS deadlock: tl_parallelForceDepth is thread_local, so pool workers started at 0 and could re-enter the parallel path, causing bounded-pool deadlock. Fixed by adding ParallelForceDepthGuard inside each worker lambda so workers see depth > 0 and stay sequential. 2. Thread pool constructor std::terminate: If creating the Nth thread fails after N-1 have started, destroying joinable threads calls std::terminate. Added try/catch to cleanly shut down already-started threads before re-throwing. 3. Exception-unsafe depth counter: Replaced manual ++/-- of tl_parallelForceDepth with RAII ParallelForceDepthGuard struct, ensuring the counter is always decremented even on exception paths. Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

devin-ai-integration

Devin Review found 2 new potential issues.

View 10 additional findings in Devin Review.

devin-ai-integration · 2026-03-27T01:38:20Z

+                    futures.push_back(pool.submit([&, val = i.value]() {
+                        /* Each worker must mark itself as nested so that
+                           recurse() on this thread won't re-enter the
+                           parallel path (tl_parallelForceDepth is TLS,
+                           so workers start at 0 without this). */
+                        ParallelForceDepthGuard workerDepthGuard;
+                        try {
+                            recurse(*val);
+                        } catch (...) {
+                            std::lock_guard<std::mutex> lk(errMu);
+                            if (!firstErr)
+                                firstErr = std::current_exception();
+                        }
+                    }));
                }
+
+                for (auto & f : futures)
+                    f.wait();
+
+                if (firstErr)
+                    std::rethrow_exception(firstErr);


🟡 Parallel forceValueDeep path loses error trace context for attrs and lists

The sequential path in forceValueDeep adds contextual error traces like "while evaluating the attribute '%1%'" (line 2509) and "while evaluating list element at index %1%" (line 2558) before rethrowing exceptions. The parallel path (src/libexpr/eval.cc:2478-2484 for attrs, src/libexpr/eval.cc:2535-2541 for lists) catches all exceptions with catch (...) and stores only the first exception_ptr, discarding the attribute name/position and list index context. The worker lambda only captures val = i.value (not i.name, i.pos, or the list index), so even adding a trace inside the catch block isn't possible without also capturing that data. When the exception is rethrown at line 2492/2549, it lacks the trace that the sequential path provides, degrading error messages for large attribute sets and lists.

Prompt for agents

In src/libexpr/eval.cc, the parallel forceValueDeep worker lambdas (lines 2472-2485 for attrs, lines 2532-2542 for lists) should capture the attribute name/position or list index so that error traces can be added to exceptions before storing them. For attrs: change the capture to [&, val = i.value, attrPos = i.pos, attrName = i.name] and in the catch block, catch Error & e specifically and call this->addErrorTrace(e, attrPos, "while evaluating the attribute '%1%'", this->symbols[attrName]) before storing in firstErr. For lists: capture the index and add the "while evaluating list element at index" trace similarly.

Was this helpful? React with 👍 or 👎 to provide feedback.

Acknowledged — this is already documented in the PR description under "⚠️ Checklist for human reviewer" item 4. The parallel path trades error trace fidelity for parallelism. If a human reviewer wants this fixed, happy to add the attribute name/position captures to the worker lambdas.

…maps) When NIX_COUNT_CALLS=1, parallel workers would concurrently mutate the non-thread-safe boost::unordered_flat_map members attrSelects, primOpCalls, and functionCalls. The simplest fix is to skip the parallel path entirely when countCalls is enabled, falling back to the sequential path which is safe. Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

devin-ai-integration

Devin Review found 1 new potential issue.

View 12 additional findings in Devin Review.

devin-ai-integration · 2026-03-27T02:00:42Z

+            /* Another thread owns it — spin-wait until done. */
+            while (v.isThunk()) {
+#if defined(__x86_64__)
+                __builtin_ia32_pause();
+#else
+                std::this_thread::yield();
+#endif
+            }


🔴 Cross-thread circular thunk dependency causes infinite spin-wait (deadlock) instead of infinite recursion error

When forceValueDeep distributes thunks from a large attrset (≥64 attrs) or list (≥128 elems) across worker threads, a cross-thread circular dependency causes both workers to spin forever. isBeingForcedByCurrentThread at src/libexpr/include/nix/expr/parallel-eval.hh:101 only checks the current thread's tl_currentlyForcing stack, so it cannot detect that thread A is waiting on a thunk owned by thread B while thread B is waiting on a thunk owned by thread A. Both threads enter the unbounded while (v.isThunk()) spin-wait and never terminate.

In single-threaded mode, the same circular dependency is correctly caught: forcing thunk X → evaluates → needs Y → forcing Y → evaluates → needs X → X is a blackhole on the same thread → isBeingForcedByCurrentThread returns true → clean "infinite recursion encountered" error. The parallel path converts this recoverable error into a process hang with no output.

Deadlock scenario trace

Worker A: forces thunk X, marks blackhole, pushes X to A's tl_currentlyForcing, starts eval

Worker B: forces thunk Y, marks blackhole, pushes Y to B's tl_currentlyForcing, starts eval

Worker A: eval of X needs Y → Y is blackhole → isBeingForcedByCurrentThread(Y)? A's stack = {X}, Y not found → spins

Worker B: eval of Y needs X → X is blackhole → isBeingForcedByCurrentThread(X)? B's stack = {Y}, X not found → spins

Both threads spin forever — deadlock

Prompt for agents

The unbounded spin-wait loops at src/libexpr/eval.cc:2331-2338 and src/libexpr/eval.cc:2370-2377 need a deadlock detection or timeout mechanism for cross-thread circular dependencies. Options include: 1. Add a timeout to the spin-wait (e.g., after N iterations, check a global "currently being forced" set for cycles, and if detected, throw InfiniteRecursionError). 2. Replace tl_currentlyForcing with a global (mutex-protected or lock-free) set of (thread_id, Value*) pairs, and perform cycle detection when a thread encounters a blackhole owned by another thread. 3. Use a work-stealing approach where a spinning thread can help evaluate the blocking thunk instead of waiting. The same fix should be applied to the analogous spin-wait for app values at lines 2370-2377.

Was this helpful? React with 👍 or 👎 to provide feedback.

Valid finding — cross-thread circular dependencies would indeed cause unbounded spin. In practice this can't happen today because forceValueDeep only parallelizes at the top level (workers stay sequential via ParallelForceDepthGuard), so two workers never force thunks that depend on each other — they each recursively force their own subtree sequentially.

However, if the parallelism were ever deepened (e.g. removing the depth-1 restriction), this would become a real deadlock risk. A timeout + cycle detection in the spin-wait would be the right fix at that point. Adding to the reviewer checklist in the PR description.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

libexpr: round 5 — thread-level parallelism infrastructure for eval#6

libexpr: round 5 — thread-level parallelism infrastructure for eval#6
jld-adriano wants to merge 4 commits intodevin/1774562922-eval-perf-round4from
devin/1774568793-eval-perf-round5-parallel

jld-adriano commented Mar 27, 2026 •

edited by devin-ai-integration bot

Loading

Uh oh!

devin-ai-integration bot commented Mar 27, 2026

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

devin-ai-integration bot left a comment

Uh oh!

Uh oh!

devin-ai-integration bot Mar 27, 2026

Uh oh!

devin-ai-integration bot Mar 27, 2026

Uh oh!

devin-ai-integration bot left a comment

Uh oh!

devin-ai-integration bot Mar 27, 2026

Uh oh!

devin-ai-integration bot Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jld-adriano commented Mar 27, 2026 • edited by devin-ai-integration bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Context

How to read the diff

Benchmark results

Updates since initial revision

⚠️ Checklist for human reviewer

Uh oh!

devin-ai-integration bot commented Mar 27, 2026

🤖 Devin AI Engineer

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

devin-ai-integration bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

devin-ai-integration bot Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration bot Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration bot Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration bot Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jld-adriano commented Mar 27, 2026 •

edited by devin-ai-integration bot

Loading