Skip to content

Fix deadlock between concurrent query worker and main thread#9436

Open
khanaffan wants to merge 10 commits into
masterfrom
affank/os-deadlock
Open

Fix deadlock between concurrent query worker and main thread#9436
khanaffan wants to merge 10 commits into
masterfrom
affank/os-deadlock

Conversation

@khanaffan

@khanaffan khanaffan commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

imodel-native: iTwin/imodel-native#1463

Fix uses a dedicated connection that shared with all workers to prepare ECSQL statement. Previously we used primary connection that caused the dead lock. Performance test were added to see if we degraded or improved performance of concurrent query. From below it is clear that 4 thread which is default we improved by 1.1% (queries/sec)

ECSqlReader concurrent-query performance: A vs B (final)

A = original addon (9cd010a6) · B = new addon with serialization fix (4742fee6).
Throughput in queries/sec (higher = better), mean of 2 runs/phase; range spans maxConc 16 and 64.
All runs: 0 errors, rows=847828.

workers A B (fixed) Δ
1 182–183 182–184 ~0%
2 307–313 306–309 ≤1.4%
4 410–427 415–426 ≤1.1%
8 228–239 230–238 ≤1.1%

- Updated performance test for ECSqlReader to include new query options.
- Adjusted expected prepare time in ConcurrentQuery tests for accuracy.
- Added new performance test configuration for ECSqlReader with multiple classes.
- Created new performance results documentation for ECSqlReader concurrent queries.
"perftest:schemaloader": "npm run -s perftest:pre && mocha --timeout=999999999 \"./lib/cjs/perftest/SchemaLoader.test.js\"",
"perftest:ecSqlRowPerformance": "npm run -s perftest:pre && mocha --timeout=999999999 \"./lib/cjs/perftest/ECSqlRow.test.js\"",
"perftest:readQueryPerformance": "npm run -s perftest:pre && mocha --timeout=999999999 \"./lib/cjs/perftest/ReadQueryPerf.test.js\"",
"perftest:ecSqlReaderConcurrent": "npm run -s perftest:pre && mocha --timeout=999999999 \"./lib/cjs/perftest/ECSqlReaderConcurrentPerf.test.js\"",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to run them as part of some pipeline as well? Also, do we want to generate an artifact to record the performance numbers so that we can see how performance number change in future?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It be great to track performance.

Comment thread core/backend/src/test/ecdb/ECDb.test.ts Outdated
Comment thread core/backend/src/test/ecdb/ConcurrentQuery.test.ts Outdated
@rschili

rschili commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

The PR description is stale and still describes the old approach, you should update it.

The PR bundles several independent things, could have been split, but I approved, with respect to the still pending comments from others.

const median = runs[Math.floor(runs.length / 2)];
const minQps = runs[0].throughputQps;
const maxQps = runs[runs.length - 1].throughputQps;
totalErrors.push(...median.errorSamples);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only asserts errors from the median run after sorting by throughput. If a non-median repeat has query failures, the test can still pass as long as the median repeat is clean, which makes the harness look more reliable than it is.

Can we aggregate errors from every timed repeat separately from choosing the median result for perf reporting?

— Nambot 🤖 (powered by GPT-5.5)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a real bug. totalErrors.push(...median.errorSamples) collects error samples from only the median repeat after the runs.sort(). If any other repeat has query errors, runWorkload captures them in its errorSamples, but they're never pushed to totalErrors, so the assert.strictEqual(totalErrors.length, 0) at line 422 can pass despite errors in non-median runs. The assert's message ("all generated ECSQL statements should execute without error") is then misleading.

Suggest aggregating from all runs:

for (const r of runs)
  totalErrors.push(...r.errorSamples);

This is a perf harness so it won't block merge, but worth a fix before someone trusts the zero-error pass.


// A single seed iModel populated with every configured class so cross-class, polymorphic and
// join queries all return data.
const fileName = `ECSqlReaderConcurrentPerf_seed_${config.elementsPerClass}x${config.classNames.length}.bim`;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seed cache key is too weak. It only includes element count and class count, so changing classNames, schema assumptions, or seed-building logic with the same counts will silently reuse an old .bim via the early return below.

Can we include the configured class names/config identity in the seed filename, or validate the existing seed before reusing it?

— Nambot 🤖 (powered by GPT-5.5)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid point — changing classNames while keeping the same count would silently reuse a stale seed. A simple fix is to include the names in the filename: classNames.join("-") or a short hash of the config. Low risk for a manually-run tool where the developer controls the output directory, but worth noting as a footgun if the config evolves. Can be a follow-up.

@aruniverse aruniverse added this to the iTwin.js 5.11 milestone Jun 29, 2026

@aruniverse aruniverse left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the TypeScript-side changes. All good to merge once the native package from imodel-native#1463 lands.

ECSqlStatement.test.ts — Moving PRAGMA validate_ecsql_writes=true from createQueryReader to withWriteStatement is the right fix. The pragma must run on the primary write connection; via CCQ it silently no-ops on a throwaway worker connection. The added comment explains this clearly (see inline).

ECDb.test.ts — Trailing comma only. The runDbListPragmaCCQ behavior (pragmas executing on the schema-source/data-source connection) is validated on the native side in ECSqlPragmasTests.cpp.

ConcurrentQuery.test.tscloseTo(0, 4) tolerance is fine.

Issue #9451 — Properly scoped. The fix here (using the right API) is correct; making CCQ actively reject write-affecting PRAGMAs with an error is a separate infrastructure improvement. Good call filing it explicitly.

Changelog"type": "none" is accurate; the behavior fix lives in the native PR and this PR contains only test corrections.

Perf test in CI (naveedkhan8067's question) — worth a follow-up ticket to wire perftest:ecSqlReaderConcurrent into a pipeline and capture baseline artifacts, but not a blocker here.

Comment thread core/backend/src/test/ecdb/ECSqlStatement.test.ts
khanaffan and others added 2 commits June 29, 2026 15:28
The deadlock fix prepares the first query against a cold shared schema-source
connection, so prepareTime is higher and more variable on CI (observed 7ms
against a 0 +/- 4ms tolerance). Assert only that prepareTime is negligible
relative to the ~1000ms execution instead of pinning a tight tolerance.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@khanaffan khanaffan requested a review from a team as a code owner June 29, 2026 19:59

@aruniverse aruniverse left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed commits 41626a7bab and f844e0b39f.

doNotUsePrimaryConnToPrepare deprecation (41626a7bab) — well done. The @deprecated in 5.11.0 JSDoc in ConcurrentQuery.ts is clear, the core-common.api.md snapshot is updated, both core-backend and core-common have changenotes, and the flag is removed from the test config. ✅

prepareTime bound (f844e0b39f) — lessThan(100) is lenient but the comment explains exactly why (cold schema-source connection, observed 7ms against a 0 +/- 4 bound). The assertion still has value as a sanity guard against catastrophically slow prepares, and the narrative explanation in the comment is honest. ✅

One issue on the perf harness (see inline on line 382): totalErrors.push(...median.errorSamples) only collects errors from the median repeat — errors in other repeats are silently dropped before the assert.strictEqual(totalErrors.length, 0) check. Worth a one-liner fix before relying on that assertion for correctness gating.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants