Skip to content

Add parallel gtest flake reproduction runner#14827

Open
xingbowang wants to merge 6 commits into
facebook:mainfrom
xingbowang:gtest-parallel-repro
Open

Add parallel gtest flake reproduction runner#14827
xingbowang wants to merge 6 commits into
facebook:mainfrom
xingbowang:gtest-parallel-repro

Conversation

@xingbowang

@xingbowang xingbowang commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add tools/gtest_parallel_repro.py to reproduce gtest flaky tests that depend on process-level CPU contention and scheduler delays.
  • Run many fresh gtest processes concurrently, isolate each process with its own TEST_TMPDIR, and optionally pin the workload to a smaller CPU set with taskset.
  • Record per-run logs, write failures.jsonl, summarize failure keys, support stopping on first failed batch, and optionally rebuild with COERCE_CONTEXT_SWITCH=1.
  • Document when to use the runner in CLAUDE.md for CI-style flaky tests that single-process --gtest_repeat or coerce mode alone does not reproduce.
  • Using the new runner, we found and fixed one flaky test and improved c_test harness
    • Flaky test: EnvPosixTest.ReadAsyncQueueFull simulated a full io_uring submission queue by overwriting the SQE pointer after io_uring_get_sqe() had already consumed a submission slot. That left stale state in the thread-local ring, so later AbortIO tests could process an unexpected completion and crash. Add a skip_io_uring_get_sqe syncpoint before the io_uring_get_sqe() call and use it from the test so the Busy path is exercised without mutating the ring.

Test Plan

  • CI

Add tools/gtest_parallel_repro.py for reproducing flakes that depend on process-level CPU contention and scheduler delays. The runner launches fresh gtest processes in parallel, isolates TEST_TMPDIR per process, optionally pins them to a small CPU set, summarizes failures, and can rebuild with COERCE_CONTEXT_SWITCH=1.

Document the runner in CLAUDE.md so agents use it when single-process --gtest_repeat or COERCE_CONTEXT_SWITCH alone does not reproduce a CI-style flaky test.
@meta-cla meta-cla Bot added the CLA Signed label Jun 6, 2026
@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown

⚠️ clang-tidy: 2 warning(s) on changed lines

Completed in 924.9s.

Summary by check

Check Count
bugprone-argument-comment 2
Total 2

Details

file/random_access_file_reader_test.cc (1 warning(s))
file/random_access_file_reader_test.cc:114:21: warning: argument name 'direct_io_buffer' in comment does not match parameter name 'direct_io_buffer_context' [bugprone-argument-comment]
table/block_fetcher.cc (1 warning(s))
table/block_fetcher.cc:359:23: warning: argument name 'direct_io_buffer' in comment does not match parameter name 'direct_io_buffer_context' [bugprone-argument-comment]

@meta-codesync

meta-codesync Bot commented Jun 6, 2026

Copy link
Copy Markdown

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D107758097.

@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 246fe5b


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 246fe5b


Summary

A well-structured standalone Python tool for reproducing CI-specific gtest flakes. The core design is sound, but there are process cleanup gaps on interruption and a misleading runtime env var that is actually compile-time only.

High-severity findings (2):

  • [tools/gtest_parallel_repro.py:~140-160] No SIGINT/signal handler: Ctrl+C leaves orphaned child processes running (they are in separate process groups via start_new_session=True).
  • [tools/gtest_parallel_repro.py:~140-155] File descriptor leak if subprocess.Popen() raises after log_fh = log_path.open("w") -- opened log files are never closed on the exception path.
Full review (click to expand)

Findings

🔴 HIGH

H1. No signal handler -- orphaned child processes on Ctrl+C — tools/gtest_parallel_repro.py
  • Issue: Child processes are created with start_new_session=True, placing them in their own process groups. If the user presses Ctrl+C during a batch, KeyboardInterrupt propagates through the main process but no cleanup kills the children. They continue running indefinitely.
  • Root cause: No signal.signal(signal.SIGINT, ...) or try/except KeyboardInterrupt with cleanup logic.
  • Suggested fix: Wrap the batch loop in a try/except KeyboardInterrupt that iterates over all active processes and calls terminate_process(), then closes all log file handles. Alternatively, register a signal handler that performs cleanup.
H2. File descriptor leak on Popen failure — tools/gtest_parallel_repro.py:run_batch()
  • Issue: In the process-spawning loop, log_fh = log_path.open("w") opens a file, then subprocess.Popen(...) may raise (e.g., binary not found, permission denied, resource exhaustion). The opened log_fh is never closed because the exception exits the function before reaching the cleanup loop at the bottom.
  • Root cause: No try/except or context manager around the open+Popen pair.
  • Suggested fix: Use a try/except around Popen() that closes log_fh on failure, or restructure to close all opened file handles in a finally block.

🟡 MEDIUM

M1. COERCE_CONTEXT_SWITCH set at runtime is misleading — tools/gtest_parallel_repro.py:main()
  • Issue: The script sets base_env["COERCE_CONTEXT_SWITCH"] = "1" for test processes at runtime. However, COERCE_CONTEXT_SWITCH is a compile-time-only flag (#ifdef COERCE_CONTEXT_SWITCH in C++ source, -DCOERCE_CONTEXT_SWITCH via Makefile). Setting it in the environment at runtime has zero effect on already-compiled binaries.
  • Root cause: The flag is correctly used during --build (passed to make), but is also unconditionally added to base_env which affects the test execution environment. The compile-time nature means this runtime setting is a no-op.
  • Suggested fix: Only set COERCE_CONTEXT_SWITCH in base_env when args.build is true (it's already set in build_env). Remove the runtime base_env setting, or add a comment explaining it has no runtime effect and is present only for documentation/metadata purposes.
M2. Elapsed time measurement inaccuracy — tools/gtest_parallel_repro.py:run_batch()
  • Issue: elapsed = time.monotonic() - float(info["start"]) is computed during the result-processing loop, not when the process actually finishes. For a batch with many processes finishing at different times, the reported elapsed_sec includes time spent processing other results.
  • Root cause: No completion timestamp is recorded when proc.poll() returns non-None.
  • Suggested fix: Record info["end_time"] = time.monotonic() when a process is detected as finished in the polling loop, and use that for elapsed calculation.
M3. TOTAL_RUNS calculation is fragile — tools/gtest_parallel_repro.py:main()
  • Issue: print(f"TOTAL_RUNS={args.jobs * (batch if 'batch' in locals() else 0)}") uses a 'batch' in locals() guard that is never actually needed (the loop always executes at least once since args.batches >= 1). More importantly, if the loop is interrupted by an exception mid-batch, batch reflects the last attempted batch, but that batch's processes may not have all completed, making TOTAL_RUNS inaccurate.
  • Root cause: Coupling the total count to the loop variable rather than tracking completions.
  • Suggested fix: Use args.jobs * args.batches for the normal path, or track completed_batches explicitly. Remove the dead 'batch' in locals() fallback.
M4. terminate_process() final proc.wait() has no timeout — tools/gtest_parallel_repro.py:terminate_process()
  • Issue: After SIGKILL, the final proc.wait() has no timeout. While SIGKILL is normally immediate, a process in uninterruptible I/O (D state) can hang indefinitely, blocking the entire tool.
  • Root cause: Missing timeout on the final wait.
  • Suggested fix: Add proc.wait(timeout=10) with a try/except to continue if even SIGKILL doesn't work.

🟢 LOW / NIT

L1. No macOS equivalent for taskset
  • Issue: taskset is Linux-specific. On macOS, the --cpus/--cpu-count options will fail with a clear error ("taskset not in PATH"), which is acceptable. However, the tool could note this limitation.
  • Suggested fix: Add a note in the help text or docstring that CPU pinning requires Linux.
L2. shutil.rmtree on Python < 3.12 follows symlinks in some edge cases
  • Issue: shutil.rmtree(run_dir) could follow symlinks if a test creates them in TEST_TMPDIR. On Python 3.12+, rmtree defaults to not following symlinks. On older versions, it may.
  • Suggested fix: Consider shutil.rmtree(run_dir, ignore_errors=True) or onerror for robustness, though the risk is low since these are test-created directories.
L3. Test plan is minimal
  • Issue: The PR test plan is only py_compile and --help. For a process-management tool, this provides no confidence in the core functionality (process spawning, timeout, cleanup).
  • Suggested fix: Add at least a basic smoke test that runs a trivial binary (e.g., /bin/true) with --jobs=2 --batches=1 and verifies the output structure.
L4. Missing copyright header
  • Issue: Other Python files in tools/ (e.g., db_crashtest.py) have copyright headers. This file is missing one.
  • Suggested fix: Add the standard RocksDB copyright header.

Cross-Component Analysis

Context Relevant? Notes
COERCE_CONTEXT_SWITCH YES Compile-time only; runtime env var is no-op (M1)
macOS YES taskset unavailable; clear error produced (L1)
Resource exhaustion LOW RISK 100 concurrent processes could hit FD limits on constrained systems, but this is a developer tool, not production
Disk space LOW RISK Many log directories created; acceptable for a debugging tool

Positive Observations

  • Clean separation of concerns (build, run, collect, report)
  • Good use of start_new_session=True for process group isolation
  • Thoughtful TEST_TMPDIR isolation per process
  • JSONL output format is well-designed for downstream tooling
  • The --stop-on-failure flag is a useful optimization
  • Failure key extraction (gtest failures, sanitizer errors, assertions) is comprehensive
  • The build_if_requested warning when --coerce-context-switch is used without --build is a nice touch
  • CLAUDE.md documentation is clear and well-placed in the existing flakiness section

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

Running tools/gtest_parallel_repro.py with multiple fresh processes exposed two independent test bugs.

First, c_test built most filesystem paths from geteuid(), and remote_compaction_null_callback_handling used the fixed rocksdb_c_test_null_service path. Parallel processes from the same user could therefore share directories and fail opening LOCK with Resource temporarily unavailable. Add a process-unique test id on POSIX and use it for the C API test paths, including the null-service DB path.

Second, EnvPosixTest.ReadAsyncQueueFull simulated a full io_uring submission queue by overwriting the SQE pointer after io_uring_get_sqe() had already consumed a submission slot. That left stale state in the thread-local ring, so later AbortIO tests could process an unexpected completion and crash. Add a skip_io_uring_get_sqe syncpoint before the io_uring_get_sqe() call and use it from the test so the Busy path is exercised without mutating the ring.

Validation: rebuilt c_test and env_test with COERCE_CONTEXT_SWITCH=1; reran c_test for 20 parallel processes, the queue-full/AbortIO env sequence for 20 parallel processes, full env_test for 8 parallel processes, and the full 219-binary repro sweep. Slow timeout artifacts passed with longer timeout or gtest sharding. Also ran format-auto, check-sources, and git diff --check.
@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 9bd41f0


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 9bd41f0


Summary

This PR adds a parallel gtest flake reproduction runner, fixes an io_uring ring corruption bug in the ReadAsyncQueueFull test, and improves c_test path isolation for parallel execution. The approach is sound, follows established codebase patterns (boolean skip sync points are used in table builders), and has zero performance impact in release builds.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🔴 HIGH

None.

🟡 MEDIUM

M1. Python script lacks unit tests — tools/gtest_parallel_repro.py
  • Issue: RocksDB has a convention of testing Python tools (e.g., tools/db_crashtest_test.py tests tools/db_crashtest.py, with a dedicated Makefile target db_crashtest_tests). The new 445-line script has no corresponding test file.
  • Root cause: Missing test convention adherence.
  • Suggested fix: Add tools/gtest_parallel_repro_test.py with unit tests for argument parsing, resolve_cpu_list(), extract_failure_keys(), parse_env(), and process management logic. Add a Makefile target.

🟢 LOW / NIT

L1. GetTestId() Windows branch includes ThreadID — db/c_test.c:35
  • Issue: On Windows, GetTestId() returns geteuid() which internally combines GetCurrentProcessId() << 16 | GetCurrentThreadId(). If c_test were ever called from multiple threads, different threads would produce different test IDs. On Linux, the function uses (euid << 32) ^ pid which is thread-stable.
  • Root cause: Asymmetric Windows/Linux implementations.
  • Suggested fix: Not blocking since c_test.c is single-threaded, but for consistency the Windows path could use return (uint64_t)GetCurrentProcessId(); directly.
L2. Retained old sync point — env/io_posix.cc
  • Issue: The old PosixRandomAccessFile::ReadAsync:io_uring_get_sqe sync point is retained after the new skip_io_uring_get_sqe point. If a future test uses the old pattern (set sqe=nullptr after call), the same ring corruption could recur.
  • Suggested fix: Add a comment near the retained sync point warning against the pattern of overwriting sqe to nullptr. Consider removing it in a follow-up if no test needs it.
L3. null_service test used relative path — db/c_test.c:4953
  • Issue: The pre-fix code used "rocksdb_c_test_null_service" as a relative path, creating the DB in the current working directory instead of the temp directory. The fix correctly moves it to GetTempDir()/GetTestId().
  • Note: This is a positive fix, not a problem with the PR.

Cross-Component Analysis

The io_uring ring corruption root cause is well-understood: ReadAsync, MultiRead, and AbortIO share a thread-local io_uring instance. The old ReadAsyncQueueFull test consumed an SQ slot via io_uring_get_sqe() then overwrote the pointer to nullptr, leaving a stale submission that could cause subsequent AbortIO tests to process unexpected completions. The new approach skips the io_uring_get_sqe() call entirely via a boolean flag, preventing any ring state mutation.

Context Code executes? Assumptions hold? Action needed?
Release builds YES (sync points are no-ops) YES None
Non-io_uring platforms NO (early return) N/A None
WritePreparedTxnDB NO (sync points unrelated) N/A None
Multi-threaded test runners YES YES (thread-local ring) None

Positive Observations

  • The boolean skip sync point pattern matches established precedent in block_based_table_builder.cc and plain_table_builder.cc.
  • TEST_SYNC_POINT_CALLBACK compiles to nothing in release builds -- verified zero overhead.
  • The c_test isolation fix using (euid << 32) ^ pid is simple and effective for parallel execution.
  • Process management in the Python script is robust: uses start_new_session=True for process group isolation, os.killpg() for group termination, and proper SIGTERM-then-SIGKILL escalation with timeout handling.
  • The Python script includes useful features: failure key extraction via regex, JSONL output, histogram summarization, and --stop-on-failure for interactive use.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

@meta-codesync

meta-codesync Bot commented Jun 6, 2026

Copy link
Copy Markdown

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D107758097.

@meta-codesync

meta-codesync Bot commented Jun 8, 2026

Copy link
Copy Markdown

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D107758097.

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 35c9fcb


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 35c9fcb


Summary

This PR correctly fixes an io_uring ring corruption bug in the ReadAsyncQueueFull test, improves c_test.c directory uniqueness for parallel execution, and adds a well-structured flake reproduction tool. The core fix is sound, with unanimous approval across all review dimensions.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🔴 HIGH

No high-severity findings.

🟡 MEDIUM

M1. Missing copyright/license header in gtest_parallel_repro.py
  • Issue: The new Python tool at tools/gtest_parallel_repro.py has no copyright or license header. Every other Python tool in tools/ has one (e.g., db_crashtest.py, write_stress_runner.py, ldb_test.py).
  • Root cause: Omission during file creation.
  • Suggested fix: Add the standard RocksDB dual-license header after the shebang:
    #!/usr/bin/env python3
    #  Copyright (c) Meta Platforms, Inc. and affiliates.
    #  This source code is licensed under both the GPLv2 (found in the
    #  COPYING file in the root directory) and Apache 2.0 License
    #  (found in the LICENSE.Apache file in the root directory).
M2. Consider simplifying GetTestId() -- euid adds no uniqueness in parallel testing
  • Issue: The formula ((uint64_t)geteuid() << 32) ^ (uint64_t)getpid() includes geteuid(), but all parallel test processes run under the same user, making euid a constant. Uniqueness depends entirely on getpid(). The XOR with a constant is a bijection on PID values so it's functionally correct, but unnecessarily complex.
  • Root cause: Defensive over-engineering.
  • Suggested fix: Consider using (uint64_t)getpid() directly, or document why euid is included (e.g., to also differentiate across users in shared temp directories).

🟢 LOW / NIT

L1. Windows GetTestId() produces cosmetically large numbers
  • Issue: On Windows, geteuid() returns ((int)GetCurrentProcessId() << 16) | (int)GetCurrentThreadId(), which can be negative when PID > 32767. Casting a negative int to uint64_t produces values near UINT64_MAX (e.g., 20-digit numbers). This makes directory names ugly but is not a correctness bug -- the 200-char buffers can accommodate up to 20-digit numbers with ~50 chars margin.
  • Suggested fix: No action required. If cosmetics matter, consider masking: return (uint64_t)(unsigned int)geteuid(); on Windows.
L2. null_db path changes from relative to absolute
  • Issue: Previously null_db was the hardcoded relative path "rocksdb_c_test_null_service" (creating the DB in CWD). The PR changes it to GetTempDir() + GetTestId(). This is actually an improvement (consistent with all other test paths, avoids CWD pollution, supports parallel execution). The PR also adds an snprintf overflow check (CheckCondition) which other paths lack.
  • Suggested fix: None needed. This is a positive change.
L3. No unit tests for Python tool helper functions
  • Issue: Functions like extract_failure_keys(), parse_env(), and resolve_cpu_list() have no unit tests. This is acceptable for a test infrastructure tool but could improve maintainability.
  • Suggested fix: Consider adding tests in a follow-up if the tool sees broad adoption.

Cross-Component Analysis

Context Does code execute? Assumptions hold? Action needed?
WritePreparedTxnDB YES (via table readers) YES (sync points are no-op in release) None
ReadOnly DB YES (heavy table reading) YES None
CompactionService YES (reads SST files) YES None
User-defined timestamps No direct impact N/A None
MemPurge No async I/O N/A None
BlobDB YES (via table readers) YES None

Ring corruption mechanism verified:

  1. io_uring_get_sqe() advances the internal SQ tail pointer
  2. Nulling the returned sqe pointer does NOT undo the tail advancement
  3. Subsequent io_uring_submit() processes the stale/uninitialized SQE slot
  4. The thread-local ring is shared across all tests in the TestAsyncRead fixture
  5. The new skip_io_uring_get_sqe approach prevents this by skipping the call entirely

Production impact: NONE. TEST_SYNC_POINT_CALLBACK is #defined to nothing in release builds (#ifdef NDEBUG). The compiler eliminates the bool skip_get_sqe variable and the if (!skip_get_sqe) branch entirely via dead code elimination and constant propagation.

Positive Observations

  • Elegant fix: The skip_io_uring_get_sqe pattern cleanly separates test injection from ring manipulation. Retaining the old sync point inside the if block preserves backward compatibility.
  • Thorough c_test.c update: All 10 snprintf calls using geteuid() are updated, plus null_db gets a proper temp-dir path with overflow checking.
  • Well-designed Python tool: gtest_parallel_repro.py uses start_new_session=True for process group management, proper SIGTERM/SIGKILL escalation, isolated TEST_TMPDIR per process, subprocess.Popen with argument lists (no shell=True), and structured JSONL output.
  • Zero performance impact in release builds confirmed by compiler optimization analysis.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

@meta-codesync

meta-codesync Bot commented Jun 8, 2026

Copy link
Copy Markdown

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D107758097.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit b108c3f


Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit b108c3f


Summary

Solid test infrastructure PR that fixes a real io_uring SyncPoint bug, improves c_test parallel isolation, and adds a useful flake reproduction tool. The io_uring fix is correct and well-motivated. The c_test changes are mechanically sound but have a minor Windows-specific concern (pre-existing). The Python tool is well-structured.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. Windows GetTestId() is a no-op for parallel isolation -- db/c_test.c:35
  • Issue: On Windows, GetTestId() returns (uint64_t)geteuid() without XORing with getpid(). The Windows geteuid() stub already embeds PID, so this likely works, but the asymmetry with Unix is undocumented.
  • Suggested fix: Add a comment explaining why Windows doesn't need the getpid() XOR.
M2. No unit tests for gtest_parallel_repro.py helpers -- tools/gtest_parallel_repro.py
  • Issue: 481-line tool with regex-based parsing (extract_failure_keys) has no unit tests. extract_failure_keys in particular is fragile.
  • Suggested fix: Add a test_gtest_parallel_repro.py for the parsing helpers.
M3. null_db path behavioral change -- db/c_test.c:4956
  • Issue: Old null_db was a literal string relative to CWD; new code uses GetTempDir(). This is actually an improvement making it consistent with all other test paths -- just noting the behavioral change.

🟢 LOW / NIT

L1. Windows geteuid() PID overflow (pre-existing) -- db/c_test.c:27
  • ((int)GetCurrentProcessId() << 16) overflows if PID > 65535. Pre-existing, not introduced by this PR.
L2. Consider documenting old SyncPoint preservation -- env/io_posix.cc:1124
  • Both skip_io_uring_get_sqe and io_uring_get_sqe SyncPoints exist; a brief comment explaining backward compat would help.

Cross-Component Analysis

The io_uring fix is verified correct: when skip_get_sqe=true, sqe stays nullptr, and the existing null check returns IOStatus::Busy before reaching io_uring_prep_readv. No state machine violation. In release builds, TEST_SYNC_POINT_CALLBACK compiles away and the branch is dead-code-eliminated.

Positive Observations

  • The io_uring fix addresses a genuine bug where calling io_uring_get_sqe() then nullifying the pointer left stale SQ state affecting subsequent AbortIO tests.
  • The null_db path now includes a truncation check via CheckCondition.
  • The Python tool has proper process group management, signal handling, and cleanup.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

@meta-codesync

meta-codesync Bot commented Jun 9, 2026

Copy link
Copy Markdown

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D107758097.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 1e5cfc6


Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 1e5cfc6


Summary

This PR adds a well-designed parallel gtest flake reproduction tool, fixes a real flaky test bug in ReadAsyncQueueFull, and improves c_test.c path isolation for concurrent execution. The io_uring sync point fix is the most important change -- the old approach left stale io_uring state that could corrupt later tests. The c_test changes and Python tool are straightforward and well-motivated. Overall, a solid PR with one medium issue to address.

High-severity findings (0):
No high-severity findings.

Medium-severity findings (1):

  • [db/c_test.c:35] GetTestId() on Windows returns only (uint64_t)geteuid() without incorporating getpid(), unlike the Unix path. Since the Windows geteuid() already mixes GetCurrentProcessId() and GetCurrentThreadId(), this may be intentional, but the asymmetry with the #else branch is surprising and the logic differs from the non-Windows case (no PID XOR).
Full review (click to expand)

Findings

🟡 MEDIUM

M1. Windows GetTestId() asymmetry -- db/c_test.c:35
  • Issue: On non-Windows, GetTestId() computes ((uint64_t)geteuid() << 32) ^ (uint64_t)getpid(), combining user ID and process ID. On Windows, it returns just (uint64_t)geteuid(). The Windows geteuid() (line 24-31) already combines GetCurrentProcessId() << 16 | GetCurrentThreadId(), so PID is present. However, the Windows geteuid() uses only 32 bits of range (two 16-bit values), and the function returns int, which may truncate. Casting this int to uint64_t doesn't extend the entropy. This is likely fine in practice since GetCurrentProcessId() is already embedded, but the code structure is confusing -- a reader would expect the Windows branch to also explicitly incorporate PID as the non-Windows branch does.
  • Root cause: Platform-specific geteuid() on Windows already embeds PID, making the explicit ^ getpid() unnecessary. But the asymmetry makes the code harder to reason about.
  • Suggested fix: Add a comment in the #ifdef OS_WIN branch explaining that geteuid() already incorporates GetCurrentProcessId(), so no separate PID mixing is needed. Alternatively, make both branches structurally similar for clarity.

🟢 LOW / NIT

L1. null_db behavioral change -- db/c_test.c:4953
  • Issue: The null_db variable changed from a string literal "rocksdb_c_test_null_service" (a relative path) to GetTempDir() + "/rocksdb_c_test_null_service-<id>" (an absolute path under the temp directory). The old code created the DB in the CWD, which could leave artifacts and conflict with parallel runs. The new behavior is strictly better.
L2. No unit tests for gtest_parallel_repro.py
  • Issue: The 627-line Python tool has no test suite. Helper functions like extract_failure_keys, parse_env, resolve_cpu_list are readily unit-testable. Consider adding in a follow-up.
L3. Old sync point preserved but unused -- env/io_posix.cc
  • Issue: The old sync point name PosixRandomAccessFile::ReadAsync:io_uring_get_sqe is preserved inside the if (!skip_get_sqe) block but no test references it anymore. Harmless dead code -- consider removing in a follow-up cleanup.

Cross-Component Analysis

Context Affected? Analysis
Thread-local io_uring YES Fix correctly avoids consuming an SQ slot. io_uring is thread-local (IORING_SETUP_SINGLE_ISSUER), no cross-thread concerns.
Parallel test execution YES GetTestId() and gtest_parallel_repro.py both target this. UID+PID combination ensures path uniqueness.
Production builds NO TEST_SYNC_POINT_CALLBACK compiles to nothing with NDEBUG. Zero performance impact.

Positive Observations

  • The io_uring fix correctly addresses the root cause: overwriting the SQE pointer after io_uring_get_sqe() consumed a slot left stale submissions in the ring.
  • The Python tool has proper signal handling, process group management (start_new_session=True + os.killpg), and cleanup.
  • The null_db snprintf bounds check is correct defensive coding.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant