Merged
42 commits
3421db4
feat(npi): modularize skills, add verification utility and orchestrat…
kislaykishore Jun 14, 2026
654fd00
feat(npi): add gcsfuse-npi-runner agent definition file
kislaykishore Jun 14, 2026
b27467c
feat(npi): update runner agent prompt to enforce user input targets
kislaykishore Jun 14, 2026
290ec4e
feat(npi): update runner agent prompt to check active run state on st…
kislaykishore Jun 14, 2026
55aaa24
feat(npi): dynamically resolve go version from GCSFuse go.mod during …
kislaykishore Jun 14, 2026
1ebe734
feat(npi): add failure analysis guidelines for conformance testing
kislaykishore Jun 14, 2026
9c1289f
docs(npi): exclude emulator_tests from conformance guide to prevent hang
kislaykishore Jun 15, 2026
89a2b57
perf(npi): reduce log inactivity check from 60 mins to 5 mins
kislaykishore Jun 15, 2026
d65eeb5
feat(npi): incorporate conformance and performance stall monitoring g…
kislaykishore Jun 15, 2026
a3b7725
docs(npi): separate GKE and GCE performance runs and remove direct co…
kislaykishore Jun 15, 2026
e5232c3
docs(npi): add cross-target comparison and advisory-only warnings to …
kislaykishore Jun 15, 2026
9bbe4d8
chore(npi): remove hardcoded personal VM, bucket, and path strings fr…
kislaykishore Jun 15, 2026
cff67dc
fix(npi): relax total_tests check in verify_agent_workflow.py to a mi…
kislaykishore Jun 15, 2026
66dda45
docs(npi): update runner agent prompt to support list of targets in a…
kislaykishore Jun 15, 2026
16ee1a4
docs(npi): update skills to support multiple target verification repo…
kislaykishore Jun 15, 2026
f251783
docs(npi): clarify that conformance tests are only supported on GCE t…
kislaykishore Jun 15, 2026
edd19bc
fix(npi): update benchmark query scripts and skill report templates t…
kislaykishore Jun 15, 2026
37e55e1
fix(npi): remove LRO ON TLS handshake failure special-case handler fr…
kislaykishore Jun 15, 2026
41b75b0
feat(npi): support tmpfs RAM disk fallback buffer of 600GB if SSDs ar…
kislaykishore Jun 15, 2026
57cf831
revert(npi): restore targets.json to its original template state from…
kislaykishore Jun 15, 2026
a474aac
docs(npi): allow concurrent target runs in different regions/hosts in…
kislaykishore Jun 15, 2026
5edd292
revert(npi): restore Sequential Execution constraint in runner agent …
kislaykishore Jun 15, 2026
2c05674
fix(npi): resolve code review feedback on metadata parsing and timeou…
kislaykishore Jun 15, 2026
80b764b
fix(npi): resolve code review comments regarding test count threshold…
kislaykishore Jun 15, 2026
8111636
docs(npi): specify Linux-only environment support in runner agent
kislaykishore Jun 15, 2026
89fa5f3
fix(npi): resolve code review comments on JSON type assertions, safe …
kislaykishore Jun 15, 2026
61cd1c3
fix(npi): resolve code review on GKE template metadata initialization…
kislaykishore Jun 15, 2026
bf914f1
fix(npi): sanitize inputs and whitelist parameters to prevent path tr…
kislaykishore Jun 16, 2026
2505004
fix(npi): validate all conformance results JSON files instead of only…
kislaykishore Jun 16, 2026
6c137c7
fix(npi): limit fallback tmpfs memory volume to 500GB to leave system…
kislaykishore Jun 16, 2026
70ae2cd
fix(npi): use startswith to precisely match billing-project in GKE mo…
kislaykishore Jun 16, 2026
6fba98e
fix(npi): support comma-separated GKE mount options and always verify…
kislaykishore Jun 16, 2026
15fbb5b
fix(npi): normalize whitespace around '=' in GKE mount options to pre…
kislaykishore Jun 16, 2026
76b6aef
style(npi): align codebase with PEP 8 import guidelines in build_imag…
kislaykishore Jun 16, 2026
2edfc2f
Refactor NPI validation skills and agent specs to support reusing exi…
kislaykishore Jun 20, 2026
18e72d1
Remove automatic billing-project injection from GKE job orchestrator
kislaykishore Jun 20, 2026
71a3dba
docs/feat(npi): document mandatory integration test flags and add rob…
kislaykishore Jun 20, 2026
76cff1c
docs(npi): codify interactive plan checkpoint and smoke-test verifica…
kislaykishore Jun 20, 2026
ec1f2de
docs(npi): enhance Interactive Plan Checkpoint to require detailed st…
kislaykishore Jun 20, 2026
edb6cfa
docs(npi): document host-info run and correct BigQuery verification t…
kislaykishore Jun 20, 2026
254bfe5
docs(npi): add host-info query examples and hardware profile reportin…
kislaykishore Jun 20, 2026
2904bdc
feat(npi): refactor run_conformance.sh to be fully dynamic and suppor…
kislaykishore Jun 20, 2026
57 changes: 57 additions & 0 deletions npi/.gemini/agents/gcsfuse-npi-runner.md
@@ -0,0 +1,57 @@
---
name: gcsfuse-npi-runner
description: "Subagent that orchestrates and executes the end-to-end GCSFuse NPI validation pipeline sequentially: Conformance Testing -> Performance Benchmarking -> Analysis & Report -> Remediation."
enable_write_tools: true
enable_subagent_tools: true
enable_mcp_tools: true
---

# GCSFuse NPI Runner Agent

You are a specialized GCSFuse NPI Runner agent. Your mission is to execute the complete New Product Introduction (NPI) validation workflow sequentially against GCE VM and GKE cluster targets.

## Workflow Sequence
You must run the workflow stages strictly in the following sequential order:
1. **SSH Connection Prep**: Clean up any stale sockets and establish persistent multiplexed SSH connections.
2. **Conformance Testing**: Clone the GCSFuse repo and execute the integration test suite on the target VM, producing `conformance_results.json`.
3. **Performance Benchmarking**: Build/push benchmarking images, run the benchmark suite via `npi_orchestrator.py`, and upload metrics to BigQuery.
4. **Analysis**: Extract metrics, compare throughput/latency against baselines, and compile `npi_validation_report.md`.
5. **Remediation**: Analyze conformance failures and configuration mismatches, producing `npi_remediation_plan.md`.
6. **Verification**: Execute `verify_agent_workflow.py` to programmatically verify all deliverables are valid.

## Key Constraints
- **Interactive Plan Summary Checkpoint**: Before executing any high-overhead, long-running, or resource-intensive operations (such as compiling GCSFuse, triggering Cloud Builds via `build_images.py`, launching remote conformance tests, or starting orchestrator runs), you **MUST** present a clear, structured Plan Summary/Proposal to the user in the chat and explicitly wait for their approval. Do not proceed with execution until you receive explicit user confirmation. The proposal **MUST** include a detailed technical analysis covering:
  1. **Storage Buffer Analysis**: Perform an explicit analysis of each target's hardware; specify whether you will construct a RAID0 SSD array (e.g. if local SSDs are present but unmounted) or fall back to memory volumes (`tmpfs` RAM disk) as the performance test buffer.
  2. **GCS Bucket Details**: Specify which GCS buckets will be used, their type (zonal vs. regional), and whether they already exist or you will create them (ensuring HNS is enabled and they are correctly colocated with their compute targets).
  3. **Run Details & Configurations**: Detail the GCSFuse version/branch, Go compilation version, exact scope of the runs (e.g. full suite vs. smoke test), iterations, and whether the FIO performance matrices have been minimized.
  4. **Target Environment Readiness**: Detail the readiness status of the target VMs (e.g. SSH multiplexing sockets, Go/Docker installation status, Docker group authorization, and GKE cluster node topology).
- **Smoke Test Matrix Verification**: If the task or user request specifies a "smoke test" or "minimal" performance run, you **MUST** modify the local FIO matrix files (`fio/read_matrix.csv` and `fio/write_matrix.csv`) to a single, minimal configuration *before* triggering the Docker image build. You must restore the original matrix files via `git restore` immediately after the build is initiated to keep the repository clean.
- **Sequential Execution**: Do not run conformance testing and performance benchmarking concurrently on target VMs to avoid resource contention.
- **Linux Environment Only**: The validation runner, scripts, and skills are designed and supported exclusively for Linux operating systems. Do not attempt to run or adapt commands for other environments (e.g., macOS or Windows).
- **Socket Cleanup**: Stale socket files (`~/.ssh/sockets/<target>.sock`) must be checked and deleted before establishing master SSH connections.
- **Agnostic Code**: Do not hardcode VM or cluster names in execution scripts. Keep configurations dynamic by reading them from the target inputs in `targets.json`.
- **User-Defined Targets**: You must not guess or auto-discover target GCE VM names, GKE cluster names, or GCS bucket names. You must explicitly extract these details from the user's prompt or request and write them to `targets.json`.
- **Check Active State**: Before establishing the SSH connections or starting a benchmark run, check if `~/.npi/npi_run_state.json` exists locally. If it exists and contains target statuses from an active or previous run (e.g. `RUNNING` or `SUCCESS`), notify the user of that run state, and ask if they would like to re-attach/resume or trigger a clean reset (using `--reset`).
- **Analyze Permission Failures**: Conformance tests are expected to have failures due to intentionally restricted permissions. Do not block the pipeline trying to resolve these or force all tests to pass. Instead, analyze the failure reasons (e.g., identify which service accounts lack which GCS permissions) and detail them clearly in `npi_validation_report.md`.
- **Stall Monitoring**: Monitor both conformance tests and performance benchmarks for stalls. For conformance tests, verify that `~/integration_tests.log` size increases. If the log size remains unchanged for more than 5 minutes while the `go test` process is running, consider it stalled, immediately terminate the run, force-unmount leftovers, clean up temp directories to reclaim inodes, and document the details. For performance benchmarks, ensure `npi_orchestrator.py` has `MAX_INACTIVITY_SECS` configured appropriately (typically 14400s or 4 hours for full runs) so it auto-aborts and reports hangs.
- **No Automated Remediation**: Do not automatically perform or execute any remediation steps on the GCE VMs or GKE nodes. Document findings and suggest remediation recommendations in `npi_remediation_plan.md` as an advisory, but do not apply or execute them.
- **Independent Target Evaluation**: Unless otherwise specified, multiple benchmark runs executed together are separate and not directly comparable. Do not compare their metrics directly against each other. Present the performance results for each target in separate sections, evaluating each target independently against its own baseline, or performing intra-run comparisons (such as NUMA vs non-NUMA and gRPC vs HTTP/1) if no baseline is available.
- **RAM Buffer Fallback**: For targets without local SSDs (`has_ssd: false`), verify that the VM host has a nominal 600GB of RAM or more; because kernel overhead reduces the reported total, treat 550GB of detected memory as the minimum. If so, mount a 500GB memory volume (`tmpfs`) at the configured `buffer_mount` directory as the performance test buffer using the setup script. This leaves safe memory headroom for OS and daemon processes.
- **Host-Info Verification & Reporting**: You **MUST** be aware that the NPI runner automatically executes a `host_info` collector job (using the `host-info-collector` image) as the first step of any performance run to upload target machine specifications (CPU, memory, kernel, disks) to the `host_info` BigQuery table. During the analysis stage, you **MUST** query and verify this table to extract and document the target host's hardware profile (such as exact GCE machine type or GKE node kernel version) in the `System Specifications` section of `npi_validation_report.md`.
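
The conformance stall rule above can be sketched as a small size-tracking check. This is a minimal illustration, not the runner's actual implementation: `LogStallDetector` and its parameters are hypothetical names, and it assumes the caller samples the log periodically while `go test` is still running. It reads a local path for simplicity; on a remote target the size could equally come from a `stat` call over SSH.

```python
import os
import time

STALL_THRESHOLD_SECS = 5 * 60  # the log must grow at least once every 5 minutes


class LogStallDetector:
    """Tracks a log file's size and reports when it stops growing (hypothetical helper)."""

    def __init__(self, log_path, threshold_secs=STALL_THRESHOLD_SECS, clock=time.monotonic):
        self.log_path = log_path
        self.threshold_secs = threshold_secs
        self.clock = clock
        self.last_size = -1              # forces the first sample to count as a change
        self.last_change = clock()

    def check(self):
        """Return True if the size has been unchanged for longer than the threshold."""
        size = os.path.getsize(self.log_path) if os.path.exists(self.log_path) else 0
        now = self.clock()
        if size != self.last_size:
            # The log grew (or was first observed): reset the inactivity timer.
            self.last_size = size
            self.last_change = now
            return False
        return (now - self.last_change) > self.threshold_secs
```

A caller would invoke `check()` on a fixed interval and, once it returns `True`, carry out the termination and cleanup steps listed in the constraint.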

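The RAM buffer fallback rule can likewise be written as a small decision function. This is a sketch under stated assumptions: `choose_buffer` is a hypothetical name, the thresholds are taken from the constraint text, and returning `None` when neither buffer is available is an illustrative choice because the agent definition does not specify that case.

```python
MIN_DETECTED_RAM_GB = 550  # 600GB nominal, less kernel overhead
TMPFS_SIZE_GB = 500        # leaves headroom for the OS and daemon processes


def choose_buffer(has_ssd, detected_ram_gb):
    """Return (buffer_kind, tmpfs_size_gb), or None if no buffer option fits.

    Hypothetical helper: the None case is an assumption, not documented behavior.
    """
    if has_ssd:
        return ("ssd", None)
    if detected_ram_gb >= MIN_DETECTED_RAM_GB:
        return ("tmpfs", TMPFS_SIZE_GB)
    return None
```

For example, a target with `has_ssd: false` and 560GB of detected memory would get a 500GB `tmpfs` buffer, while one with 400GB would have no fallback and should be raised with the user.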

## Required Input Parameters
Before starting execution, extract the list of target validation environments from the user's request:
- **Validation Targets**: A list of one or more targets to run. Each target can be GCE (VM name, zone, bucket, BQ dataset, buffer mount SSD options) or GKE (cluster name, location, VM name, zone, bucket, BQ dataset, node selector, etc.), in any combination (e.g., multiple GCE, multiple GKE, or a mix of both).

If the target configuration is missing or ambiguous in the request, ask the user to specify it. Once collected, write the entire list of targets to `targets.json` to parameterize the execution.
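
One way to enforce this before writing the file is a small completeness check, sketched below. The field names (`type`, `vm_name`, `cluster_name`, and so on) are hypothetical and chosen only for illustration; the real schema is whatever the repository's `targets.json` template defines.

```python
import json

# Hypothetical field names; adapt these to the actual targets.json template.
REQUIRED_FIELDS = {
    "gce": ("vm_name", "zone", "bucket", "bq_dataset"),
    "gke": ("cluster_name", "location", "bucket", "bq_dataset"),
}


def missing_fields(target):
    """Return the required fields that are absent or empty for one target."""
    kind = target.get("type")
    if kind not in REQUIRED_FIELDS:
        return ["type"]
    return [name for name in REQUIRED_FIELDS[kind] if not target.get(name)]


def write_targets(targets, path="targets.json"):
    """Write the target list only if every target is fully specified."""
    problems = {}
    for index, target in enumerate(targets):
        missing = missing_fields(target)
        if missing:
            problems[index] = missing
    if problems:
        # Never guess: surface the gaps so the user can be asked for them.
        raise ValueError(f"missing target fields: {problems}")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(targets, handle, indent=2)
```

Raising instead of filling in defaults mirrors the rule that target names must come from the user rather than from guesses or auto-discovery.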

## Skills & Methods
Refer to the modular skills in the workspace for step-by-step guidance:
- SSH Connection: `.gemini/skills/ssh-connection-management/SKILL.md`
- Conformance: `.gemini/skills/conformance-testing/SKILL.md`
- Build & Setup: `.gemini/skills/benchmark-build-setup/SKILL.md`
- Benchmarking: `.gemini/skills/benchmark-suite-execution/SKILL.md`
- Analysis: `.gemini/skills/analysis-report-generation/SKILL.md`
- Remediation: `.gemini/skills/remediation-advisor/SKILL.md`
166 changes: 166 additions & 0 deletions npi/.gemini/skills/analysis-report-generation/SKILL.md
@@ -0,0 +1,166 @@
---
name: analysis-report-generation
description: Guides on querying benchmark results from BigQuery, comparing performance metrics (throughput/latency) against baselines, and generating a structured validation report.
---

# GCSFuse NPI Analysis & Report Generation

This skill guides you through querying benchmark results from BigQuery tables, performing analysis on throughput and latency trends against historical baselines, verifying machine type configuration optimizations, and compiling the findings into a standard `npi_validation_report.md`.

## Prerequisites

1. **GCP/BigQuery Access**: The environment must have access to the BigQuery dataset containing the benchmark outputs.
2. **Baseline Datasets (Optional)**: If a baseline is available, make sure you know the baseline dataset ID (e.g. `npi_benchmarks_baseline_lro_on` or similar) and the newly generated run's dataset ID. If no baseline dataset is present, the report must still be generated using intra-run comparisons.
3. **GCSFuse Source Code**: Access to GCSFuse code is required to inspect `params.yaml` for machine type verification.

## Step-by-Step Procedure

### Step 1: Query BigQuery Results

Retrieve performance and system metadata from the respective tables:
* **`host_info`**: Query to extract target system specs and hardware profiles (e.g. CPU, RAM, kernel version, disk layout).
* **`fio_<benchmark>` / `go_client_read_<config>`**: Query to extract raw performance data.

> [!NOTE]
> **Querying Host Specifications**:
> To retrieve the host hardware profile for your report, run:
> ```sql
> SELECT
> run_timestamp,
> cpu_arch,
> num_cpus,
> num_numa_nodes,
> kernel_version,
> ram_bytes,
> num_local_ssds
> FROM
> `<PROJECT_ID>.<DATASET_ID>.host_info`
> ORDER BY run_timestamp DESC
> LIMIT 1
> ```

> [!IMPORTANT]
> **JSON Key Spacing**: In the FIO JSON output, the version is stored under the key `"fio version"` (with a space). Always query it using the quoted format: `JSON_VALUE(fio_json_output, '$."fio version"')` to avoid returning `NULL`.

Run queries using the `bq` CLI or a Python BigQuery client:
```bash
bq query --project_id=<PROJECT_ID> --use_legacy_sql=false \
"SELECT
run_timestamp,
iteration,
JSON_VALUE(fio_json_output, '\$.\"fio version\"') AS fio_version,
AVG(SAFE_CAST(JSON_VALUE(job.read.bw) AS FLOAT64)) * 1024.0 / 1000000.0 AS avg_read_bw_mb,
AVG(SAFE_CAST(JSON_VALUE(job.write.bw) AS FLOAT64)) * 1024.0 / 1000000.0 AS avg_write_bw_mb,
AVG(SAFE_CAST(JSON_VALUE(job.read.clat_ns.mean) AS FLOAT64)) / 1000000.0 AS avg_read_clat_ms
FROM
\`<PROJECT_ID>.<DATASET_ID>.<TABLE_ID>\`,
UNNEST(JSON_EXTRACT_ARRAY(fio_json_output.jobs)) AS job
GROUP BY 1, 2, 3
ORDER BY run_timestamp DESC"
```
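
The arithmetic in the query's `SELECT` list can be checked in isolation. The sketch below assumes, as the `* 1024.0 / 1000000.0` factor implies, that fio reports `bw` in KiB/s and `clat_ns` in nanoseconds; the function names are hypothetical.

```python
def kib_per_sec_to_mb_per_sec(bw_kib):
    """Convert fio bandwidth in KiB/s to decimal megabytes per second (MB/s)."""
    return bw_kib * 1024.0 / 1_000_000.0


def kib_per_sec_to_mib_per_sec(bw_kib):
    """Convert fio bandwidth in KiB/s to binary mebibytes per second (MiB/s)."""
    return bw_kib / 1024.0


def ns_to_ms(latency_ns):
    """Convert a completion latency from nanoseconds to milliseconds."""
    return latency_ns / 1_000_000.0
```

MB/s and MiB/s differ by roughly 4.9% for the same measurement, so report tables should be labelled with the unit the query actually produced (MB/s for the query above).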

### Step 2: Compare Against Baselines & Perform Intra-Run Analysis

#### 1. Compare Against Baselines (If Baseline Dataset is Available)
If a baseline dataset is available, execute comparison scripts (e.g. `query_results.py`) or calculate the percentage difference in throughput and latency between the baseline dataset and the new run's dataset.

> [!IMPORTANT]
> **No Cross-Target Comparisons (Default)**: Performance results from different targets (e.g., GKE Node runs vs GCE VM runs) represent distinct platforms and are not directly comparable by default. Do not compare them against each other, compute cross-target deltas, or label differences between them as regressions, **unless the user explicitly requests a cross-target platform comparison**. If explicitly requested, you may compare the environments and include a dedicated section in the final report.

Example Comparison Matrix:
| Protocol | Baseline Throughput (MiB/s) | New Run Throughput (MiB/s) | Delta (%) | Status |
| :--- | :--- | :--- | :--- | :--- |
| HTTP/1.1 | 1240.5 | 1235.2 | -0.4% | Neutral |
| gRPC | 3450.0 | 2890.5 | -16.2% | **REGRESSION** |
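
The `Delta (%)` and `Status` columns can be derived as in the following sketch. The 5% regression cutoff is a hypothetical value chosen for illustration, since no threshold is defined here, and the classification applies to throughput only (for latency, an increase rather than a decrease is the regression).

```python
REGRESSION_THRESHOLD_PCT = -5.0  # hypothetical cutoff; not defined by this skill


def percent_delta(baseline, current):
    """Percentage change of the current value relative to the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline * 100.0


def classify_throughput(delta_pct, threshold=REGRESSION_THRESHOLD_PCT):
    """Label a throughput delta; drops at or beyond the threshold are regressions."""
    return "REGRESSION" if delta_pct <= threshold else "Neutral"
```

Applied to the example rows, `percent_delta(1240.5, 1235.2)` rounds to -0.4 and `percent_delta(3450.0, 2890.5)` rounds to -16.2, matching the table.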

#### 2. Perform Intra-Run Comparisons (Always Recommended)
Whether or not a baseline dataset is present, perform intra-run comparisons to analyze and highlight the relative performance gains under different configurations:

* **gRPC vs HTTP/1.1**:
  - Compare the performance of gRPC against HTTP/1.1 under the same test workload in the run.
  - Quantify the throughput gain (or loss) and latency delta when using gRPC compared to HTTP/1.1.
* **NUMA binding vs non-NUMA binding**:
  - Compare performance metrics (throughput and latency) between runs executed with NUMA binding enabled and runs executed without it.
  - Highlight the percentage improvement or degradation introduced by NUMA binding.

Example Intra-Run Comparison Matrix:
| Comparison Type | Configuration A | Configuration B | Throughput A (MiB/s) | Throughput B (MiB/s) | Delta (%) | Status / Insight |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Protocol | HTTP/1.1 | gRPC | 1235.2 | 2890.5 | +134.0% | gRPC shows expected scaling |
| NUMA Binding | Non-NUMA | NUMA-Bound | 2500.0 | 2890.5 | +15.6% | NUMA binding improves throughput |

### Step 3: Verify Machine Type Configuration

Verify whether the GCE VM or GKE node machine type (e.g., `c4-standard-96`) is classified under the high-performance machine types in the main GCSFuse repository:
1. Locate `params.yaml` in the cloned GCSFuse repository.
2. Search for the machine family or type.
3. If missing, note it in the validation report as a required follow-up task (PR creation to add family).
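
The search in step 2 can be automated with a plain text scan, sketched below. It makes no assumption about the internal structure of `params.yaml`; it only checks whether the exact machine type or its family prefix appears anywhere in the file, so a match still needs a manual look at the surrounding section. `machine_type_listed` is a hypothetical helper name.

```python
import re
from pathlib import Path


def machine_type_listed(params_path, machine_type):
    """Report whether the exact type or its family (e.g. 'c4') appears in the file."""
    text = Path(params_path).read_text(encoding="utf-8")
    family = machine_type.split("-", 1)[0]
    # Word boundaries keep 'c4' from matching a different family such as 'c4a'.
    family_found = re.search(rf"\b{re.escape(family)}\b", text) is not None
    return {"exact": machine_type in text, "family": family_found}
```

If both values are `False`, record the follow-up task from step 3 in the validation report.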

### Step 4: Generate `npi_validation_report.md`

Compile the queried results, baseline comparison, and machine family configuration verification into `npi_validation_report.md`.

For each target validation environment executed (GCE VM or GKE Cluster), create a separate section and performance table to isolate their metrics and prevent incorrect direct comparisons.

The report must follow this structure:
```markdown
# GCSFuse NPI Validation Report

## Executive Summary
[Brief description of whether the run meets performance criteria and if any regressions/failures were detected.]

## Run Details
- **Timestamp**: [ISO 8601 Timestamp]
- **Target Platforms**: [List of all target names, e.g. GCE VM target-1, GKE Cluster target-2, etc.]

## System Specifications (Hardware Profile)
Query the `host_info` table for each target to populate this hardware profile:
| Target Name | Platform Type | OS & Kernel | CPU (Model & Cores) | Total RAM (GB) | Disk Buffer / Cache (Type & Size) | TPU Accelerator (Topology) |
|---|---|---|---|---|---|---|
| `kislayk-npi2` | GCE VM | Linux 6.1.0 | Intel Xeon (96 cores) | 360 GB | RAID0 SSD (/mnt/lssd, 2.9TB) | N/A |
| `gke-orbax-benchmark-cluster` | GKE Cluster | Linux 6.1.0 | AMD EPYC (64 cores) | 600 GB | Memory Volume (tmpfs, 500GB) | TPU v6e (2x2 topology, 4 chips) |

## Target Performance Results

### [TARGET_NAME_1] (Platform Type, e.g., GCE VM)
- **GCSFuse Version**: [e.g. v3.9.0]
- **Target Bucket**: [RAPID / Regional]
- **Performance Metrics Comparison**:

#### Baseline Performance Comparison (If Baseline is Available)
| Benchmark / Protocol | Baseline (Version) | Current Run (Version) | Delta (%) | Status |
|---|---|---|---|---|
| HTTP1 Read | 1250 MB/s | 1240 MB/s | -0.8% | PASS |
| gRPC Read | 3500 MB/s | 2800 MB/s | -20.0% | FAIL (Regression) |

#### Intra-Run Performance Analysis (If Applicable)
Provide these comparisons if the corresponding protocols or NUMA configurations were executed in the run:

##### gRPC vs HTTP/1.1 Protocol Comparison
| Metric | HTTP/1.1 | gRPC | Delta (%) | Observation |
|---|---|---|---|---|
| Read Throughput | 1240 MB/s | 2800 MB/s | +125.8% | gRPC significantly outperforms HTTP/1.1 |
| Read Latency (mean) | 0.012 ms | 0.005 ms | -58.3% | gRPC shows lower latency |

##### NUMA Binding vs Non-NUMA Binding Analysis
| Protocol / Workload | Non-NUMA Bound | NUMA Bound | Delta (%) | Observation |
|---|---|---|---|---|
| gRPC Read Throughput | 2400 MB/s | 2800 MB/s | +16.7% | NUMA binding improves throughput |
| gRPC Read Latency | 0.006 ms | 0.005 ms | -16.7% | NUMA binding reduces latency |

### Cross-Target Platform Comparison (Only If Explicitly Requested by User)
If the user explicitly requested a comparison between different targets (e.g., GCE VM vs GKE Cluster), compile their metrics into a side-by-side comparison table here:

| Metric / Workload | [TARGET_NAME_1] (e.g., GCE VM) | [TARGET_NAME_2] (e.g., GKE Cluster) | Delta (%) | Status / Observation |
|---|---|---|---|---|
| gRPC Read Throughput | 2800 MB/s | 3100 MB/s | +10.7% | GKE cluster shows higher peak throughput |
| gRPC Read Latency | 0.005 ms | 0.004 ms | -20.0% | GKE cluster shows lower latency |

## High-Performance Machine Type Classification
- **Machine Type Used**: `c4-standard-96`
- **Configured in `params.yaml`?**: [Yes/No]
- **Action Required**: [None / Create PR in GCSFuse repo to add the machine type]

## Observations & Issues
- [Detail any errors observed, e.g., TLS Handshake Errors, GKE OOMs, Direct Path fallback issues.]
```