GoogleCloudPlatform · kislaykishore · Jun 20, 2026 · Jun 14, 2026 · Jun 14, 2026 · Jun 14, 2026
diff --git a/npi/.gemini/agents/gcsfuse-npi-runner.md b/npi/.gemini/agents/gcsfuse-npi-runner.md
@@ -0,0 +1,47 @@
+---
+name: gcsfuse-npi-runner
+description: "Subagent that orchestrates and executes the end-to-end GCSFuse NPI validation pipeline sequentially: Conformance Testing -> Performance Benchmarking -> Analysis & Report -> Remediation."
+enable_write_tools: true
+enable_subagent_tools: true
+enable_mcp_tools: true
+---
+
+# GCSFuse NPI Runner Agent
+
+You are a specialized GCSFuse NPI Runner agent. Your mission is to execute the complete Node Platform Integration (NPI) validation workflow sequentially against GCE VM and GKE cluster targets.
+
+## Workflow Sequence
+You must run the workflow stages strictly in the following sequential order:
+1.  **SSH Connection Prep**: Clean up any stale sockets and establish persistent multiplexed SSH connections.
+2.  **Conformance Testing**: Clone the GCSFuse repo and execute the integration test suite on the target VM, producing `conformance_results.json`.
+3.  **Performance Benchmarking**: Build/push benchmarking images, run the benchmark suite via `npi_orchestrator.py`, and upload metrics to BigQuery.
+4.  **Analysis**: Extract metrics, compare throughput/latency against baselines, and compile `npi_validation_report.md`.
+5.  **Remediation**: Analyze conformance failures and configuration mismatches, producing `npi_remediation_plan.md`.
+6.  **Verification**: Execute `verify_agent_workflow.py` to programmatically verify all deliverables are valid.
+
+## Key Constraints
+- **Sequential Execution**: Do not run conformance testing and performance benchmarking concurrently on target VMs to avoid resource contention.
+- **Socket Cleanup**: Stale socket files (`~/.ssh/sockets/<target>.sock`) must be checked and deleted before establishing master SSH connections.
+- **Agnostic Code**: Do not hardcode VM or cluster names in execution scripts. Read them from `targets.json` so the scripts stay target-agnostic.
+- **User-Defined Targets**: You must not guess or auto-discover target GCE VM names, GKE cluster names, or GCS bucket names. You must explicitly extract these details from the user's prompt or request and write them to `targets.json`.
+- **Check Active State**: Before executing the SSH connections or starting a benchmark run, check if `~/.npi/npi_run_state.json` exists locally. If it exists and contains active target statuses (e.g. `RUNNING` or `SUCCESS`), notify the user of the active/previous run state, and ask if they would like to re-attach/resume or trigger a clean reset (using `--reset`).
+- **Analyze Permission Failures**: Conformance tests are expected to have failures due to intentionally restricted permissions. Do not block the pipeline trying to resolve these or force all tests to pass. Instead, analyze the failure reasons (e.g., identify which service accounts lack which GCS permissions) and detail them clearly in `npi_validation_report.md`.
+- **Stall Monitoring**: Monitor both conformance tests and performance benchmarks for stalls. For conformance tests, verify that `~/integration_tests.log` size increases. If the log size remains unchanged for more than 5 minutes while the `go test` process is running, consider it stalled, immediately terminate the run, force-unmount leftovers, clean up temp directories to reclaim inodes, and document the details. For performance benchmarks, ensure `npi_orchestrator.py` has `MAX_INACTIVITY_SECS` configured appropriately (typically 14400s or 4 hours for full runs) so it auto-aborts and reports hangs.
+- **No Automated Remediation**: Do not automatically perform or execute any remediation steps on the GCE VMs or GKE nodes. Document findings and suggest remediation recommendations in `npi_remediation_plan.md` as an advisory, but do not apply or execute them.
+- **Independent Target Evaluation**: Unless otherwise specified, multiple benchmark runs executed together are separate and not directly comparable. Do not compare their metrics directly against each other. Present the performance results for each target in separate sections, evaluating each target independently against its own baseline.
+- **RAM Buffer Fallback**: For targets without local SSDs (`has_ssd: false`), verify that the VM host has at least 600GB of RAM. If so, mount a 600GB memory volume (`tmpfs`) at the configured `buffer_mount` directory as the performance test buffer using the setup script.
+
+## Required Input Parameters
+Before starting execution, extract the list of target validation environments from the user's request:
+- **Validation Targets**: A list of one or more targets to run. Each target can be GCE (VM name, zone, bucket, BQ dataset, buffer mount SSD options) or GKE (cluster name, location, VM name, zone, bucket, BQ dataset, node selector, etc.), in any combination (e.g., multiple GCE, multiple GKE, or a mix of both).
+
+If the target configuration is missing or ambiguous in the request, ask the user to specify it. Once collected, write the entire list of targets to `targets.json` to parameterize the execution.
+
+## Skills & Methods
+Refer to the modular skills in the workspace for step-by-step guidance:
+- SSH Connection: `.gemini/skills/ssh-connection-management/SKILL.md`
+- Conformance: `.gemini/skills/conformance-testing/SKILL.md`
+- Build & Setup: `.gemini/skills/benchmark-build-setup/SKILL.md`
+- Benchmarking: `.gemini/skills/benchmark-suite-execution/SKILL.md`
+- Analysis: `.gemini/skills/analysis-report-generation/SKILL.md`
+- Remediation: `.gemini/skills/remediation-advisor/SKILL.md`
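Review note on the **Stall Monitoring** constraint above: the rule "log size unchanged for more than 5 minutes" can be sketched as a small size tracker. This is only an illustration under assumptions. `LogStallDetector` and its injectable `clock` are hypothetical names, not part of this PR, and the caller is assumed to obtain the size of `~/integration_tests.log` from the target VM by its own means.

```python
import time
from typing import Callable, Optional

STALL_THRESHOLD_SECS = 300  # 5 minutes, per the Stall Monitoring constraint


class LogStallDetector:
    """Hypothetical helper: reports when a log file has stopped growing."""

    def __init__(self, threshold_secs: float = STALL_THRESHOLD_SECS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold_secs = threshold_secs
        self.clock = clock
        self.last_size: Optional[int] = None
        self.last_change: float = clock()

    def observe(self, size: int) -> bool:
        """Record the current log size; return True if the log looks stalled."""
        now = self.clock()
        if self.last_size is None or size != self.last_size:
            # First sample, or the log changed: restart the inactivity window.
            self.last_size = size
            self.last_change = now
            return False
        # Size unchanged: stalled only once the window is strictly exceeded.
        return (now - self.last_change) > self.threshold_secs
```

A polling loop would call `observe()` with each sampled size while the `go test` process is alive, and treat the first `True` as the signal to terminate the run and start cleanup.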
diff --git a/npi/.gemini/skills/analysis-report-generation/SKILL.md b/npi/.gemini/skills/analysis-report-generation/SKILL.md
@@ -0,0 +1,97 @@
+---
+name: analysis-report-generation
+description: Guides on querying benchmark results from BigQuery, comparing performance metrics (throughput/latency) against baselines, and generating a structured validation report.
+---
+
+# GCSFuse NPI Analysis & Report Generation
+
+This skill guides you through querying benchmark results from BigQuery tables, performing analysis on throughput and latency trends against historical baselines, verifying machine type configuration optimizations, and compiling the findings into a standard `npi_validation_report.md`.
+
+## Prerequisites
+
+1.  **GCP/BigQuery Access**: The environment must have access to the BigQuery dataset containing the benchmark outputs.
+2.  **Baseline Datasets**: Ensure you know the baseline dataset ID (e.g. `npi_benchmarks_baseline_lro_on` or similar) and the newly generated run's dataset ID.
+3.  **GCSFuse Source Code**: Access to GCSFuse code is required to inspect `params.yaml` for machine type verification.
+
+## Step-by-Step Procedure
+
+### Step 1: Query BigQuery Results
+
+Retrieve performance data from the respective benchmark tables (e.g., `go_client_read_http1`, `go_client_read_grpc`, `fio_write_grpc`).
+
+> [!IMPORTANT]
+> **JSON Key Spacing**: In the FIO JSON output, the version is stored under the key `"fio version"` (with a space). Always query it using the quoted format: `JSON_VALUE(fio_json_output, '$."fio version"')` to avoid returning `NULL`.
+
+Run queries using the `bq` CLI or a python BigQuery client:
+```bash
+bq query --project_id=<PROJECT_ID> --use_legacy_sql=false \
+"SELECT
+  run_timestamp,
+  iteration,
+  JSON_VALUE(fio_json_output, '\$.\"fio version\"') AS fio_version,
+  AVG(SAFE_CAST(JSON_VALUE(job.read.bw) AS FLOAT64)) * 1024.0 / 1000000.0 AS avg_read_bw_mb,
+  AVG(SAFE_CAST(JSON_VALUE(job.write.bw) AS FLOAT64)) * 1024.0 / 1000000.0 AS avg_write_bw_mb,
+  AVG(SAFE_CAST(JSON_VALUE(job.read.clat_ns.mean) AS FLOAT64)) / 1000000.0 AS avg_read_clat_ms
+FROM
+  \`<PROJECT_ID>.<DATASET_ID>.<TABLE_ID>\`,
+  UNNEST(JSON_EXTRACT_ARRAY(fio_json_output.jobs)) AS job
+GROUP BY 1, 2, 3
+ORDER BY run_timestamp DESC"
+```
+
+### Step 2: Compare Against Baselines
+
+Execute comparison scripts (e.g. `query_results.py`) or calculate the percentage difference in throughput/latency between the baseline dataset and the new run's dataset.
+
+> [!IMPORTANT]
+> **No Cross-Target Comparisons**: Performance results from different targets (e.g., GKE Node runs vs GCE VM runs) represent distinct platforms and are not directly comparable. Do not compare them against each other, compute cross-target deltas, or label differences between them as regressions. Compare each environment exclusively against its own respective historical baseline.
+
+Example Comparison Matrix:
+| Protocol | Baseline Throughput (MB/s) | New Run Throughput (MB/s) | Delta (%) | Status |
+| :--- | :--- | :--- | :--- | :--- |
+| HTTP/1.1 | 1240.5 | 1235.2 | -0.4% | Neutral |
+| gRPC | 3450.0 | 2890.5 | -16.2% | **REGRESSION** |
+
+### Step 3: Verify Machine Type Configuration
+
+Check whether the GCE VM or GKE node machine type (e.g., `c4-standard-96`) is listed among the high-performance machine types in the main GCSFuse repository:
+1.  Locate `params.yaml` in the cloned GCSFuse repository.
+2.  Search for the machine family or type.
+3.  If missing, note it in the validation report as a required follow-up task (PR creation to add family).
+
+### Step 4: Generate `npi_validation_report.md`
+
+Compile the queried results, baselines comparison, and machine family configuration verification into `npi_validation_report.md`.
+
+For each target validation environment executed (GCE VM or GKE Cluster), create a separate section and performance table to isolate their metrics and prevent incorrect direct comparisons.
+
+The report must follow this structure:
+```markdown
+# GCSFuse NPI Validation Report
+
+## Executive Summary
+[Brief description of whether the run meets performance criteria and if any regressions/failures were detected.]
+
+## Run Details
+- **Timestamp**: [ISO 8601 Timestamp]
+- **Target Platforms**: [List of all target names, e.g. GCE VM target-1, GKE Cluster target-2, etc.]
+
+## Target Performance Results
+
+### [TARGET_NAME_1] (Platform Type, e.g., GCE VM)
+- **GCSFuse Version**: [e.g. v3.9.0]
+- **Target Bucket**: [RAPID / Regional]
+- **Performance Metrics vs Baseline**:
+| Benchmark / Protocol | Baseline (Version) | Current Run (Version) | Delta (%) | Status |
+|---|---|---|---|---|
+| HTTP1 Read | 1250 MB/s | 1240 MB/s | -0.8% | PASS |
+| gRPC Read | 3500 MB/s | 2800 MB/s | -20.0% | FAIL (Regression) |
+
+## High-Performance Machine Type Classification
+- **Machine Type Used**: `c4-standard-96`
+- **Configured in `params.yaml`?**: [Yes/No]
+- **Action Required**: [None / Create PR in GCSFuse repo to add the machine type]
+
+## Observations & Issues
+- [Detail any errors observed, e.g., TLS Handshake Errors, GKE OOMs, Direct Path fallback issues.]
+```
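Review note on Step 2 of the analysis skill above: the Delta (%) and Status columns can be computed as in the sketch below. It is an illustration only. The function names and the 5% tolerance are assumptions, since the PR does not state the threshold at which a drop counts as a regression, and the classification applies to throughput, where higher is better.

```python
def percent_delta(baseline: float, current: float) -> float:
    """Percentage change of the current run relative to its own baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline * 100.0


def classify_throughput(delta_pct: float, tolerance_pct: float = 5.0) -> str:
    """Label a throughput delta; tolerance_pct is a hypothetical threshold."""
    return "REGRESSION" if delta_pct < -tolerance_pct else "Neutral"


# Rows taken from the example comparison matrix in the skill.
rows = {"HTTP/1.1": (1240.5, 1235.2), "gRPC": (3450.0, 2890.5)}
for protocol, (baseline, current) in rows.items():
    delta = percent_delta(baseline, current)
    print(f"{protocol}: {delta:+.1f}% {classify_throughput(delta)}")
```

For latency the sign convention flips, because an increase is the unfavorable direction, so a separate classifier would be needed. Each target is compared only against its own baseline, in line with the cross-target rule.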
diff --git a/npi/.gemini/skills/benchmark-build-setup/SKILL.md b/npi/.gemini/skills/benchmark-build-setup/SKILL.md
@@ -0,0 +1,75 @@
+---
+name: benchmark-build-setup
+description: Guides on checking out GCSFuse, configuring targets (Docker/RAM disks), and building/pushing benchmarking images to Artifact Registry.
+---
+
+# Benchmark Build and Setup for GCSFuse NPI
+
+This skill guides you through checking out the GCSFuse repository, configuring target VMs with storage buffers and Docker, and building/pushing benchmark images to Google Artifact Registry.
+
+## Step 1: Clone/Prepare GCSFuse & Custom Matrices
+
+1.  **Clone / Verify GCSFuse Repository**:
+    Verify that the GCSFuse repository or sub-module is checked out locally to the expected branch or tag.
+2.  **Smoke-Test Matrix Customization**:
+    If running quick verification or smoke tests, edit the matrix files to execute only minimal iterations:
+    *   Edit: `fio/read_matrix.csv`
+    *   Edit: `fio/write_matrix.csv`
+    *(Note: Remember to run `git restore fio/read_matrix.csv fio/write_matrix.csv` after the images are built and pushed to avoid checking in modified matrices).*
+
+## Step 2: Configure Target VMs
+
+Configure the storage buffer and Docker workspace on each target VM using the established SSH master connection socket.
+
+### A. Configure Storage Buffer
+*   **Unified Buffer Setup (Local SSD or RAM Fallback)**:
+    Execute `raid0-script.sh` on the target VM, passing the target mount path (from `targets.json`'s `buffer_mount`) as the argument. The script will automatically build a RAID0 array from local SSDs if present. If no local SSDs are found, it will verify that the host has at least 600GB of RAM and mount a 600GB memory volume (`tmpfs`) at the mount path instead:
+    ```bash
+    # Copy script to target
+    scp -o ControlPath=~/.ssh/sockets/<TARGET_NAME>.sock -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/google_compute_engine raid0-script.sh <SSH_USER>@nic0.<VM_NAME>.<ZONE>.c.<PROJECT_ID>.internal.gcpnode.com:~/raid0-script.sh
+
+    # Run script with the target mount path argument (e.g. /mnt/lssd)
+    ssh -S ~/.ssh/sockets/<TARGET_NAME>.sock -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/google_compute_engine <SSH_USER>@nic0.<VM_NAME>.<ZONE>.c.<PROJECT_ID>.internal.gcpnode.com "bash ~/raid0-script.sh <SSD_MOUNT_PATH>"
+    ```
+
+### B. Install Docker & Configure Permissions
+Install Docker on the target VM and add the SSH user to the docker group:
+```bash
+ssh -S ~/.ssh/sockets/<TARGET_NAME>.sock -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/google_compute_engine <SSH_USER>@nic0.<VM_NAME>.<ZONE>.c.<PROJECT_ID>.internal.gcpnode.com "curl -fsSL https://get.docker.com -o get-docker.sh && sudo sh get-docker.sh && sudo usermod -aG docker \$USER && rm get-docker.sh"
+```
+
+**CRITICAL**: Since group memberships are only evaluated at session startup, recreate the SSH multiplexing socket to apply the docker group changes:
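+1.  Close the master connection and remove the socket: `ssh -S ~/.ssh/sockets/<TARGET_NAME>.sock -O exit <SSH_USER>@nic0.<VM_NAME>.<ZONE>.c.<PROJECT_ID>.internal.gcpnode.com; rm -f ~/.ssh/sockets/<TARGET_NAME>.sock`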
+2.  Re-establish the connection socket (using the `ssh-connection-management` skill).
+
+### C. Configure Registry Access on Target
+Enable the target VM docker daemon to pull images from Artifact Registry:
+```bash
+ssh -S ~/.ssh/sockets/<TARGET_NAME>.sock -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/google_compute_engine <SSH_USER>@nic0.<VM_NAME>.<ZONE>.c.<PROJECT_ID>.internal.gcpnode.com "gcloud auth configure-docker us-docker.pkg.dev -q"
+```
+
+## Step 3: Build & Push Benchmark Images
+
+Build and push GCSFuse benchmarking images (with FIO/Go-Client inside) to your Google Artifact Registry:
+
+1.  **Configure Registry Auth Locally**:
+    Ensure local credentials can write to the Artifact Registry:
+    ```bash
+    gcloud auth configure-docker us-docker.pkg.dev
+    ```
+2.  **Execute Build Script**:
+    ```bash
+    python3 build_images.py --project <PROJECT_ID> --image-version <IMAGE_VERSION> --gcsfuse-version <GCSFUSE_VERSION>
+    ```
+3.  **Restore Matrices**:
+    If you customized matrix files in Step 1, revert the local changes:
+    ```bash
+    git restore fio/read_matrix.csv fio/write_matrix.csv
+    ```
+
+## Step 4: Verify Image Availability
+
+Verify that the benchmark image is successfully pushed and available in Artifact Registry:
+```bash
+gcloud artifacts docker images list us-docker.pkg.dev/<PROJECT_ID>/gcsfuse-npi-images --include-tags --format='value(package,tags.list())' | grep "<IMAGE_VERSION>"
+```
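Review note on the SSH/SCP commands in Step 2 of the build skill above: the same long option list is repeated in every command, so a small argv builder may reduce copy-paste drift. The sketch below is illustrative, and the helper names are hypothetical. Note that `ssh -S` takes a control socket path, while `scp -S` names the program to run for the connection, so the sketch passes the socket to `scp` via `-o ControlPath=`.

```python
import os

# Host naming pattern as used by the commands in the skill.
HOST_TEMPLATE = "nic0.{vm_name}.{zone}.c.{project_id}.internal.gcpnode.com"
COMMON_OPTS = [
    "-o", "StrictHostKeyChecking=no",
    "-o", "UserKnownHostsFile=/dev/null",
    "-i", os.path.expanduser("~/.ssh/google_compute_engine"),
]


def socket_path(target_name: str) -> str:
    """Control socket location for a target's multiplexed connection."""
    return os.path.expanduser(f"~/.ssh/sockets/{target_name}.sock")


def remote(ssh_user: str, vm_name: str, zone: str, project_id: str) -> str:
    """Build the user@host string for a target VM."""
    host = HOST_TEMPLATE.format(vm_name=vm_name, zone=zone, project_id=project_id)
    return f"{ssh_user}@{host}"


def ssh_argv(target_name: str, dest: str, remote_cmd: str) -> list:
    """argv for running a command over the existing master connection."""
    return ["ssh", "-S", socket_path(target_name), *COMMON_OPTS, dest, remote_cmd]


def scp_argv(target_name: str, dest: str, local_path: str, remote_path: str) -> list:
    """argv for copying a file over the existing master connection."""
    control = f"ControlPath={socket_path(target_name)}"
    return ["scp", "-o", control, *COMMON_OPTS, local_path, f"{dest}:{remote_path}"]
```

The resulting lists can be passed to `subprocess.run(argv, check=True)` without a shell, which also avoids quoting problems in the remote command string.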
diff --git a/npi/.gemini/skills/benchmark-suite-execution/SKILL.md b/npi/.gemini/skills/benchmark-suite-execution/SKILL.md
@@ -0,0 +1,98 @@
+---
+name: benchmark-suite-execution
+description: Guides on configuring, executing, monitoring, and exporting results of GCSFuse NPI benchmark runs on GCE and GKE.
+---
+
+# Benchmark Suite Execution for GCSFuse NPI
+
+This skill guides you through defining target environments, executing benchmarks concurrently on GCE and GKE, monitoring execution for hangs/failures, and verifying BigQuery export status.
+
+## Step 1: Configuration & Targets Setup
+
+Before starting a run, collect target details and populate `targets.json` in the root configuration directory.
+
+### A. Collect Inputs
+1.  **Target Platforms**: Determine if running on GCE VM, GKE cluster, or both.
+2.  **GCE VM Details** (if applicable): VM Name, zone, SSD presence, SSD mount path (e.g. `/mnt/lssd` or `/tmp/npi_buffer`), and RAPID bucket usage.
+3.  **GKE Details** (if applicable): Intermediate VM details, Cluster name, region/zone, SSD/RAM configuration, RAPID bucket usage, node selectors, resource limits.
+4.  **GCS Buckets**: Target regional and/or RAPID (zonal) bucket names.
+5.  **GCP Project**: GCP Project ID (e.g. `gcs-fuse-test`).
+
+### B. Configure `targets.json`
+Populate `targets.json` with the corresponding target details. Format:
+```json
+[
+  {
+    "name": "gce-c4-ssd",
+    "type": "gce",
+    "vm_name": "<GCE_VM_NAME>",
+    "zone": "<GCE_ZONE>",
+    "bucket": "<REGIONAL_BUCKET>",
+    "dataset": "<BQ_DATASET_PREFIX>",
+    "buffer_mount": "<SSD_MOUNT_PATH>",
+    "has_ssd": true,
+    "is_rapid_bucket": false
+  },
+  {
+    "name": "gke-tpu-slice",
+    "type": "gke",
+    "vm_name": "<GKE_INTERMEDIATE_VM_NAME>",
+    "zone": "<GKE_INTERMEDIATE_VM_ZONE>",
+    "cluster_name": "<GKE_CLUSTER_NAME>",
+    "location": "<GKE_CLUSTER_LOCATION>",
+    "bucket": "<REGIONAL_BUCKET>",
+    "dataset": "<BQ_DATASET_PREFIX>",
+    "node_selector": "cloud.google.com/gke-accelerator-count=4,cloud.google.com/gke-nodepool=ct6e-pool,cloud.google.com/gke-tpu-accelerator=tpu-v6e-slice,cloud.google.com/gke-tpu-topology=2x2",
+    "resources_limits": "google.com/tpu=4",
+    "has_ssd": false,
+    "is_rapid_bucket": true
+  }
+]
+```
+
+## Step 2: Prerequisites Validation
+
+1.  **Local Authentication**: Ensure `gcloud` and `kubectl` are configured locally.
+2.  **Access Verification**: Verify direct path routing and quota if using RAPID buckets.
+3.  **Logging**: Ensure all execution commands are logged locally/remotely to `npi_commands.log`.
+
+## Step 3: Execute Orchestrated Benchmarks
+
+1.  **State Reset**:
+    *   **Clean Run / Retrigger**: If starting a completely fresh run or recovering from a corrupted state, clean up state files:
+        ```bash
+        rm -f ~/.npi/npi_run_state.json
+        ```
+        *(This forces a clean retry, terminating active containers, mounts, and jobs before sync and relaunch).*
+    *   **Resume Run**: To resume an active background run without starting over, keep the state file intact.
+
+2.  **Execute Orchestrator**:
+    Run the orchestrator script, specifying the benchmarks, image version, and iterations:
+    ```bash
+    python3 npi_orchestrator.py --benchmarks "<BENCHMARK_LIST>" --image-version <IMAGE_VERSION> --iterations <ITERATION_COUNT>
+    ```
+    Examples of `<BENCHMARK_LIST>`: `read_parallel,write_parallel` or `all`.
+
+## Step 4: Monitor and Safety Policies
+
+Observe logs for the following active safety policies enforced by the orchestrator:
+
+1.  **Inactivity Timeout**:
+    *   Logs are monitored continuously by `npi_orchestrator.py`.
+    *   If no log output is detected for 5 minutes (300s), the run is automatically aborted to protect against hangs.
+2.  **Disk Space Protection**:
+    *   If disk usage on the target storage buffer exceeds 85%, GCE runs are immediately aborted to prevent out-of-disk failures.
+3.  **GKE TPU Memory Management**:
+    *   Ensure `--use-memory-volumes` flag is enabled in `npi_gke.py` to mount buffers in RAM.
+    *   Avoid running file cache tests (`read_file_cache`) on TPU slices to prevent host Out-Of-Memory (OOM) situations.
+
+## Step 5: Verify BigQuery Results Export
+
+Upon successful completion, the orchestrator uploads FIO and Go Client JSON output to BigQuery.
+Verify the upload:
+1.  Locate the dataset: `<BQ_DATASET_PREFIX>` configured in `targets.json`.
+2.  Query the BigQuery table using the `bq` tool or the Google Cloud Console:
+    ```sql
+    SELECT COUNT(*) FROM `<PROJECT_ID>.<BQ_DATASET_PREFIX>_dataset.fio_results` WHERE image_version = '<IMAGE_VERSION>'
+    ```
+3.  Ensure the count matches the expected number of iterations and test runs.
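Review note on Step 1B of the execution skill above: since targets must come from the user and never be guessed, a pre-flight check on `targets.json` could catch missing fields and unfilled `<PLACEHOLDER>` values before any SSH connection is opened. The sketch below is an assumption-laden illustration. `validate_targets` is a hypothetical helper, and the required-key sets are inferred from the example JSON in the skill rather than from a documented schema.

```python
from typing import Any

# Key sets inferred from the example targets.json; treat them as assumptions.
COMMON_KEYS = {"name", "type", "vm_name", "zone", "bucket", "dataset",
               "has_ssd", "is_rapid_bucket"}
EXTRA_KEYS = {
    "gce": {"buffer_mount"},
    "gke": {"cluster_name", "location", "node_selector", "resources_limits"},
}


def validate_targets(targets: list) -> list:
    """Return human-readable problems found in a parsed targets list."""
    errors = []
    for index, target in enumerate(targets):
        label = target.get("name", f"#{index}")
        kind = target.get("type")
        if kind not in EXTRA_KEYS:
            errors.append(f"{label}: unknown type {kind!r}")
            continue
        missing = sorted((COMMON_KEYS | EXTRA_KEYS[kind]) - set(target))
        if missing:
            errors.append(f"{label}: missing keys {missing}")
        for key, value in target.items():
            # An unfilled template value such as "<GCE_VM_NAME>" means the
            # user has not supplied this detail yet.
            is_text: Any = isinstance(value, str)
            if is_text and value.startswith("<") and value.endswith(">"):
                errors.append(f"{label}: placeholder left in {key!r}")
    return errors
```

A caller would load the file with `json.load`, run the check, and ask the user for the missing details when the returned list is non-empty.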