Merged
Changes from 31 commits
42 commits
3421db4
feat(npi): modularize skills, add verification utility and orchestrat…
kislaykishore Jun 14, 2026
654fd00
feat(npi): add gcsfuse-npi-runner agent definition file
kislaykishore Jun 14, 2026
b27467c
feat(npi): update runner agent prompt to enforce user input targets
kislaykishore Jun 14, 2026
290ec4e
feat(npi): update runner agent prompt to check active run state on st…
kislaykishore Jun 14, 2026
55aaa24
feat(npi): dynamically resolve go version from GCSFuse go.mod during …
kislaykishore Jun 14, 2026
1ebe734
feat(npi): add failure analysis guidelines for conformance testing
kislaykishore Jun 14, 2026
9c1289f
docs(npi): exclude emulator_tests from conformance guide to prevent hang
kislaykishore Jun 15, 2026
89a2b57
perf(npi): reduce log inactivity check from 60 mins to 5 mins
kislaykishore Jun 15, 2026
d65eeb5
feat(npi): incorporate conformance and performance stall monitoring g…
kislaykishore Jun 15, 2026
a3b7725
docs(npi): separate GKE and GCE performance runs and remove direct co…
kislaykishore Jun 15, 2026
e5232c3
docs(npi): add cross-target comparison and advisory-only warnings to …
kislaykishore Jun 15, 2026
9bbe4d8
chore(npi): remove hardcoded personal VM, bucket, and path strings fr…
kislaykishore Jun 15, 2026
cff67dc
fix(npi): relax total_tests check in verify_agent_workflow.py to a mi…
kislaykishore Jun 15, 2026
66dda45
docs(npi): update runner agent prompt to support list of targets in a…
kislaykishore Jun 15, 2026
16ee1a4
docs(npi): update skills to support multiple target verification repo…
kislaykishore Jun 15, 2026
f251783
docs(npi): clarify that conformance tests are only supported on GCE t…
kislaykishore Jun 15, 2026
edd19bc
fix(npi): update benchmark query scripts and skill report templates t…
kislaykishore Jun 15, 2026
37e55e1
fix(npi): remove LRO ON TLS handshake failure special-case handler fr…
kislaykishore Jun 15, 2026
41b75b0
feat(npi): support tmpfs RAM disk fallback buffer of 600GB if SSDs ar…
kislaykishore Jun 15, 2026
57cf831
revert(npi): restore targets.json to its original template state from…
kislaykishore Jun 15, 2026
a474aac
docs(npi): allow concurrent target runs in different regions/hosts in…
kislaykishore Jun 15, 2026
5edd292
revert(npi): restore Sequential Execution constraint in runner agent …
kislaykishore Jun 15, 2026
2c05674
fix(npi): resolve code review feedback on metadata parsing and timeou…
kislaykishore Jun 15, 2026
80b764b
fix(npi): resolve code review comments regarding test count threshold…
kislaykishore Jun 15, 2026
8111636
docs(npi): specify Linux-only environment support in runner agent
kislaykishore Jun 15, 2026
89fa5f3
fix(npi): resolve code review comments on JSON type assertions, safe …
kislaykishore Jun 15, 2026
61cd1c3
fix(npi): resolve code review on GKE template metadata initialization…
kislaykishore Jun 15, 2026
bf914f1
fix(npi): sanitize inputs and whitelist parameters to prevent path tr…
kislaykishore Jun 16, 2026
2505004
fix(npi): validate all conformance results JSON files instead of only…
kislaykishore Jun 16, 2026
6c137c7
fix(npi): limit fallback tmpfs memory volume to 500GB to leave system…
kislaykishore Jun 16, 2026
70ae2cd
fix(npi): use startswith to precisely match billing-project in GKE mo…
kislaykishore Jun 16, 2026
6fba98e
fix(npi): support comma-separated GKE mount options and always verify…
kislaykishore Jun 16, 2026
15fbb5b
fix(npi): normalize whitespace around '=' in GKE mount options to pre…
kislaykishore Jun 16, 2026
76b6aef
style(npi): align codebase with PEP 8 import guidelines in build_imag…
kislaykishore Jun 16, 2026
2edfc2f
Refactor NPI validation skills and agent specs to support reusing exi…
kislaykishore Jun 20, 2026
18e72d1
Remove automatic billing-project injection from GKE job orchestrator
kislaykishore Jun 20, 2026
71a3dba
docs/feat(npi): document mandatory integration test flags and add rob…
kislaykishore Jun 20, 2026
76cff1c
docs(npi): codify interactive plan checkpoint and smoke-test verifica…
kislaykishore Jun 20, 2026
ec1f2de
docs(npi): enhance Interactive Plan Checkpoint to require detailed st…
kislaykishore Jun 20, 2026
edb6cfa
docs(npi): document host-info run and correct BigQuery verification t…
kislaykishore Jun 20, 2026
254bfe5
docs(npi): add host-info query examples and hardware profile reportin…
kislaykishore Jun 20, 2026
2904bdc
feat(npi): refactor run_conformance.sh to be fully dynamic and suppor…
kislaykishore Jun 20, 2026
48 changes: 48 additions & 0 deletions npi/.gemini/agents/gcsfuse-npi-runner.md
@@ -0,0 +1,48 @@
---
name: gcsfuse-npi-runner
description: "Subagent that orchestrates and executes the end-to-end GCSFuse NPI validation pipeline sequentially: Conformance Testing -> Performance Benchmarking -> Analysis & Report -> Remediation."
enable_write_tools: true
enable_subagent_tools: true
enable_mcp_tools: true
---

# GCSFuse NPI Runner Agent

You are a specialized GCSFuse NPI Runner agent. Your mission is to execute the complete Node Platform Integration (NPI) validation workflow sequentially against GCE VM and GKE cluster targets.

## Workflow Sequence
You must run the workflow stages strictly in the following sequential order:
1. **SSH Connection Prep**: Clean up any stale sockets and establish persistent multiplexed SSH connections.
2. **Conformance Testing**: Clone the GCSFuse repo and execute the integration test suite on the target VM, producing `conformance_results.json`.
3. **Performance Benchmarking**: Build/push benchmarking images, run the benchmark suite via `npi_orchestrator.py`, and upload metrics to BigQuery.
4. **Analysis**: Extract metrics, compare throughput/latency against baselines, and compile `npi_validation_report.md`.
5. **Remediation**: Analyze conformance failures and configuration mismatches, producing `npi_remediation_plan.md`.
6. **Verification**: Execute `verify_agent_workflow.py` to programmatically verify all deliverables are valid.

## Key Constraints
- **Sequential Execution**: Do not run conformance testing and performance benchmarking concurrently on target VMs to avoid resource contention.
- **Linux Environment Only**: The validation runner, scripts, and skills are designed and supported exclusively for Linux operating systems. Do not attempt to run or adapt commands for other environments (e.g., macOS or Windows).
- **Socket Cleanup**: Stale socket files (`~/.ssh/sockets/<target>.sock`) must be checked and deleted before establishing master SSH connections.
- **Agnostic Code**: Do not hardcode VM or cluster names in execution scripts. Keep configurations dynamic by reading them from `targets.json`.
- **User-Defined Targets**: You must not guess or auto-discover target GCE VM names, GKE cluster names, or GCS bucket names. You must explicitly extract these details from the user's prompt or request and write them to `targets.json`.
- **Check Active State**: Before establishing SSH connections or starting a benchmark run, check whether `~/.npi/npi_run_state.json` exists locally. If it exists and records target statuses from an active or previous run (e.g. `RUNNING` or `SUCCESS`), notify the user of that state and ask whether they would like to re-attach/resume or trigger a clean reset (using `--reset`).
- **Analyze Permission Failures**: Conformance tests are expected to have failures due to intentionally restricted permissions. Do not block the pipeline trying to resolve these or force all tests to pass. Instead, analyze the failure reasons (e.g., identify which service accounts lack which GCS permissions) and detail them clearly in `npi_validation_report.md`.
- **Stall Monitoring**: Monitor both conformance tests and performance benchmarks for stalls. For conformance tests, verify that `~/integration_tests.log` size increases. If the log size remains unchanged for more than 5 minutes while the `go test` process is running, consider it stalled, immediately terminate the run, force-unmount leftovers, clean up temp directories to reclaim inodes, and document the details. For performance benchmarks, ensure `npi_orchestrator.py` has `MAX_INACTIVITY_SECS` configured appropriately (typically 14400s or 4 hours for full runs) so it auto-aborts and reports hangs.
- **No Automated Remediation**: Do not automatically perform or execute any remediation steps on the GCE VMs or GKE nodes. Document findings and suggest remediation recommendations in `npi_remediation_plan.md` as an advisory, but do not apply or execute them.
- **Independent Target Evaluation**: Unless otherwise specified, multiple benchmark runs executed together are separate and not directly comparable. Do not compare their metrics directly against each other. Present the performance results for each target in separate sections, evaluating each target independently against its own baseline.
- **RAM Buffer Fallback**: For targets without local SSDs (`has_ssd: false`), verify that the VM host has at least 600GB of RAM (minimum 550GB detected due to kernel overhead). If so, mount a 500GB memory volume (`tmpfs`) at the configured `buffer_mount` directory as the performance test buffer using the setup script. This leaves safe memory headroom for OS and daemon processes.
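
The conformance half of the Stall Monitoring constraint can be sketched as a small size-based detector. This is a minimal illustration, not code from this PR: the class name `LogStallDetector` and its parameters are hypothetical, and it assumes the check runs where `~/integration_tests.log` is readable (for example on the target VM). It only detects the stall; terminating `go test`, unmounting, and cleanup remain separate steps.

```python
import os
import time

STALL_THRESHOLD_SECS = 5 * 60  # 5-minute window from the Stall Monitoring constraint


class LogStallDetector:
    """Hypothetical helper: reports a stall when a log file stops growing."""

    def __init__(self, path, threshold_secs=STALL_THRESHOLD_SECS, clock=time.monotonic):
        self.path = path
        self.threshold_secs = threshold_secs
        self.clock = clock            # injectable so the logic can be tested without waiting
        self.last_size = -1
        self.last_change = clock()

    def poll(self):
        """Return True if the log size has been unchanged for longer than the threshold."""
        try:
            size = os.path.getsize(self.path)
        except OSError:
            size = -1                 # a missing log is treated as "no growth"
        now = self.clock()
        if size != self.last_size:
            # The log grew (or appeared): record the new size and restart the window.
            self.last_size = size
            self.last_change = now
            return False
        return (now - self.last_change) > self.threshold_secs
```

A caller would invoke `poll()` periodically (for example once a minute) while the `go test` process is alive and treat the first `True` as the signal to abort and document the stall.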

## Required Input Parameters
Before starting execution, extract the list of target validation environments from the user's request:
- **Validation Targets**: A list of one or more targets to run. Each target can be GCE (VM name, zone, bucket, BQ dataset, buffer mount path, SSD options) or GKE (cluster name, location, VM name, zone, bucket, BQ dataset, node selector, etc.), in any combination (e.g., multiple GCE, multiple GKE, or a mix of both).

If the target configuration is missing or ambiguous in the request, ask the user to specify it. Once collected, write the entire list of targets to `targets.json` to parameterize the execution.
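
As an illustration of "ask rather than guess", the sketch below refuses to write `targets.json` when a target is incomplete. The required-key sets are an assumption inferred from the `targets.json` example in the benchmark-suite-execution skill, and `missing_fields` / `write_targets` are hypothetical helpers, not functions from this PR.

```python
import json

# Assumed required keys, inferred from the targets.json example in this PR.
COMMON_KEYS = {"name", "type", "vm_name", "zone", "bucket", "dataset"}
GKE_KEYS = {"cluster_name", "location"}


def missing_fields(target):
    """Return the sorted required keys that are absent or empty in one target entry."""
    required = set(COMMON_KEYS)
    if target.get("type") == "gke":
        required |= GKE_KEYS
    return sorted(k for k in required if target.get(k) in (None, ""))


def write_targets(targets, path="targets.json"):
    """Write the target list to `path`; raise ValueError instead of guessing values."""
    if not targets:
        raise ValueError("no targets supplied; ask the user for them")
    for target in targets:
        if target.get("type") not in ("gce", "gke"):
            raise ValueError(f"unknown target type: {target.get('type')!r}")
        gaps = missing_fields(target)
        if gaps:
            raise ValueError(f"target {target.get('name')!r} is missing: {', '.join(gaps)}")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(targets, fh, indent=2)
```

The `ValueError` message lists exactly which fields to ask the user for, which keeps the follow-up question specific.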

## Skills & Methods
Refer to the modular skills in the workspace for step-by-step guidance:
- SSH Connection: `.gemini/skills/ssh-connection-management/SKILL.md`
- Conformance: `.gemini/skills/conformance-testing/SKILL.md`
- Build & Setup: `.gemini/skills/benchmark-build-setup/SKILL.md`
- Benchmarking: `.gemini/skills/benchmark-suite-execution/SKILL.md`
- Analysis: `.gemini/skills/analysis-report-generation/SKILL.md`
- Remediation: `.gemini/skills/remediation-advisor/SKILL.md`
97 changes: 97 additions & 0 deletions npi/.gemini/skills/analysis-report-generation/SKILL.md
@@ -0,0 +1,97 @@
---
name: analysis-report-generation
description: Guides on querying benchmark results from BigQuery, comparing performance metrics (throughput/latency) against baselines, and generating a structured validation report.
---

# GCSFuse NPI Analysis & Report Generation

This skill guides you through querying benchmark results from BigQuery tables, performing analysis on throughput and latency trends against historical baselines, verifying machine type configuration optimizations, and compiling the findings into a standard `npi_validation_report.md`.

## Prerequisites

1. **GCP/BigQuery Access**: The environment must have access to the BigQuery dataset containing the benchmark outputs.
2. **Baseline Datasets**: Ensure you know the baseline dataset ID (e.g. `npi_benchmarks_baseline_lro_on` or similar) and the newly generated run's dataset ID.
3. **GCSFuse Source Code**: Access to GCSFuse code is required to inspect `params.yaml` for machine type verification.

## Step-by-Step Procedure

### Step 1: Query BigQuery Results

Retrieve performance data from the respective benchmark tables (e.g., `go_client_read_http1`, `go_client_read_grpc`, `fio_write_grpc`).

> [!IMPORTANT]
> **JSON Key Spacing**: In the FIO JSON output, the version is stored under the key `"fio version"` (with a space). Always query it using the quoted format: `JSON_VALUE(fio_json_output, '$."fio version"')` to avoid returning `NULL`.

Run queries using the `bq` CLI or a Python BigQuery client:
```bash
bq query --project_id=<PROJECT_ID> --use_legacy_sql=false \
"SELECT
run_timestamp,
iteration,
JSON_VALUE(fio_json_output, '\$.\"fio version\"') AS fio_version,
AVG(SAFE_CAST(JSON_VALUE(job.read.bw) AS FLOAT64)) * 1024.0 / 1000000.0 AS avg_read_bw_mb,
AVG(SAFE_CAST(JSON_VALUE(job.write.bw) AS FLOAT64)) * 1024.0 / 1000000.0 AS avg_write_bw_mb,
AVG(SAFE_CAST(JSON_VALUE(job.read.clat_ns.mean) AS FLOAT64)) / 1000000.0 AS avg_read_clat_ms
FROM
\`<PROJECT_ID>.<DATASET_ID>.<TABLE_ID>\`,
UNNEST(JSON_EXTRACT_ARRAY(fio_json_output.jobs)) AS job
GROUP BY 1, 2, 3
ORDER BY run_timestamp DESC"
```

### Step 2: Compare Against Baselines

Execute comparison scripts (e.g. `query_results.py`) or calculate the percentage difference in throughput/latency between the baseline dataset and the new run's dataset.

> [!IMPORTANT]
> **No Cross-Target Comparisons**: Performance results from different targets (e.g., GKE Node runs vs GCE VM runs) represent distinct platforms and are not directly comparable. Do not compare them against each other, compute cross-target deltas, or label differences between them as regressions. Compare each environment exclusively against its own respective historical baseline.

Example Comparison Matrix:
| Protocol | Baseline Throughput (MB/s) | New Run Throughput (MB/s) | Delta (%) | Status |
| :--- | :--- | :--- | :--- | :--- |
| HTTP/1.1 | 1240.5 | 1235.2 | -0.4% | Neutral |
| gRPC | 3450.0 | 2890.5 | -16.2% | **REGRESSION** |
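
The Delta column is the relative change of the new run against its own baseline. A minimal sketch follows; the -5% regression cutoff and the function names are assumptions for illustration, since this skill does not state a threshold. The sign convention here is for throughput, where lower is worse; for latency the comparison would be inverted.

```python
REGRESSION_THRESHOLD_PCT = -5.0  # hypothetical cutoff; not specified in this skill


def percent_delta(baseline, current):
    """Relative change of `current` versus `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline * 100.0


def classify_throughput(baseline, current, threshold_pct=REGRESSION_THRESHOLD_PCT):
    """Return (delta rounded to 0.1%, status) for a throughput metric."""
    delta = percent_delta(baseline, current)
    status = "REGRESSION" if delta <= threshold_pct else "Neutral"
    return round(delta, 1), status


# Rows from the example matrix above:
print(classify_throughput(1240.5, 1235.2))  # (-0.4, 'Neutral')
print(classify_throughput(3450.0, 2890.5))  # (-16.2, 'REGRESSION')
```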

### Step 3: Verify Machine Type Configuration

Verify if the GCE VM or GKE node machine type (e.g., `c4-standard-96`) is classified under the high-performance machine types in the main GCSFuse repository:
1. Locate `params.yaml` in the cloned GCSFuse repository.
2. Search for the machine family or type.
3. If missing, note it in the validation report as a required follow-up task (PR creation to add family).
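
The lookup in steps 1 and 2 can be approximated with a plain text search. The layout of `params.yaml` is not described here, so this sketch deliberately avoids parsing YAML and only reports which lines mention the machine family; `machine_family` and `find_mentions` are hypothetical helpers.

```python
import re


def machine_family(machine_type):
    """Return the family prefix of a machine type, e.g. 'c4-standard-96' -> 'c4'."""
    return machine_type.split("-", 1)[0]


def find_mentions(params_text, machine_type):
    """Return 1-based line numbers of `params_text` that mention the machine family.

    The family must appear as a whole token, so 'c4' does not match 'c4d'.
    """
    family = machine_family(machine_type)
    pattern = re.compile(r"(?<![A-Za-z0-9])" + re.escape(family) + r"(?![A-Za-z0-9])")
    return [
        number
        for number, line in enumerate(params_text.splitlines(), 1)
        if pattern.search(line)
    ]
```

An empty result corresponds to the "missing" case in step 3; a non-empty result still needs a manual look at the matching lines to confirm they belong to the high-performance list.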

### Step 4: Generate `npi_validation_report.md`

Compile the queried results, baseline comparison, and machine family configuration verification into `npi_validation_report.md`.

For each target validation environment executed (GCE VM or GKE Cluster), create a separate section and performance table to isolate their metrics and prevent incorrect direct comparisons.

The report must follow this structure:
```markdown
# GCSFuse NPI Validation Report

## Executive Summary
[Brief description of whether the run meets performance criteria and if any regressions/failures were detected.]

## Run Details
- **Timestamp**: [ISO 8601 Timestamp]
- **Target Platforms**: [List of all target names, e.g. GCE VM target-1, GKE Cluster target-2, etc.]

## Target Performance Results

### [TARGET_NAME_1] (Platform Type, e.g., GCE VM)
- **GCSFuse Version**: [e.g. v3.9.0]
- **Target Bucket**: [RAPID / Regional]
- **Performance Metrics vs Baseline**:
| Benchmark / Protocol | Baseline (Version) | Current Run (Version) | Delta (%) | Status |
|---|---|---|---|---|
| HTTP1 Read | 1250 MB/s | 1240 MB/s | -0.8% | PASS |
| gRPC Read | 3500 MB/s | 2800 MB/s | -20.0% | FAIL (Regression) |

## High-Performance Machine Type Classification
- **Machine Type Used**: `c4-standard-96`
- **Configured in `params.yaml`?**: [Yes/No]
- **Action Required**: [None / Create PR in GCSFuse repo to add the machine type]

## Observations & Issues
- [Detail any errors observed, e.g., TLS Handshake Errors, GKE OOMs, Direct Path fallback issues.]
```
75 changes: 75 additions & 0 deletions npi/.gemini/skills/benchmark-build-setup/SKILL.md
@@ -0,0 +1,75 @@
---
name: benchmark-build-setup
description: Guides on checking out GCSFuse, configuring targets (Docker/RAM disks), and building/pushing benchmarking images to Artifact Registry.
---

# Benchmark Build and Setup for GCSFuse NPI

This skill guides you through checking out the GCSFuse repository, configuring target VMs with storage buffers and Docker, and building/pushing benchmark images to Google Artifact Registry.

## Step 1: Clone/Prepare GCSFuse & Custom Matrices

1. **Clone / Verify GCSFuse Repository**:
Verify that the GCSFuse repository or sub-module is checked out locally to the expected branch or tag.
2. **Smoke-Test Matrix Customization**:
If running quick verification or smoke tests, edit the matrix files to execute only minimal iterations:
* Edit: `fio/read_matrix.csv`
* Edit: `fio/write_matrix.csv`
*(Note: Remember to run `git restore fio/read_matrix.csv fio/write_matrix.csv` after the images are built and pushed to avoid checking in modified matrices).*

## Step 2: Configure Target VMs

Configure the storage buffer and Docker workspace on each target VM using the established SSH master connection socket.

### A. Configure Storage Buffer
* **Unified Buffer Setup (Local SSD or RAM Fallback)**:
Execute `raid0-script.sh` on the target VM, passing the target mount path (from `targets.json`'s `buffer_mount`) as the argument. The script will automatically build a RAID0 array from local SSDs if present. If no local SSDs are found, it will verify that the host has at least 600GB of RAM (minimum 550GB detected due to kernel reservations) and mount a 500GB memory volume (`tmpfs`) at the mount path instead:
```bash
# Copy script to target (scp's -S flag selects the ssh program, so the control socket is passed via ControlPath)
scp -o ControlPath=~/.ssh/sockets/<TARGET_NAME>.sock -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/google_compute_engine raid0-script.sh <SSH_USER>@nic0.<VM_NAME>.<ZONE>.c.<PROJECT_ID>.internal.gcpnode.com:~/raid0-script.sh

# Run script with the target mount path argument (e.g. /mnt/lssd)
ssh -S ~/.ssh/sockets/<TARGET_NAME>.sock -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/google_compute_engine <SSH_USER>@nic0.<VM_NAME>.<ZONE>.c.<PROJECT_ID>.internal.gcpnode.com "bash ~/raid0-script.sh <SSD_MOUNT_PATH>"
```
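
The buffer decision described above can be summarized in a few lines. This is a sketch of the documented rule only, not the contents of `raid0-script.sh`; the function names are hypothetical, and treating the 550GB floor as GiB derived from `MemTotal` in `/proc/meminfo` is an assumption.

```python
MIN_DETECTED_RAM_GB = 550   # detected floor for a nominal 600GB host
TMPFS_SIZE_GB = 500         # fallback tmpfs size, leaving headroom for the OS


def mem_total_gb(meminfo_text):
    """Parse MemTotal (reported in kB) from /proc/meminfo text and return it in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) / (1024 * 1024)
    raise ValueError("MemTotal not found in meminfo text")


def choose_buffer(has_ssd, detected_ram_gb):
    """Return 'raid0' for local SSDs, a tmpfs spec for large-RAM hosts, else raise."""
    if has_ssd:
        return "raid0"
    if detected_ram_gb >= MIN_DETECTED_RAM_GB:
        return f"tmpfs:{TMPFS_SIZE_GB}G"
    raise RuntimeError(
        f"no local SSDs and only {detected_ram_gb:.0f}GB RAM detected; "
        f"need at least {MIN_DETECTED_RAM_GB}GB for the tmpfs fallback"
    )
```

On a Linux host, `mem_total_gb(open("/proc/meminfo").read())` supplies the second argument.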

### B. Install Docker & Configure Permissions
Install Docker on the target VM and add the SSH user to the docker group:
```bash
ssh -S ~/.ssh/sockets/<TARGET_NAME>.sock -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/google_compute_engine <SSH_USER>@nic0.<VM_NAME>.<ZONE>.c.<PROJECT_ID>.internal.gcpnode.com "curl -fsSL https://get.docker.com -o get-docker.sh && sudo sh get-docker.sh && sudo usermod -aG docker \$USER && rm get-docker.sh"
```

**CRITICAL**: Since group memberships are only evaluated at session startup, recreate the SSH multiplexing socket to apply the docker group changes:
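1. Close the master connection: `ssh -S ~/.ssh/sockets/<TARGET_NAME>.sock -O exit <SSH_USER>@nic0.<VM_NAME>.<ZONE>.c.<PROJECT_ID>.internal.gcpnode.com`, then remove any leftover socket file with `rm -f ~/.ssh/sockets/<TARGET_NAME>.sock` (deleting the file alone leaves the old master process running).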
2. Re-establish the connection socket (using the `ssh-connection-management` skill).

### C. Configure Registry Access on Target
Enable the target VM docker daemon to pull images from Artifact Registry:
```bash
ssh -S ~/.ssh/sockets/<TARGET_NAME>.sock -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/google_compute_engine <SSH_USER>@nic0.<VM_NAME>.<ZONE>.c.<PROJECT_ID>.internal.gcpnode.com "gcloud auth configure-docker us-docker.pkg.dev -q"
```

## Step 3: Build & Push Benchmark Images

Build and push GCSFuse benchmarking images (with FIO/Go-Client inside) to your Google Artifact Registry:

1. **Configure Registry Auth Locally**:
Ensure local credentials can write to the Artifact Registry:
```bash
gcloud auth configure-docker us-docker.pkg.dev
```
2. **Execute Build Script**:
```bash
python3 build_images.py --project <PROJECT_ID> --image-version <IMAGE_VERSION> --gcsfuse-version <GCSFUSE_VERSION>
```
3. **Restore Matrices**:
If you customized matrix files in Step 1, revert the local changes:
```bash
git restore fio/read_matrix.csv fio/write_matrix.csv
```

## Step 4: Verify Image Availability

Verify that the benchmark image is successfully pushed and available in Artifact Registry:
```bash
gcloud artifacts docker images list us-docker.pkg.dev/<PROJECT_ID>/gcsfuse-npi-images --include-tags --format='value(package,tags)' | grep "<IMAGE_VERSION>"
```
98 changes: 98 additions & 0 deletions npi/.gemini/skills/benchmark-suite-execution/SKILL.md
@@ -0,0 +1,98 @@
---
name: benchmark-suite-execution
description: Guides on configuring, executing, monitoring, and exporting results of GCSFuse NPI benchmark runs on GCE and GKE.
---

# Benchmark Suite Execution for GCSFuse NPI

This skill guides you through defining target environments, executing benchmarks concurrently on GCE and GKE, monitoring execution for hangs/failures, and verifying BigQuery export status.

## Step 1: Configuration & Targets Setup

Before starting a run, collect target details and populate `targets.json` in the root configuration directory.

### A. Collect Inputs
1. **Target Platforms**: Determine if running on GCE VM, GKE cluster, or both.
2. **GCE VM Details** (if applicable): VM Name, zone, SSD presence, SSD mount path (e.g. `/mnt/lssd` or `/tmp/npi_buffer`), and RAPID bucket usage.
3. **GKE Details** (if applicable): Intermediate VM details, Cluster name, region/zone, SSD/RAM configuration, RAPID bucket usage, node selectors, resource limits.
4. **GCS Buckets**: Target regional and/or RAPID (zonal) bucket names.
5. **GCP Project**: GCP Project ID (e.g. `gcs-fuse-test`).

### B. Configure `targets.json`
Populate `targets.json` with the corresponding target details. Format:
```json
[
{
"name": "gce-c4-ssd",
"type": "gce",
"vm_name": "<GCE_VM_NAME>",
"zone": "<GCE_ZONE>",
"bucket": "<REGIONAL_BUCKET>",
"dataset": "<BQ_DATASET_PREFIX>",
"buffer_mount": "<SSD_MOUNT_PATH>",
"has_ssd": true,
"is_rapid_bucket": false
},
{
"name": "gke-tpu-slice",
"type": "gke",
"vm_name": "<GKE_INTERMEDIATE_VM_NAME>",
"zone": "<GKE_INTERMEDIATE_VM_ZONE>",
"cluster_name": "<GKE_CLUSTER_NAME>",
"location": "<GKE_CLUSTER_LOCATION>",
"bucket": "<REGIONAL_BUCKET>",
"dataset": "<BQ_DATASET_PREFIX>",
"node_selector": "cloud.google.com/gke-accelerator-count=4,cloud.google.com/gke-nodepool=ct6e-pool,cloud.google.com/gke-tpu-accelerator=tpu-v6e-slice,cloud.google.com/gke-tpu-topology=2x2",
"resources_limits": "google.com/tpu=4",
"has_ssd": false,
"is_rapid_bucket": true
}
]
```

## Step 2: Prerequisites Validation

1. **Local Authentication**: Ensure `gcloud` and `kubectl` are configured locally.
2. **Access Verification**: Verify direct path routing and quota if using RAPID buckets.
3. **Logging**: Ensure all execution commands are logged locally/remotely to `npi_commands.log`.

## Step 3: Execute Orchestrated Benchmarks

1. **State Reset**:
* **Clean Run / Retrigger**: If starting a completely fresh run or recovering from a corrupted state, clean up state files:
```bash
rm -f ~/.npi/npi_run_state.json
```
*(This forces a clean retry, terminating active containers, mounts, and jobs before sync and relaunch).*
* **Resume Run**: To resume an active background run without starting over, keep the state file intact.

2. **Execute Orchestrator**:
Run the orchestrator script, specifying the benchmarks, image version, and iterations:
```bash
python3 npi_orchestrator.py --benchmarks "<BENCHMARK_LIST>" --image-version <IMAGE_VERSION> --iterations <ITERATION_COUNT>
```
Examples of `<BENCHMARK_LIST>`: `read_parallel,write_parallel` or `all`.
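
Before choosing between a clean run and a resume in the State Reset step, it can help to summarize what the state file records. The schema of `npi_run_state.json` is not documented here, so the sketch below assumes a top-level JSON object mapping each target name to an object with a `status` field; both that layout and the `load_run_state` helper are hypothetical.

```python
import json
import os

STATE_PATH = os.path.expanduser("~/.npi/npi_run_state.json")
REPORTABLE_STATUSES = {"RUNNING", "SUCCESS"}  # statuses named in the runner agent prompt


def load_run_state(path=STATE_PATH):
    """Return {target_name: status} for reportable targets, or {} if no usable state exists.

    Assumes (hypothetically) a layout like {"gce-c4-ssd": {"status": "RUNNING"}, ...}.
    """
    try:
        with open(path, encoding="utf-8") as fh:
            state = json.load(fh)
    except (OSError, ValueError):
        # Missing, unreadable, or corrupted file: nothing to resume.
        return {}
    if not isinstance(state, dict):
        return {}
    return {
        name: entry["status"]
        for name, entry in state.items()
        if isinstance(entry, dict) and entry.get("status") in REPORTABLE_STATUSES
    }
```

A non-empty result is the cue to show the user the recorded statuses and ask whether to resume or reset, rather than deleting the file unprompted.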

## Step 4: Monitor and Safety Policies

Observe logs for the following active safety policies enforced by the orchestrator:

1. **Inactivity Timeout**:
* Logs are monitored continuously by `npi_orchestrator.py`.
* If no log output is detected for 4 hours (14400s), the run is automatically aborted to protect against hangs.
2. **Disk Space Protection**:
* If disk usage on the target storage buffer exceeds 85%, GCE runs are immediately aborted to prevent out-of-disk failures.
3. **GKE TPU Memory Management**:
* Ensure `--use-memory-volumes` flag is enabled in `npi_gke.py` to mount buffers in RAM.
* Avoid running file cache tests (`read_file_cache`) on TPU slices to prevent host Out-Of-Memory (OOM) situations.
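
The Disk Space Protection policy amounts to a usage-percentage check against the buffer mount. The following is a sketch of that check under stated assumptions, not the orchestrator's actual implementation; `percent_used` and `buffer_over_limit` are hypothetical names, and it assumes the check runs on the host where the buffer is mounted.

```python
import shutil

DISK_USAGE_LIMIT_PCT = 85.0  # abort threshold from the policy above


def percent_used(total_bytes, used_bytes):
    """Return used space as a percentage of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes * 100.0


def buffer_over_limit(mount_path, limit_pct=DISK_USAGE_LIMIT_PCT):
    """Return True if the filesystem holding `mount_path` is fuller than `limit_pct`."""
    usage = shutil.disk_usage(mount_path)  # named tuple: total, used, free (bytes)
    return percent_used(usage.total, usage.used) > limit_pct
```

For example, `buffer_over_limit("/mnt/lssd")` would be evaluated periodically during a GCE run, using the `buffer_mount` path from `targets.json`.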

## Step 5: Verify BigQuery Results Export

Upon successful completion, the orchestrator uploads FIO and Go Client JSON output to BigQuery.
Verify the upload:
1. Locate the dataset: `<BQ_DATASET_PREFIX>` configured in `targets.json`.
2. Query the BigQuery table using the `bq` tool or the Google Cloud Console:
```sql
SELECT COUNT(*) FROM `<PROJECT_ID>.<BQ_DATASET_PREFIX>_dataset.fio_results` WHERE image_version = '<IMAGE_VERSION>'
```
3. Ensure the count matches the expected number of iterations and test runs.