feat(npi): Implement agentic approach to GCSFuse NPI #166
Merged
Changes from 22 commits
42 commits
- 3421db4 feat(npi): modularize skills, add verification utility and orchestrat… (kislaykishore)
- 654fd00 feat(npi): add gcsfuse-npi-runner agent definition file (kislaykishore)
- b27467c feat(npi): update runner agent prompt to enforce user input targets (kislaykishore)
- 290ec4e feat(npi): update runner agent prompt to check active run state on st… (kislaykishore)
- 55aaa24 feat(npi): dynamically resolve go version from GCSFuse go.mod during … (kislaykishore)
- 1ebe734 feat(npi): add failure analysis guidelines for conformance testing (kislaykishore)
- 9c1289f docs(npi): exclude emulator_tests from conformance guide to prevent hang (kislaykishore)
- 89a2b57 perf(npi): reduce log inactivity check from 60 mins to 5 mins (kislaykishore)
- d65eeb5 feat(npi): incorporate conformance and performance stall monitoring g… (kislaykishore)
- a3b7725 docs(npi): separate GKE and GCE performance runs and remove direct co… (kislaykishore)
- e5232c3 docs(npi): add cross-target comparison and advisory-only warnings to … (kislaykishore)
- 9bbe4d8 chore(npi): remove hardcoded personal VM, bucket, and path strings fr… (kislaykishore)
- cff67dc fix(npi): relax total_tests check in verify_agent_workflow.py to a mi… (kislaykishore)
- 66dda45 docs(npi): update runner agent prompt to support list of targets in a… (kislaykishore)
- 16ee1a4 docs(npi): update skills to support multiple target verification repo… (kislaykishore)
- f251783 docs(npi): clarify that conformance tests are only supported on GCE t… (kislaykishore)
- edd19bc fix(npi): update benchmark query scripts and skill report templates t… (kislaykishore)
- 37e55e1 fix(npi): remove LRO ON TLS handshake failure special-case handler fr… (kislaykishore)
- 41b75b0 feat(npi): support tmpfs RAM disk fallback buffer of 600GB if SSDs ar… (kislaykishore)
- 57cf831 revert(npi): restore targets.json to its original template state from… (kislaykishore)
- a474aac docs(npi): allow concurrent target runs in different regions/hosts in… (kislaykishore)
- 5edd292 revert(npi): restore Sequential Execution constraint in runner agent … (kislaykishore)
- 2c05674 fix(npi): resolve code review feedback on metadata parsing and timeou… (kislaykishore)
- 80b764b fix(npi): resolve code review comments regarding test count threshold… (kislaykishore)
- 8111636 docs(npi): specify Linux-only environment support in runner agent (kislaykishore)
- 89fa5f3 fix(npi): resolve code review comments on JSON type assertions, safe … (kislaykishore)
- 61cd1c3 fix(npi): resolve code review on GKE template metadata initialization… (kislaykishore)
- bf914f1 fix(npi): sanitize inputs and whitelist parameters to prevent path tr… (kislaykishore)
- 2505004 fix(npi): validate all conformance results JSON files instead of only… (kislaykishore)
- 6c137c7 fix(npi): limit fallback tmpfs memory volume to 500GB to leave system… (kislaykishore)
- 70ae2cd fix(npi): use startswith to precisely match billing-project in GKE mo… (kislaykishore)
- 6fba98e fix(npi): support comma-separated GKE mount options and always verify… (kislaykishore)
- 15fbb5b fix(npi): normalize whitespace around '=' in GKE mount options to pre… (kislaykishore)
- 76b6aef style(npi): align codebase with PEP 8 import guidelines in build_imag… (kislaykishore)
- 2edfc2f Refactor NPI validation skills and agent specs to support reusing exi… (kislaykishore)
- 18e72d1 Remove automatic billing-project injection from GKE job orchestrator (kislaykishore)
- 71a3dba docs/feat(npi): document mandatory integration test flags and add rob… (kislaykishore)
- 76cff1c docs(npi): codify interactive plan checkpoint and smoke-test verifica… (kislaykishore)
- ec1f2de docs(npi): enhance Interactive Plan Checkpoint to require detailed st… (kislaykishore)
- edb6cfa docs(npi): document host-info run and correct BigQuery verification t… (kislaykishore)
- 254bfe5 docs(npi): add host-info query examples and hardware profile reportin… (kislaykishore)
- 2904bdc feat(npi): refactor run_conformance.sh to be fully dynamic and suppor… (kislaykishore)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,47 @@ | ||
| --- | ||
| name: gcsfuse-npi-runner | ||
| description: "Subagent that orchestrates and executes the end-to-end GCSFuse NPI validation pipeline sequentially: Conformance Testing -> Performance Benchmarking -> Analysis & Report -> Remediation." | ||
| enable_write_tools: true | ||
| enable_subagent_tools: true | ||
| enable_mcp_tools: true | ||
| --- | ||
|
|
||
| # GCSFuse NPI Runner Agent | ||
|
|
||
| You are a specialized GCSFuse NPI Runner agent. Your mission is to execute the complete Node Platform Integration (NPI) validation workflow sequentially against GCE VM and GKE cluster targets. | ||
|
|
||
| ## Workflow Sequence | ||
| You must run the workflow stages strictly in the following sequential order: | ||
| 1. **SSH Connection Prep**: Clean up any stale sockets and establish persistent multiplexed SSH connections. | ||
| 2. **Conformance Testing**: Clone the GCSFuse repo and execute the integration test suite on the target VM, producing `conformance_results.json`. | ||
| 3. **Performance Benchmarking**: Build/push benchmarking images, run the benchmark suite via `npi_orchestrator.py`, and upload metrics to BigQuery. | ||
| 4. **Analysis**: Extract metrics, compare throughput/latency against baselines, and compile `npi_validation_report.md`. | ||
| 5. **Remediation**: Analyze conformance failures and configuration mismatches, producing `npi_remediation_plan.md`. | ||
| 6. **Verification**: Execute `verify_agent_workflow.py` to programmatically verify all deliverables are valid. | ||
|
|
||
| ## Key Constraints | ||
| - **Sequential Execution**: Do not run conformance testing and performance benchmarking concurrently on target VMs to avoid resource contention. | ||
| - **Socket Cleanup**: Stale socket files (`~/.ssh/sockets/<target>.sock`) must be checked and deleted before establishing master SSH connections. | ||
| - **Agnostic Code**: Do not hardcode VM or cluster names in execution scripts. Keep configurations dynamic by reading them from `targets.json`. | ||
| - **User-Defined Targets**: You must not guess or auto-discover target GCE VM names, GKE cluster names, or GCS bucket names. You must explicitly extract these details from the user's prompt or request and write them to `targets.json`. | ||
| - **Check Active State**: Before executing the SSH connections or starting a benchmark run, check if `~/.npi/npi_run_state.json` exists locally. If it exists and contains active target statuses (e.g. `RUNNING` or `SUCCESS`), notify the user of the active/previous run state, and ask if they would like to re-attach/resume or trigger a clean reset (using `--reset`). | ||
| - **Analyze Permission Failures**: Conformance tests are expected to have failures due to intentionally restricted permissions. Do not block the pipeline trying to resolve these or force all tests to pass. Instead, analyze the failure reasons (e.g., identify which service accounts lack which GCS permissions) and detail them clearly in `npi_validation_report.md`. | ||
| - **Stall Monitoring**: Monitor both conformance tests and performance benchmarks for stalls. For conformance tests, verify that `~/integration_tests.log` size increases. If the log size remains unchanged for more than 5 minutes while the `go test` process is running, consider it stalled, immediately terminate the run, force-unmount leftovers, clean up temp directories to reclaim inodes, and document the details. For performance benchmarks, ensure `npi_orchestrator.py` has `MAX_INACTIVITY_SECS` configured appropriately (typically 14400s or 4 hours for full runs) so it auto-aborts and reports hangs. | ||
| - **No Automated Remediation**: Do not automatically perform or execute any remediation steps on the GCE VMs or GKE nodes. Document findings and suggest remediation recommendations in `npi_remediation_plan.md` as an advisory, but do not apply or execute them. | ||
| - **Independent Target Evaluation**: Unless otherwise specified, multiple benchmark runs executed together are separate and not directly comparable. Do not compare their metrics directly against each other. Present the performance results for each target in separate sections, evaluating each target independently against its own baseline. | ||
| - **RAM Buffer Fallback**: For targets without local SSDs (`has_ssd: false`), verify that the VM host has at least 600GB of RAM. If so, mount a 600GB memory volume (`tmpfs`) at the configured `buffer_mount` directory as the performance test buffer using the setup script. | ||
|
|
||
| ## Required Input Parameters | ||
| Before starting execution, extract the list of target validation environments from the user's request: | ||
| - **Validation Targets**: A list of one or more targets to run. Each target can be GCE (VM name, zone, bucket, BQ dataset, buffer mount path, SSD options) or GKE (cluster name, location, VM name, zone, bucket, BQ dataset, node selector, etc.), in any combination (e.g., multiple GCE, multiple GKE, or a mix of both). | ||
|
|
||
| If the target configuration is missing or ambiguous in the request, ask the user to specify it. Once collected, write the entire list of targets to `targets.json` to parameterize the execution. | ||
|
|
||
| ## Skills & Methods | ||
| Refer to the modular skills in the workspace for step-by-step guidance: | ||
| - SSH Connection: `.gemini/skills/ssh-connection-management/SKILL.md` | ||
| - Conformance: `.gemini/skills/conformance-testing/SKILL.md` | ||
| - Build & Setup: `.gemini/skills/benchmark-build-setup/SKILL.md` | ||
| - Benchmarking: `.gemini/skills/benchmark-suite-execution/SKILL.md` | ||
| - Analysis: `.gemini/skills/analysis-report-generation/SKILL.md` | ||
| - Remediation: `.gemini/skills/remediation-advisor/SKILL.md` |
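
Review note on the **Stall Monitoring** constraint: the rule "log size unchanged for more than 5 minutes" can be expressed as a small detector. The following is a minimal sketch, not code from this PR. The class name `LogStallDetector` and its `observe` method are hypothetical, and the caller is assumed to supply the size of `~/integration_tests.log` (for example from a `stat` over the multiplexed SSH connection) together with a monotonic timestamp. Checking that the `go test` process is still running is left to the caller.

```python
from dataclasses import dataclass
from typing import Optional

# 5 minutes, taken from the Stall Monitoring constraint in the agent definition.
STALL_THRESHOLD_SECS = 5 * 60


@dataclass
class LogStallDetector:
    """Hypothetical helper: flags a stall when a log file stops growing."""

    threshold_secs: float = STALL_THRESHOLD_SECS
    _last_size: Optional[int] = None
    _last_change: Optional[float] = None

    def observe(self, size_bytes: int, now: float) -> bool:
        """Record one size sample and return True if the log looks stalled.

        `now` is any monotonically increasing timestamp in seconds.
        """
        if self._last_size is None or size_bytes != self._last_size:
            # First sample, or the log changed size: reset the inactivity window.
            self._last_size = size_bytes
            self._last_change = now
            return False
        # Size unchanged: stalled only once the window exceeds the threshold.
        return (now - self._last_change) > self.threshold_secs
```

Injecting `now` instead of calling a clock inside the method keeps the detector easy to test and independent of how the size is sampled.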
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,97 @@ | ||
| --- | ||
| name: analysis-report-generation | ||
| description: Guides on querying benchmark results from BigQuery, comparing performance metrics (throughput/latency) against baselines, and generating a structured validation report. | ||
| --- | ||
|
|
||
| # GCSFuse NPI Analysis & Report Generation | ||
|
|
||
| This skill guides you through querying benchmark results from BigQuery tables, performing analysis on throughput and latency trends against historical baselines, verifying machine type configuration optimizations, and compiling the findings into a standard `npi_validation_report.md`. | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| 1. **GCP/BigQuery Access**: The environment must have access to the BigQuery dataset containing the benchmark outputs. | ||
| 2. **Baseline Datasets**: Ensure you know the baseline dataset ID (e.g. `npi_benchmarks_baseline_lro_on` or similar) and the newly generated run's dataset ID. | ||
| 3. **GCSFuse Source Code**: Access to GCSFuse code is required to inspect `params.yaml` for machine type verification. | ||
|
|
||
| ## Step-by-Step Procedure | ||
|
|
||
| ### Step 1: Query BigQuery Results | ||
|
|
||
| Retrieve performance data from the respective benchmark tables (e.g., `go_client_read_http1`, `go_client_read_grpc`, `fio_write_grpc`). | ||
|
|
||
| > [!IMPORTANT] | ||
| > **JSON Key Spacing**: In the FIO JSON output, the version is stored under the key `"fio version"` (with a space). Always query it using the quoted format: `JSON_VALUE(fio_json_output, '$."fio version"')` to avoid returning `NULL`. | ||
|
|
||
| Run queries using the `bq` CLI or a python BigQuery client: | ||
| ```bash | ||
| bq query --project_id=<PROJECT_ID> --use_legacy_sql=false \ | ||
| "SELECT | ||
| run_timestamp, | ||
| iteration, | ||
| JSON_VALUE(fio_json_output, '\$.\"fio version\"') AS fio_version, | ||
| AVG(SAFE_CAST(JSON_VALUE(job.read.bw) AS FLOAT64)) * 1024.0 / 1000000.0 AS avg_read_bw_mb, | ||
| AVG(SAFE_CAST(JSON_VALUE(job.write.bw) AS FLOAT64)) * 1024.0 / 1000000.0 AS avg_write_bw_mb, | ||
| AVG(SAFE_CAST(JSON_VALUE(job.read.clat_ns.mean) AS FLOAT64)) / 1000000.0 AS avg_read_clat_ms | ||
| FROM | ||
| \`<PROJECT_ID>.<DATASET_ID>.<TABLE_ID>\`, | ||
| UNNEST(JSON_EXTRACT_ARRAY(fio_json_output.jobs)) AS job | ||
| GROUP BY 1, 2, 3 | ||
| ORDER BY run_timestamp DESC" | ||
| ``` | ||
|
|
||
| ### Step 2: Compare Against Baselines | ||
|
|
||
| Execute comparison scripts (e.g. `query_results.py`) or calculate the percentage difference in throughput/latency between the baseline dataset and the new run's dataset. | ||
|
|
||
| > [!IMPORTANT] | ||
| > **No Cross-Target Comparisons**: Performance results from different targets (e.g., GKE Node runs vs GCE VM runs) represent distinct platforms and are not directly comparable. Do not compare them against each other, compute cross-target deltas, or label differences between them as regressions. Compare each environment exclusively against its own respective historical baseline. | ||
|
|
||
| Example Comparison Matrix: | ||
| | Protocol | Baseline Throughput (MB/s) | New Run Throughput (MB/s) | Delta (%) | Status | | ||
| | :--- | :--- | :--- | :--- | :--- | | ||
| | HTTP/1.1 | 1240.5 | 1235.2 | -0.4% | Neutral | | ||
| | gRPC | 3450.0 | 2890.5 | -16.2% | **REGRESSION** | | ||
|
|
||
| ### Step 3: Verify Machine Type Configuration | ||
|
|
||
| Verify if the GCE VM or GKE node machine type (e.g., `c4-standard-96`) is classified under the high-performance machine types in the main GCSFuse repository: | ||
| 1. Locate `params.yaml` in the cloned GCSFuse repository. | ||
| 2. Search for the machine family or type. | ||
| 3. If it is missing, note it in the validation report as a required follow-up task (create a PR to add the machine family). | ||
|
|
||
| ### Step 4: Generate `npi_validation_report.md` | ||
|
|
||
| Compile the queried results, baseline comparison, and machine family configuration verification into `npi_validation_report.md`. | ||
|
|
||
| For each target validation environment executed (GCE VM or GKE Cluster), create a separate section and performance table to isolate its metrics and prevent incorrect direct comparisons. | ||
|
|
||
| The report must follow this structure: | ||
| ```markdown | ||
| # GCSFuse NPI Validation Report | ||
|
|
||
| ## Executive Summary | ||
| [Brief description of whether the run meets performance criteria and if any regressions/failures were detected.] | ||
|
|
||
| ## Run Details | ||
| - **Timestamp**: [ISO 8601 Timestamp] | ||
| - **Target Platforms**: [List of all target names, e.g. GCE VM target-1, GKE Cluster target-2, etc.] | ||
|
|
||
| ## Target Performance Results | ||
|
|
||
| ### [TARGET_NAME_1] (Platform Type, e.g., GCE VM) | ||
| - **GCSFuse Version**: [e.g. v3.9.0] | ||
| - **Target Bucket**: [RAPID / Regional] | ||
| - **Performance Metrics vs Baseline**: | ||
| | Benchmark / Protocol | Baseline (Version) | Current Run (Version) | Delta (%) | Status | | ||
| |---|---|---|---|---| | ||
| | HTTP1 Read | 1250 MB/s | 1240 MB/s | -0.8% | PASS | | ||
| | gRPC Read | 3500 MB/s | 2800 MB/s | -20.0% | FAIL (Regression) | | ||
|
|
||
| ## High-Performance Machine Type Classification | ||
| - **Machine Type Used**: `c4-standard-96` | ||
| - **Configured in `params.yaml`?**: [Yes/No] | ||
| - **Action Required**: [None / Create PR in GCSFuse repo to add the machine type] | ||
|
|
||
| ## Observations & Issues | ||
| - [Detail any errors observed, e.g., TLS Handshake Errors, GKE OOMs, Direct Path fallback issues.] | ||
| ``` |
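
Review note on Step 2 of this skill: the `Delta (%)` column and the unit conversion used in the Step 1 query can be sketched in a few lines. This is an illustration only, not the contents of `query_results.py`. The function names are hypothetical, and the 10% regression threshold is an assumption, since the skill does not state one. The classification applies to throughput, where lower is worse.

```python
def kib_per_sec_to_mb_per_sec(bw_kib: float) -> float:
    """Convert a KiB/s bandwidth value to decimal MB/s, as the Step 1 query does."""
    return bw_kib * 1024.0 / 1_000_000.0


def percent_delta(baseline: float, current: float) -> float:
    """Percentage change of `current` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline * 100.0


def classify_throughput(delta_pct: float, regression_threshold_pct: float = 10.0) -> str:
    """Label a throughput delta; the default threshold is an assumed value."""
    if delta_pct <= -regression_threshold_pct:
        return "REGRESSION"
    return "Neutral"


# The two rows of the example comparison matrix.
for baseline, current in [(1240.5, 1235.2), (3450.0, 2890.5)]:
    delta = percent_delta(baseline, current)
    print(f"{baseline} -> {current}: {delta:.1f}% {classify_throughput(delta)}")
```

Per the no-cross-target rule, `baseline` and `current` should always come from the same target's own history.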
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,75 @@ | ||
| --- | ||
| name: benchmark-build-setup | ||
| description: Guides on checking out GCSFuse, configuring targets (Docker/RAM disks), and building/pushing benchmarking images to Artifact Registry. | ||
| --- | ||
|
|
||
| # Benchmark Build and Setup for GCSFuse NPI | ||
|
|
||
| This skill guides you through checking out the GCSFuse repository, configuring target VMs with storage buffers and Docker, and building/pushing benchmark images to Google Artifact Registry. | ||
|
|
||
| ## Step 1: Clone/Prepare GCSFuse & Custom Matrices | ||
|
|
||
| 1. **Clone / Verify GCSFuse Repository**: | ||
| Verify that the GCSFuse repository or sub-module is checked out locally to the expected branch or tag. | ||
| 2. **Smoke-Test Matrix Customization**: | ||
| If running quick verification or smoke tests, edit the matrix files to execute only minimal iterations: | ||
| * Edit: `fio/read_matrix.csv` | ||
| * Edit: `fio/write_matrix.csv` | ||
| *(Note: Remember to run `git restore fio/read_matrix.csv fio/write_matrix.csv` after the images are built and pushed to avoid checking in modified matrices).* | ||
|
|
||
| ## Step 2: Configure Target VMs | ||
|
|
||
| Configure the storage buffer and Docker workspace on each target VM using the established SSH master connection socket. | ||
|
|
||
| ### A. Configure Storage Buffer | ||
| * **Unified Buffer Setup (Local SSD or RAM Fallback)**: | ||
| Execute `raid0-script.sh` on the target VM, passing the target mount path (from `targets.json`'s `buffer_mount`) as the argument. The script will automatically build a RAID0 array from local SSDs if present. If no local SSDs are found, it will verify that the host has at least 600GB of RAM and mount a 600GB memory volume (`tmpfs`) at the mount path instead: | ||
| ```bash | ||
| # Copy script to target | ||
| scp -o ControlPath=~/.ssh/sockets/<TARGET_NAME>.sock -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/google_compute_engine raid0-script.sh <SSH_USER>@nic0.<VM_NAME>.<ZONE>.c.<PROJECT_ID>.internal.gcpnode.com:~/raid0-script.sh | ||
|
|
||
| # Run script with the target mount path argument (e.g. /mnt/lssd) | ||
| ssh -S ~/.ssh/sockets/<TARGET_NAME>.sock -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/google_compute_engine <SSH_USER>@nic0.<VM_NAME>.<ZONE>.c.<PROJECT_ID>.internal.gcpnode.com "bash ~/raid0-script.sh <SSD_MOUNT_PATH>" | ||
| ``` | ||
|
|
||
| ### B. Install Docker & Configure Permissions | ||
| Install Docker on the target VM and add the SSH user to the docker group: | ||
| ```bash | ||
| ssh -S ~/.ssh/sockets/<TARGET_NAME>.sock -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/google_compute_engine <SSH_USER>@nic0.<VM_NAME>.<ZONE>.c.<PROJECT_ID>.internal.gcpnode.com "curl -fsSL https://get.docker.com -o get-docker.sh && sudo sh get-docker.sh && sudo usermod -aG docker \$USER && rm get-docker.sh" | ||
| ``` | ||
|
|
||
| **CRITICAL**: Since group memberships are only evaluated at session startup, recreate the SSH multiplexing socket to apply the docker group changes: | ||
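| 1. Close the master connection: `ssh -S ~/.ssh/sockets/<TARGET_NAME>.sock -O exit <SSH_USER>@nic0.<VM_NAME>.<ZONE>.c.<PROJECT_ID>.internal.gcpnode.com`, then remove any leftover socket file with `rm -f ~/.ssh/sockets/<TARGET_NAME>.sock`. | ||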
| 2. Re-establish the connection socket (using the `ssh-connection-management` skill). | ||
|
|
||
| ### C. Configure Registry Access on Target | ||
| Enable the target VM docker daemon to pull images from Artifact Registry: | ||
| ```bash | ||
| ssh -S ~/.ssh/sockets/<TARGET_NAME>.sock -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/google_compute_engine <SSH_USER>@nic0.<VM_NAME>.<ZONE>.c.<PROJECT_ID>.internal.gcpnode.com "gcloud auth configure-docker us-docker.pkg.dev -q" | ||
| ``` | ||
|
|
||
| ## Step 3: Build & Push Benchmark Images | ||
|
|
||
| Build and push GCSFuse benchmarking images (with FIO/Go-Client inside) to your Google Artifact Registry: | ||
|
|
||
| 1. **Configure Registry Auth Locally**: | ||
| Ensure local credentials can write to the Artifact Registry: | ||
| ```bash | ||
| gcloud auth configure-docker us-docker.pkg.dev | ||
| ``` | ||
| 2. **Execute Build Script**: | ||
| ```bash | ||
| python3 build_images.py --project <PROJECT_ID> --image-version <IMAGE_VERSION> --gcsfuse-version <GCSFUSE_VERSION> | ||
| ``` | ||
| 3. **Restore Matrices**: | ||
| If you customized matrix files in Step 1, revert the local changes: | ||
| ```bash | ||
| git restore fio/read_matrix.csv fio/write_matrix.csv | ||
| ``` | ||
|
|
||
| ## Step 4: Verify Image Availability | ||
|
|
||
| Verify that the benchmark image is successfully pushed and available in Artifact Registry: | ||
| ```bash | ||
| gcloud artifacts docker images list us-docker.pkg.dev/<PROJECT_ID>/gcsfuse-npi-images --include-tags --format='value(format("{0}:{1}",package,tags))' | grep "<IMAGE_VERSION>" | ||
| ``` |
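
Review note on Step 2 of this skill: the same long `ssh` invocation is repeated for every remote command, which invites copy-paste drift. The sketch below builds the argument list once. It is an illustration, not part of this PR. The function names `target_host` and `ssh_command` are hypothetical, and the hostname pattern and option set are copied from the commands above.

```python
import os
from typing import List


def target_host(vm_name: str, zone: str, project_id: str) -> str:
    """Internal hostname pattern used by the commands in this skill."""
    return f"nic0.{vm_name}.{zone}.c.{project_id}.internal.gcpnode.com"


def ssh_command(
    target_name: str,
    ssh_user: str,
    vm_name: str,
    zone: str,
    project_id: str,
    remote_cmd: str,
) -> List[str]:
    """Build the ssh argv that reuses the target's multiplexing socket."""
    socket_path = os.path.expanduser(f"~/.ssh/sockets/{target_name}.sock")
    key_path = os.path.expanduser("~/.ssh/google_compute_engine")
    return [
        "ssh",
        "-S", socket_path,
        "-o", "StrictHostKeyChecking=no",
        "-o", "UserKnownHostsFile=/dev/null",
        "-i", key_path,
        f"{ssh_user}@{target_host(vm_name, zone, project_id)}",
        remote_cmd,
    ]
```

Returning a list rather than a shell string lets the caller pass it straight to `subprocess.run` without local shell quoting; the remote command is still interpreted by the remote shell.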
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,98 @@ | ||
| --- | ||
| name: benchmark-suite-execution | ||
| description: Guides on configuring, executing, monitoring, and exporting results of GCSFuse NPI benchmark runs on GCE and GKE. | ||
| --- | ||
|
|
||
| # Benchmark Suite Execution for GCSFuse NPI | ||
|
|
||
| This skill guides you through defining target environments, executing benchmarks concurrently on GCE and GKE, monitoring execution for hangs/failures, and verifying BigQuery export status. | ||
|
|
||
| ## Step 1: Configuration & Targets Setup | ||
|
|
||
| Before starting a run, collect target details and populate `targets.json` in the root configuration directory. | ||
|
|
||
| ### A. Collect Inputs | ||
| 1. **Target Platforms**: Determine if running on GCE VM, GKE cluster, or both. | ||
| 2. **GCE VM Details** (if applicable): VM Name, zone, SSD presence, SSD mount path (e.g. `/mnt/lssd` or `/tmp/npi_buffer`), and RAPID bucket usage. | ||
| 3. **GKE Details** (if applicable): Intermediate VM details, Cluster name, region/zone, SSD/RAM configuration, RAPID bucket usage, node selectors, resource limits. | ||
| 4. **GCS Buckets**: Target regional and/or RAPID (zonal) bucket names. | ||
| 5. **GCP Project**: GCP Project ID (e.g. `gcs-fuse-test`). | ||
|
|
||
| ### B. Configure `targets.json` | ||
| Populate `targets.json` with the corresponding target details. Format: | ||
| ```json | ||
| [ | ||
| { | ||
| "name": "gce-c4-ssd", | ||
| "type": "gce", | ||
| "vm_name": "<GCE_VM_NAME>", | ||
| "zone": "<GCE_ZONE>", | ||
| "bucket": "<REGIONAL_BUCKET>", | ||
| "dataset": "<BQ_DATASET_PREFIX>", | ||
| "buffer_mount": "<SSD_MOUNT_PATH>", | ||
| "has_ssd": true, | ||
| "is_rapid_bucket": false | ||
| }, | ||
| { | ||
| "name": "gke-tpu-slice", | ||
| "type": "gke", | ||
| "vm_name": "<GKE_INTERMEDIATE_VM_NAME>", | ||
| "zone": "<GKE_INTERMEDIATE_VM_ZONE>", | ||
| "cluster_name": "<GKE_CLUSTER_NAME>", | ||
| "location": "<GKE_CLUSTER_LOCATION>", | ||
| "bucket": "<REGIONAL_BUCKET>", | ||
| "dataset": "<BQ_DATASET_PREFIX>", | ||
| "node_selector": "cloud.google.com/gke-accelerator-count=4,cloud.google.com/gke-nodepool=ct6e-pool,cloud.google.com/gke-tpu-accelerator=tpu-v6e-slice,cloud.google.com/gke-tpu-topology=2x2", | ||
| "resources_limits": "google.com/tpu=4", | ||
| "has_ssd": false, | ||
| "is_rapid_bucket": true | ||
| } | ||
| ] | ||
| ``` | ||
|
|
||
| ## Step 2: Prerequisites Validation | ||
|
|
||
| 1. **Local Authentication**: Ensure `gcloud` and `kubectl` are configured locally. | ||
| 2. **Access Verification**: Verify direct path routing and quota if using RAPID buckets. | ||
| 3. **Logging**: Ensure all execution commands are logged locally/remotely to `npi_commands.log`. | ||
|
|
||
| ## Step 3: Execute Orchestrated Benchmarks | ||
|
|
||
| 1. **State Reset**: | ||
| * **Clean Run / Retrigger**: If starting a completely fresh run or recovering from a corrupted state, clean up state files: | ||
| ```bash | ||
| rm -f ~/.npi/npi_run_state.json | ||
| ``` | ||
| *(This forces a clean retry, terminating active containers, mounts, and jobs before sync and relaunch).* | ||
| * **Resume Run**: To resume an active background run without starting over, keep the state file intact. | ||
|
|
||
| 2. **Execute Orchestrator**: | ||
| Run the orchestrator script, specifying the benchmarks, image version, and iterations: | ||
| ```bash | ||
| python3 npi_orchestrator.py --benchmarks "<BENCHMARK_LIST>" --image-version <IMAGE_VERSION> --iterations <ITERATION_COUNT> | ||
| ``` | ||
| Examples of `<BENCHMARK_LIST>`: `read_parallel,write_parallel` or `all`. | ||
|
|
||
| ## Step 4: Monitor and Safety Policies | ||
|
|
||
| Observe logs for the following active safety policies enforced by the orchestrator: | ||
|
|
||
| 1. **Inactivity Timeout**: | ||
| * Logs are monitored continuously by `npi_orchestrator.py`. | ||
| * If no log output is detected for 5 minutes (300s), the run is automatically aborted to protect against hangs. | ||
| 2. **Disk Space Protection**: | ||
| * If disk usage on the target storage buffer exceeds 85%, GCE runs are immediately aborted to prevent out-of-disk failures. | ||
| 3. **GKE TPU Memory Management**: | ||
| * Ensure `--use-memory-volumes` flag is enabled in `npi_gke.py` to mount buffers in RAM. | ||
| * Avoid running file cache tests (`read_file_cache`) on TPU slices to prevent host Out-Of-Memory (OOM) situations. | ||
|
|
||
| ## Step 5: Verify BigQuery Results Export | ||
|
|
||
| Upon successful completion, the orchestrator uploads FIO and Go Client JSON output to BigQuery. | ||
| Verify the upload: | ||
| 1. Locate the dataset: `<BQ_DATASET_PREFIX>` configured in `targets.json`. | ||
| 2. Query the BQ table using the `bq` tool or the Google Cloud Console: | ||
| ```sql | ||
| SELECT COUNT(*) FROM `<PROJECT_ID>.<BQ_DATASET_PREFIX>_dataset.fio_results` WHERE image_version = '<IMAGE_VERSION>' | ||
| ``` | ||
| 3. Ensure the count matches the expected number of iterations and test runs. | ||
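
Review note on Step 1B of this skill: a malformed `targets.json` is cheaper to catch before the orchestrator starts than halfway through a run. The following is a minimal validation sketch, not code from this PR. The function name `validate_targets` is hypothetical, and the required-key sets are an assumption inferred from the example `targets.json` above, not a schema defined by the orchestrator.

```python
from typing import Any, Dict, List

# Assumption: keys that appear in both example entries are treated as required.
COMMON_KEYS = {
    "name", "type", "vm_name", "zone", "bucket", "dataset",
    "has_ssd", "is_rapid_bucket",
}
# Assumption: per-type extras, inferred from the example entries.
EXTRA_KEYS = {
    "gce": {"buffer_mount"},
    "gke": {"cluster_name", "location"},
}


def validate_targets(targets: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means no issues were found."""
    errors: List[str] = []
    seen_names = set()
    for index, target in enumerate(targets):
        label = str(target.get("name", f"#{index}"))
        target_type = target.get("type")
        if target_type not in EXTRA_KEYS:
            errors.append(f"{label}: unknown type {target_type!r}")
            continue
        missing = sorted((COMMON_KEYS | EXTRA_KEYS[target_type]) - set(target))
        if missing:
            errors.append(f"{label}: missing {', '.join(missing)}")
        for flag in ("has_ssd", "is_rapid_bucket"):
            if flag in target and not isinstance(target[flag], bool):
                errors.append(f"{label}: {flag} must be true or false")
        if label in seen_names:
            errors.append(f"{label}: duplicate target name")
        seen_names.add(label)
    return errors
```

A caller could load the file with `json.load` and refuse to start the run if the returned list is non-empty.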