16 commits
2e4ef40
docs: add GCE and GKE documentation for running NPI benchmarks
kislaykishore Apr 10, 2026
db7dd1e
docs: use roles/storage.objectUser instead of objectAdmin
kislaykishore Apr 10, 2026
017be3e
docs: remove unnecessary privileged: true from GKE pod spec
kislaykishore Apr 10, 2026
b14c79f
docs: add lscpu and python3 to prerequisites in GCE docs
kislaykishore Apr 10, 2026
57a1d6b
docs: replace hardcoded v3.5.6 with YOUR_GCSFUSE_VERSION template
kislaykishore Apr 10, 2026
83c7e0c
docs: extract GKE pod specs to gke_pod_specs directory
kislaykishore Apr 10, 2026
d312302
docs: add run_gke_benchmarks.sh to orchestrate sequential pod execution
kislaykishore Apr 10, 2026
b26ed99
refactor: use Kubernetes Jobs instead of Pods for GKE benchmarks
kislaykishore Apr 10, 2026
61a05fe
docs: add bigquery analysis scripts for extracting throughput and lat…
kislaykishore Apr 10, 2026
418c2ba
docs: add GKE cluster setup requirements (node pools and node-selectors)
kislaykishore Apr 10, 2026
3005701
docs: improve clarity and make instructions production-ready for vendors
kislaykishore Apr 10, 2026
df359cf
docs: explicitly state Workload Identity and GCS FUSE CSI driver requ…
kislaykishore Apr 10, 2026
222bc10
docs: add get-credentials step for authenticating to GKE cluster
kislaykishore Apr 10, 2026
a6e0e5c
feat: add executable Jupyter Notebook playbook for GKE benchmarks
kislaykishore Apr 10, 2026
95f07d3
feat: add executable Jupyter Notebook playbook for GCE benchmarks
kislaykishore Apr 10, 2026
e0d0447
feat: add CLI bash scripts for automated playbook execution
kislaykishore Apr 10, 2026
95 changes: 95 additions & 0 deletions npi/bq_queries.md
@@ -0,0 +1,95 @@
# BigQuery Performance Analysis Queries

After running the FIO NPI benchmarks and collecting the data in BigQuery, you can use the following queries to extract throughput and latency characteristics.

The benchmark runner uploads the full FIO JSON output into a native `JSON` column named `fio_json_output`. We can use BigQuery's native JSON accessors (e.g., `fio_json_output.jobs[0].read.bw`) to extract the metrics.

*Note: In FIO's JSON output, `bw` is reported in KiB/s, and completion latency `clat_ns.mean` is reported in nanoseconds.*
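
The unit conversions used by the queries in this guide can be checked locally before running them. The following is a minimal Python sketch, assuming a FIO JSON document shaped like the one stored in `fio_json_output`; the helper name `summarize_job` and the sample values are hypothetical and are not part of the benchmark tooling.

```python
import json

KIB_PER_MIB = 1024
NS_PER_MS = 1_000_000


def summarize_job(fio_output, job_index=0):
    """Convert FIO's raw units for one job into MiB/s and milliseconds."""
    job = fio_output["jobs"][job_index]
    summary = {}
    for direction in ("read", "write"):
        stats = job[direction]
        # FIO reports 'bw' in KiB/s and 'clat_ns.mean' in nanoseconds.
        summary[f"{direction}_throughput_mib_s"] = stats["bw"] / KIB_PER_MIB
        summary[f"{direction}_clat_mean_ms"] = stats["clat_ns"]["mean"] / NS_PER_MS
    return summary


# Hypothetical, heavily trimmed FIO result used only for illustration.
sample = json.loads("""
{"jobs": [{"read":  {"bw": 2097152, "clat_ns": {"mean": 2500000.0}},
           "write": {"bw": 0,       "clat_ns": {"mean": 0.0}}}]}
""")
print(summarize_job(sample))
```

Like the queries, this sketch only looks at the first entry of `jobs`; a FIO run that reports several jobs would need the index adjusted.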

## 1. Extract Raw Throughput and Latency per Iteration

This query retrieves the read and write throughput (in MiB/s) and the mean completion latency (in ms) for every iteration of your benchmark.

```sql
SELECT
  run_timestamp,
  iteration,
  fio_env,

  -- Throughput (FIO 'bw' is in KiB/s -> convert to MiB/s)
  FLOAT64(fio_json_output.jobs[0].read.bw) / 1024 AS read_throughput_mib_s,
  FLOAT64(fio_json_output.jobs[0].write.bw) / 1024 AS write_throughput_mib_s,

  -- Latency (FIO 'clat_ns.mean' is in nanoseconds -> convert to ms)
  FLOAT64(fio_json_output.jobs[0].read.clat_ns.mean) / 1000000.0 AS read_clat_mean_ms,
  FLOAT64(fio_json_output.jobs[0].write.clat_ns.mean) / 1000000.0 AS write_clat_mean_ms

FROM
  `YOUR_PROJECT_ID.YOUR_BQ_DATASET_ID.YOUR_TABLE_ID`
ORDER BY
  run_timestamp DESC,
  iteration ASC;
```

## 2. Average Performance Across All Iterations (Grouped by Environment)

This query is useful when you have run multiple iterations (e.g., `--iterations=5`) and want the average throughput and latency grouped by the FIO environment variables (e.g., block size, threads, etc.).

```sql
SELECT
  fio_env,

  -- Average Read Metrics
  AVG(FLOAT64(fio_json_output.jobs[0].read.bw)) / 1024 AS avg_read_throughput_mib_s,
  AVG(FLOAT64(fio_json_output.jobs[0].read.clat_ns.mean)) / 1000000.0 AS avg_read_clat_mean_ms,

  -- Average Write Metrics
  AVG(FLOAT64(fio_json_output.jobs[0].write.bw)) / 1024 AS avg_write_throughput_mib_s,
  AVG(FLOAT64(fio_json_output.jobs[0].write.clat_ns.mean)) / 1000000.0 AS avg_write_clat_mean_ms,

  -- Count the number of iterations that made up this average
  COUNT(iteration) AS iteration_count

FROM
  `YOUR_PROJECT_ID.YOUR_BQ_DATASET_ID.YOUR_TABLE_ID`
GROUP BY
  fio_env
ORDER BY
  avg_read_throughput_mib_s DESC;
```
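
The grouping and averaging that this query performs can also be reproduced on rows that have already been exported from BigQuery. The sketch below is a rough Python equivalent for the read-throughput column only; the row layout, the `read_bw_kib_s` key, and the `fio_env` values are assumptions made for illustration (here `fio_env` is treated as an opaque string key).

```python
from collections import defaultdict
from statistics import mean

KIB_PER_MIB = 1024


def average_by_env(rows):
    """Group per-iteration rows by fio_env and average the read bandwidth."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["fio_env"]].append(row["read_bw_kib_s"])
    return {
        env: {
            # Averaging first and dividing afterwards matches AVG(bw) / 1024.
            "avg_read_throughput_mib_s": mean(values) / KIB_PER_MIB,
            "iteration_count": len(values),
        }
        for env, values in grouped.items()
    }


# Hypothetical exported rows, not real benchmark results.
rows = [
    {"fio_env": "bs=1M", "read_bw_kib_s": 1024000},
    {"fio_env": "bs=1M", "read_bw_kib_s": 1072000},
    {"fio_env": "bs=4K", "read_bw_kib_s": 51200},
]
print(average_by_env(rows))
```

Checking `iteration_count` against the expected number of iterations is a quick way to spot groups where some runs failed to upload.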

## 3. Compare Two Different Tables (e.g., HTTP/1.1 vs gRPC)

If you have written HTTP/1.1 results to one table and gRPC results to another, you can combine them with `UNION ALL` and compare the per-protocol average and median read throughput for each FIO environment.

```sql
WITH combined_results AS (
  SELECT
    'HTTP/1.1' AS protocol,
    fio_env,
    FLOAT64(fio_json_output.jobs[0].read.bw) / 1024 AS read_throughput_mib_s
  FROM
    `YOUR_PROJECT_ID.YOUR_BQ_DATASET_ID.fio_read_http1`

  UNION ALL

  SELECT
    'gRPC' AS protocol,
    fio_env,
    FLOAT64(fio_json_output.jobs[0].read.bw) / 1024 AS read_throughput_mib_s
  FROM
    `YOUR_PROJECT_ID.YOUR_BQ_DATASET_ID.fio_read_grpc`
)

SELECT
  protocol,
  fio_env,
  AVG(read_throughput_mib_s) AS avg_read_throughput_mib_s,
  APPROX_QUANTILES(read_throughput_mib_s, 100)[OFFSET(50)] AS median_read_throughput_mib_s
FROM
  combined_results
GROUP BY
  protocol, fio_env
ORDER BY
  fio_env, protocol;
```
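
`APPROX_QUANTILES(x, 100)[OFFSET(50)]` returns an approximate median, so it can be worth cross-checking against an exact median on a small export. The following Python sketch shows that comparison under stated assumptions: the function name, the labels `protocol_a` and `protocol_b`, and all numbers are invented placeholders, not measurements of either protocol.

```python
from statistics import mean, median


def summarize_by_protocol(samples):
    """Return the mean and exact median throughput for each label."""
    return {
        label: {"avg_mib_s": mean(values), "median_mib_s": median(values)}
        for label, values in samples.items()
    }


# Invented placeholder throughput values (MiB/s) for illustration only.
samples = {
    "protocol_a": [400.0, 900.0, 1000.0, 1050.0, 1100.0],
    "protocol_b": [10.0, 20.0, 30.0, 40.0],
}
print(summarize_by_protocol(samples))
```

Reporting the median next to the average is useful because a single slow iteration (the `400.0` above) pulls the average down while leaving the median unchanged.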
126 changes: 126 additions & 0 deletions npi/gce_npi.md
@@ -0,0 +1,126 @@
# Running NPI Benchmarks on GCE

This guide explains how to build the required Docker images and run the NPI (Network Performance Improvement) benchmarks on a Google Compute Engine (GCE) VM.

> **Note to Operators / Vendors:** Please ensure you have completed the prerequisite steps and gathered all required variables before executing the scripts.

## Variables Glossary

Before starting, gather the following information. You will need to substitute these placeholders in the commands throughout this guide:
* `YOUR_PROJECT_ID`: The GCP project ID where your resources (Artifact Registry, GCS, BigQuery) reside.
* `YOUR_GCS_BUCKET`: The GCS bucket name used for reading/writing test data (e.g., `my-benchmark-bucket` — omit the `gs://` prefix).
* `YOUR_BQ_DATASET_ID`: The BigQuery dataset where the benchmark results will be inserted (e.g., `npi_results`).
* `YOUR_GCSFUSE_VERSION`: The GCSFuse version tag to test (e.g., `v3.5.6`).
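
A common slip when filling in these placeholders is pasting the bucket as a `gs://` URL. The sketch below shows one way to normalize the value before substituting it into commands; `normalize_bucket_name` is a hypothetical helper written for this guide, not part of `npi.py`.

```python
def normalize_bucket_name(value):
    """Strip an optional 'gs://' prefix and trailing slash from a bucket name."""
    name = value.strip()
    if name.startswith("gs://"):
        name = name[len("gs://"):]
    name = name.rstrip("/")
    # A bare bucket name must be non-empty and must not contain an object path.
    if not name or "/" in name:
        raise ValueError(f"expected a bare bucket name, got {value!r}")
    return name


print(normalize_bucket_name("gs://my-benchmark-bucket/"))
```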

## Prerequisites

1. **Google Cloud Project**: You need a GCP project to host your Artifact Registry, Cloud Storage bucket, and BigQuery dataset.
2. **VM Setup**: A GCE VM with Docker installed. The VM's service account must be granted the following access scopes:
   * `https://www.googleapis.com/auth/bigquery`
   * `https://www.googleapis.com/auth/devstorage.read_write`
   * `https://www.googleapis.com/auth/cloud-platform` (or, in place of this broad scope, the equivalent granular scopes)
3. **System Utilities**: The `npi.py` script requires `python3`. Additionally, the `lscpu` command-line utility is required for NUMA-aware benchmarks. These are usually pre-installed, but you can ensure they are present by installing the `util-linux` and `python3` packages:
   ```bash
   sudo apt-get update && sudo apt-get install -y util-linux python3
   ```
4. **Authentication**: If not using a VM service account, you must authenticate using `gcloud auth login` and `gcloud auth application-default login`.
5. **BigQuery Dataset**: Create a BigQuery dataset to store the benchmark results.
6. **Artifact Registry**: You must have an Artifact Registry repository to host the Docker images.
   ```bash
   gcloud services enable artifactregistry.googleapis.com --project=YOUR_PROJECT_ID
   gcloud artifacts repositories create gcsfuse-benchmarks \
     --repository-format=docker \
     --location=us \
     --project=YOUR_PROJECT_ID
   ```

## Step 1: Build the Benchmark Images

The NPI benchmarks use Docker containers to provide an isolated environment for running tests. You can build and push these images to your Artifact Registry using the provided `Makefile`.

By default, the `Makefile` builds for project `gcs-fuse-test`. You should either edit the `Makefile` or pass your project ID to the `make` command:

```bash
cd gcsfuse-tools/npi

# Option 1: Override project and version in the make command
make build PROJECT=YOUR_PROJECT_ID GCSFUSE_VERSION=YOUR_GCSFUSE_VERSION

# Example:
# make build PROJECT=my-project GCSFUSE_VERSION=v3.5.6

# Option 2: Edit Makefile directly to change the PROJECT variable, then run:
make build
```

This will trigger a Cloud Build job (`cloudbuild.yaml`) that builds the `fio-read-benchmark`, `fio-write-benchmark`, and `orbax-emulated-benchmark` images and pushes them to `us-docker.pkg.dev/YOUR_PROJECT_ID/gcsfuse-benchmarks/`.
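
The image locations described above follow a fixed pattern, which makes it easy to check a listing for missing images. This is a minimal Python sketch, assuming you have already collected the image paths from the registry into a list; the helper names `image_paths` and `missing_images` are hypothetical.

```python
REGISTRY_HOST = "us-docker.pkg.dev"
REPOSITORY = "gcsfuse-benchmarks"
IMAGES = (
    "fio-read-benchmark",
    "fio-write-benchmark",
    "orbax-emulated-benchmark",
)


def image_paths(project_id):
    """Return the expected Artifact Registry path for each benchmark image."""
    return [f"{REGISTRY_HOST}/{project_id}/{REPOSITORY}/{image}" for image in IMAGES]


def missing_images(project_id, listed_paths):
    """Return the expected image paths that do not appear in 'listed_paths'."""
    present = set(listed_paths)
    return [path for path in image_paths(project_id) if path not in present]


# Hypothetical listing in which one image was not pushed.
listed = [
    "us-docker.pkg.dev/my-project/gcsfuse-benchmarks/fio-read-benchmark",
    "us-docker.pkg.dev/my-project/gcsfuse-benchmarks/fio-write-benchmark",
]
print(missing_images("my-project", listed))
```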

> **Verification:** Once the `make build` command finishes, verify the images exist in your Artifact Registry before proceeding to the next step:
> ```bash
> gcloud artifacts docker images list us-docker.pkg.dev/YOUR_PROJECT_ID/gcsfuse-benchmarks
> ```

## Step 2: Run the Benchmarks

Once the images are built, you can use the `npi.py` script to orchestrate the benchmark runs. The script automatically generates the correct `docker run` commands based on the desired benchmarks and configurations.

### Basic Usage

Run all available benchmarks:

```bash
python3 npi.py \
--benchmarks 'all' \
--bucket-name YOUR_GCS_BUCKET \
--project-id YOUR_PROJECT_ID \
--bq-dataset-id YOUR_BQ_DATASET_ID \
--gcsfuse-version YOUR_GCSFUSE_VERSION

# Example:
# python3 npi.py --benchmarks 'all' --bucket-name my-bucket --project-id my-project --bq-dataset-id my_dataset --gcsfuse-version v3.5.6
```

### Specifying Benchmarks

You can specify a space-separated list of specific benchmarks to run. For example, to run only the `read_http1` and `write_grpc` benchmarks:

```bash
python3 npi.py \
--benchmarks read_http1 write_grpc \
--bucket-name YOUR_GCS_BUCKET \
--project-id YOUR_PROJECT_ID \
--bq-dataset-id YOUR_BQ_DATASET_ID \
--gcsfuse-version YOUR_GCSFUSE_VERSION
```

### Understanding the Parameters

* `--benchmarks`: 'all' or a list of specific benchmarks (e.g., `read_http1`, `orbax_read_grpc_numa0_fio_bound`).
* `--bucket-name`: The GCS bucket where FIO will read/write data.
* `--project-id`: The GCP Project ID containing your BigQuery dataset.
* `--bq-dataset-id`: The BigQuery dataset where results will be inserted.
* `--gcsfuse-version`: The version tag used when building the images (e.g., `v3.5.6`).
* `--iterations`: (Optional) Number of FIO test iterations. Default is 5.
* `--temp-dir`: (Optional) FUSE temp directory type (`boot-disk` or `memory`).
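
When launching `npi.py` from another script, assembling the arguments as a list avoids shell-quoting mistakes. The sketch below builds such a list from the flags documented in this guide; `build_npi_command` is a hypothetical wrapper, and the example values are placeholders.

```python
import shlex


def build_npi_command(benchmarks, bucket_name, project_id, bq_dataset_id,
                      gcsfuse_version, iterations=None, temp_dir=None,
                      dry_run=False):
    """Assemble the npi.py argument list from the documented flags."""
    cmd = [
        "python3", "npi.py",
        "--benchmarks", *benchmarks,
        "--bucket-name", bucket_name,
        "--project-id", project_id,
        "--bq-dataset-id", bq_dataset_id,
        "--gcsfuse-version", gcsfuse_version,
    ]
    if iterations is not None:
        cmd += ["--iterations", str(iterations)]
    if temp_dir is not None:
        # The guide lists 'boot-disk' and 'memory' as the accepted values.
        if temp_dir not in ("boot-disk", "memory"):
            raise ValueError(f"unsupported temp dir type: {temp_dir!r}")
        cmd += ["--temp-dir", temp_dir]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


cmd = build_npi_command(
    ["read_http1", "write_grpc"], "my-bucket", "my-project",
    "my_dataset", "v3.5.6", dry_run=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)` from the `npi` directory; `shlex.join` is used here only to print a copy-pasteable form.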

### Dry Run

To see exactly what `docker run` commands the script will execute without actually running them, add the `--dry-run` flag:

```bash
python3 npi.py \
--benchmarks read_http1 \
--bucket-name YOUR_GCS_BUCKET \
--project-id YOUR_PROJECT_ID \
--bq-dataset-id YOUR_BQ_DATASET_ID \
--gcsfuse-version YOUR_GCSFUSE_VERSION \
--dry-run
```

> **Troubleshooting Tip:** If `npi.py` fails with permission errors, make sure you have run `gcloud auth application-default login`, or that the VM's service account has the access scopes listed under Prerequisites (`bigquery`, `devstorage.read_write`, and `cloud-platform`).

## Step 3: Analyze Results

Once the benchmarks complete, the results are populated automatically into your BigQuery dataset (each benchmark gets its own table).

To extract useful performance characteristics such as throughput (MiB/s) and latency (ms) from the raw FIO JSON in BigQuery, refer to the [BigQuery Performance Analysis Queries](bq_queries.md) guide.
135 changes: 135 additions & 0 deletions npi/gce_npi_playbook.ipynb
@@ -0,0 +1,135 @@
{
"cells": [
{
"id": "3cb87511",
"cell_type": "markdown",
"source": [
"# GCE NPI Benchmark Playbook\n",
"\n",
"This runnable playbook guides you through building benchmark images and remotely orchestrating NPI benchmarks on a Google Compute Engine (GCE) VM.\n",
"\n",
"Unlike GKE, which runs Jobs natively, GCE requires us to execute `npi.py`, which runs Docker containers on the target machine. This notebook executes those commands remotely via `gcloud compute ssh`.\n",
"\n",
"### Instructions\n",
"1. Ensure your GCE VM is already created and running.\n",
"2. Run each cell sequentially.\n",
"3. If a cell fails, stop and resolve the error before continuing."
],
"metadata": {}
},
{
"id": "941d6c12",
"cell_type": "code",
"source": [
"# --- CONFIGURATION VARIABLES ---\n",
"# Replace these with your actual environment details before running any other cells.\n",
"\n",
"PROJECT_ID = \"YOUR_PROJECT_ID\"\n",
"GCSFUSE_VERSION = \"YOUR_GCSFUSE_VERSION\" # Example: v3.5.6\n",
"BUCKET_NAME = \"YOUR_GCS_BUCKET\"\n",
"BQ_DATASET_ID = \"YOUR_BQ_DATASET_ID\"\n",
"\n",
"# GCE Specific Variables\n",
"VM_NAME = \"YOUR_VM_NAME\"\n",
"VM_ZONE = \"YOUR_VM_ZONE\" # e.g., us-central1-a"
],
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"id": "807a6d10",
"cell_type": "markdown",
"source": [
"## Step 1: Build Benchmark Images\n",
"This will use Google Cloud Build to construct the Docker images required for testing and upload them to Artifact Registry."
],
"metadata": {}
},
{
"id": "b21dfe48",
"cell_type": "code",
"source": [
"!gcloud config set project {PROJECT_ID}\n",
"\n",
"# Enable Artifact Registry API and create the repository (errors ignored if it already exists)\n",
"!gcloud services enable artifactregistry.googleapis.com --project={PROJECT_ID}\n",
"!gcloud artifacts repositories create gcsfuse-benchmarks --repository-format=docker --location=us --project={PROJECT_ID} || echo \"Repository might already exist.\"\n",
"\n",
"# Build the images using the Makefile\n",
"!make build PROJECT={PROJECT_ID} GCSFUSE_VERSION={GCSFUSE_VERSION}\n",
"\n",
"# Verify Images\n",
"!gcloud artifacts docker images list us-docker.pkg.dev/{PROJECT_ID}/gcsfuse-benchmarks"
],
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"id": "276923c2",
"cell_type": "markdown",
"source": [
"## Step 2: Configure the Target GCE VM\n",
"Ensure the remote VM has the necessary tools (Docker, python3, lscpu) and copy the benchmarking tools over."
],
"metadata": {}
},
{
"id": "79018b83",
"cell_type": "code",
"source": [
"# Install system dependencies\n",
"!gcloud compute ssh {VM_NAME} --zone={VM_ZONE} --project={PROJECT_ID} --command=\"sudo apt-get update && sudo apt-get install -y util-linux python3 docker.io\"\n",
"\n",
"# Ensure the user has permissions to run docker without sudo\n",
"!gcloud compute ssh {VM_NAME} --zone={VM_ZONE} --project={PROJECT_ID} --command=\"sudo usermod -aG docker \\$USER\"\n",
"\n",
"# Authenticate docker to artifact registry so the VM can pull the benchmark images\n",
"!gcloud compute ssh {VM_NAME} --zone={VM_ZONE} --project={PROJECT_ID} --command=\"gcloud auth configure-docker us-docker.pkg.dev --quiet\"\n",
"\n",
"# Copy the benchmarking code to the VM\n",
"!gcloud compute scp --recurse ../npi ../fio {VM_NAME}:~ --zone={VM_ZONE} --project={PROJECT_ID}"
],
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"id": "d6f4cb37",
"cell_type": "markdown",
"source": [
"## Step 3: Run the Benchmarks\n",
"Execute the `npi.py` script remotely on the VM. This script automatically downloads the images and orchestrates the benchmark test matrix sequentially."
],
"metadata": {}
},
{
"id": "da78aa92",
"cell_type": "code",
"source": [
"# We use 'sg docker -c' to ensure the user's new docker group membership is picked up without requiring them to log out and back in\n",
"benchmark_cmd = f\"cd npi && sg docker -c 'python3 npi.py --benchmarks all --bucket-name {BUCKET_NAME} --project-id {PROJECT_ID} --bq-dataset-id {BQ_DATASET_ID} --gcsfuse-version {GCSFUSE_VERSION}'\"\n",
"\n",
"print(f\"Executing: {benchmark_cmd} on {VM_NAME}...\")\n",
"\n",
"!gcloud compute ssh {VM_NAME} --zone={VM_ZONE} --project={PROJECT_ID} --command=\"{benchmark_cmd}\""
],
"metadata": {},
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat_minor": 5,
"nbformat": 4
}