GoogleCloudPlatform · kislaykishore · Apr 10, 2026
diff --git a/npi/bq_queries.md b/npi/bq_queries.md
@@ -0,0 +1,95 @@
+# BigQuery Performance Analysis Queries
+
+After running the FIO NPI benchmarks and collecting the data in BigQuery, you can use the following queries to extract throughput and latency characteristics.
+
+The benchmark runner uploads the full FIO JSON output into a native `JSON` column named `fio_json_output`. We can use BigQuery's native JSON accessors (e.g., `fio_json_output.jobs[0].read.bw`) to extract the metrics.
+
+*Note: In FIO's JSON output, `bw` is reported in KiB/s, and completion latency `clat_ns.mean` is reported in nanoseconds.*
+
+## 1. Extract Raw Throughput and Latency per Iteration
+
+This query retrieves the read and write throughput (in MiB/s) and the mean completion latency (in ms) for every iteration of your benchmark.
+
+```sql
+SELECT
+  run_timestamp,
+  iteration,
+  fio_env,
+
+  -- Throughput (FIO 'bw' is in KiB/s -> convert to MiB/s)
+  FLOAT64(fio_json_output.jobs[0].read.bw) / 1024 AS read_throughput_mib_s,
+  FLOAT64(fio_json_output.jobs[0].write.bw) / 1024 AS write_throughput_mib_s,
+
+  -- Latency (FIO 'clat_ns.mean' is in nanoseconds -> convert to ms)
+  FLOAT64(fio_json_output.jobs[0].read.clat_ns.mean) / 1000000.0 AS read_clat_mean_ms,
+  FLOAT64(fio_json_output.jobs[0].write.clat_ns.mean) / 1000000.0 AS write_clat_mean_ms
+
+FROM
+  `YOUR_PROJECT_ID.YOUR_BQ_DATASET_ID.YOUR_TABLE_ID`
+ORDER BY
+  run_timestamp DESC,
+  iteration ASC;
+```
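+
+The `jobs[0]` accessor above reads only the first entry of the `jobs` array. If a run reports more than one job (for example, several FIO jobs without `group_reporting`), a sketch like the following sums throughput across all entries instead. It assumes that `run_timestamp`, `iteration`, and `fio_env` together identify one uploaded FIO result, which may not hold for your table.
+
+```sql
+SELECT
+  run_timestamp,
+  iteration,
+  fio_env,
+
+  -- Number of entries found in the FIO 'jobs' array for this row
+  COUNT(*) AS job_count,
+
+  -- Sum per-job bandwidth (KiB/s) and convert to MiB/s
+  SUM(FLOAT64(job.read.bw)) / 1024 AS total_read_throughput_mib_s,
+  SUM(FLOAT64(job.write.bw)) / 1024 AS total_write_throughput_mib_s
+
+FROM
+  `YOUR_PROJECT_ID.YOUR_BQ_DATASET_ID.YOUR_TABLE_ID`,
+  -- Expand the JSON array into one row per FIO job
+  UNNEST(JSON_QUERY_ARRAY(fio_json_output.jobs)) AS job
+GROUP BY
+  run_timestamp, iteration, fio_env
+ORDER BY
+  run_timestamp DESC,
+  iteration ASC;
+```
+
+If `job_count` is always 1, the simpler `jobs[0]` form is sufficient.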
+
+## 2. Average Performance Across All Iterations (Grouped by Environment)
+
+This query is useful when you have run multiple iterations (e.g., `--iterations=5`) and want the average throughput and latency grouped by the FIO environment variables (e.g., block size, thread count).
+
+```sql
+SELECT
+  fio_env,
+
+  -- Average Read Metrics
+  AVG(FLOAT64(fio_json_output.jobs[0].read.bw)) / 1024 AS avg_read_throughput_mib_s,
+  AVG(FLOAT64(fio_json_output.jobs[0].read.clat_ns.mean)) / 1000000.0 AS avg_read_clat_mean_ms,
+
+  -- Average Write Metrics
+  AVG(FLOAT64(fio_json_output.jobs[0].write.bw)) / 1024 AS avg_write_throughput_mib_s,
+  AVG(FLOAT64(fio_json_output.jobs[0].write.clat_ns.mean)) / 1000000.0 AS avg_write_clat_mean_ms,
+
+  -- Count the number of iterations that made up this average
+  COUNT(iteration) AS iteration_count
+
+FROM
+  `YOUR_PROJECT_ID.YOUR_BQ_DATASET_ID.YOUR_TABLE_ID`
+GROUP BY
+  fio_env
+ORDER BY
+  avg_read_throughput_mib_s DESC;
+```
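+
+Mean latency can hide tail behavior. FIO's JSON output normally also includes completion-latency percentiles under `clat_ns.percentile`, keyed by strings such as `99.000000`. The sketch below assumes those percentiles are present in your uploaded output (they are missing if the job disabled them, in which case the columns are `NULL`). It uses BigQuery's JSON subscript syntax because the key contains a dot. Averaging per-iteration p99 values summarizes the iterations; it is not a true p99 over all I/Os.
+
+```sql
+SELECT
+  fio_env,
+
+  -- Per-iteration read p99 completion latency (ns -> ms), averaged and worst case
+  AVG(FLOAT64(fio_json_output.jobs[0].read.clat_ns.percentile['99.000000'])) / 1000000.0 AS avg_read_clat_p99_ms,
+  MAX(FLOAT64(fio_json_output.jobs[0].read.clat_ns.percentile['99.000000'])) / 1000000.0 AS max_read_clat_p99_ms,
+
+  COUNT(iteration) AS iteration_count
+
+FROM
+  `YOUR_PROJECT_ID.YOUR_BQ_DATASET_ID.YOUR_TABLE_ID`
+GROUP BY
+  fio_env
+ORDER BY
+  avg_read_clat_p99_ms DESC;
+```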
+
+## 3. Compare Two Different Tables (e.g., HTTP/1.1 vs gRPC)
+
+If you have written HTTP/1.1 results to one table and gRPC results to another, you can use `UNION ALL` to compare them side by side.
+
+```sql
+WITH combined_results AS (
+  SELECT
+    'HTTP/1.1' AS protocol,
+    fio_env,
+    FLOAT64(fio_json_output.jobs[0].read.bw) / 1024 AS read_throughput_mib_s
+  FROM
+    `YOUR_PROJECT_ID.YOUR_BQ_DATASET_ID.fio_read_http1`
+
+  UNION ALL
+
+  SELECT
+    'gRPC' AS protocol,
+    fio_env,
+    FLOAT64(fio_json_output.jobs[0].read.bw) / 1024 AS read_throughput_mib_s
+  FROM
+    `YOUR_PROJECT_ID.YOUR_BQ_DATASET_ID.fio_read_grpc`
+)
+
+SELECT
+  protocol,
+  fio_env,
+  AVG(read_throughput_mib_s) AS avg_read_throughput_mib_s,
+  APPROX_QUANTILES(read_throughput_mib_s, 100)[OFFSET(50)] AS median_read_throughput_mib_s
+FROM
+  combined_results
+GROUP BY
+  protocol, fio_env
+ORDER BY
+  fio_env, protocol;
+```
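+
+To put both protocols on one row per `fio_env`, conditional aggregation can replace the grouping by `protocol`. This is a sketch that reuses the example table names from the query above; the output column names, including `grpc_to_http1_ratio`, are illustrative.
+
+```sql
+WITH combined_results AS (
+  SELECT
+    'HTTP/1.1' AS protocol,
+    fio_env,
+    FLOAT64(fio_json_output.jobs[0].read.bw) / 1024 AS read_throughput_mib_s
+  FROM
+    `YOUR_PROJECT_ID.YOUR_BQ_DATASET_ID.fio_read_http1`
+
+  UNION ALL
+
+  SELECT
+    'gRPC' AS protocol,
+    fio_env,
+    FLOAT64(fio_json_output.jobs[0].read.bw) / 1024 AS read_throughput_mib_s
+  FROM
+    `YOUR_PROJECT_ID.YOUR_BQ_DATASET_ID.fio_read_grpc`
+)
+
+SELECT
+  fio_env,
+  -- IF(...) yields NULL for the other protocol, and AVG ignores NULLs
+  AVG(IF(protocol = 'HTTP/1.1', read_throughput_mib_s, NULL)) AS http1_avg_read_mib_s,
+  AVG(IF(protocol = 'gRPC', read_throughput_mib_s, NULL)) AS grpc_avg_read_mib_s,
+  -- SAFE_DIVIDE returns NULL instead of failing when the HTTP/1.1 average is 0
+  SAFE_DIVIDE(
+    AVG(IF(protocol = 'gRPC', read_throughput_mib_s, NULL)),
+    AVG(IF(protocol = 'HTTP/1.1', read_throughput_mib_s, NULL))
+  ) AS grpc_to_http1_ratio
+FROM
+  combined_results
+GROUP BY
+  fio_env
+ORDER BY
+  fio_env;
+```
+
+A ratio above 1 means gRPC averaged higher read throughput than HTTP/1.1 for that `fio_env`.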
diff --git a/npi/gce_npi.md b/npi/gce_npi.md
@@ -0,0 +1,126 @@
+# Running NPI Benchmarks on GCE
+
+This guide explains how to build the required Docker images and run the NPI (Network Performance Improvement) benchmarks on a Google Compute Engine (GCE) VM.
+
+> **Note to Operators / Vendors:** Please ensure you have completed the prerequisite steps and gathered all required variables before executing the scripts.
+
+## Variables Glossary
+
+Before starting, gather the following information. You will need to substitute these placeholders in the commands throughout this guide:
+*   `YOUR_PROJECT_ID`: The GCP project ID where your resources (Artifact Registry, GCS, BigQuery) reside.
+*   `YOUR_GCS_BUCKET`: The GCS bucket name used for reading/writing test data (e.g., `my-benchmark-bucket` — omit the `gs://` prefix).
+*   `YOUR_BQ_DATASET_ID`: The BigQuery dataset where the benchmark results will be inserted (e.g., `npi_results`).
+*   `YOUR_GCSFUSE_VERSION`: The GCSFuse version tag to test (e.g., `v3.5.6`).
+
+## Prerequisites
+
+1.  **Google Cloud Project**: You need a GCP project to host your Artifact Registry, Cloud Storage bucket, and BigQuery dataset.
+2.  **VM Setup**: A GCE VM with Docker installed. The VM's service account must have the following scopes:
+    *   `https://www.googleapis.com/auth/bigquery`
+    *   `https://www.googleapis.com/auth/devstorage.read_write`
+    *   `https://www.googleapis.com/auth/cloud-platform` (or granular scopes)
+3.  **System Utilities**: The `npi.py` script requires `python3`. Additionally, the `lscpu` command-line utility is required for NUMA-aware benchmarks. These are usually pre-installed, but you can ensure they are present by installing the `util-linux` and `python3` packages:
+    ```bash
+    sudo apt-get update && sudo apt-get install -y util-linux python3
+    ```
+4.  **Authentication**: If not using a VM service account, you must authenticate using `gcloud auth login` and `gcloud auth application-default login`.
+5.  **BigQuery Dataset**: Create a BigQuery dataset to store the benchmark results.
+6.  **Artifact Registry**: You must have an Artifact Registry repository to host the Docker images.
+    ```bash
+    gcloud services enable artifactregistry.googleapis.com --project=YOUR_PROJECT_ID
+    gcloud artifacts repositories create gcsfuse-benchmarks \
+        --repository-format=docker \
+        --location=us \
+        --project=YOUR_PROJECT_ID
+    ```
+
+## Step 1: Build the Benchmark Images
+
+The NPI benchmarks use Docker containers to provide an isolated environment for running tests. You can build and push these images to your Artifact Registry using the provided `Makefile`.
+
+By default, the `Makefile` builds for project `gcs-fuse-test`. You should either edit the `Makefile` or pass your project ID to the `make` command:
+
+```bash
+cd gcsfuse-tools/npi
+
+# Option 1: Override project and version in the make command
+make build PROJECT=YOUR_PROJECT_ID GCSFUSE_VERSION=YOUR_GCSFUSE_VERSION
+
+# Example:
+# make build PROJECT=my-project GCSFUSE_VERSION=v3.5.6
+
+# Option 2: Edit Makefile directly to change the PROJECT variable, then run:
+make build
+```
+
+This will trigger a Cloud Build job (`cloudbuild.yaml`) that builds the `fio-read-benchmark`, `fio-write-benchmark`, and `orbax-emulated-benchmark` images and pushes them to `us-docker.pkg.dev/YOUR_PROJECT_ID/gcsfuse-benchmarks/`.
+
+> **Verification:** Once the `make build` command finishes, verify the images exist in your Artifact Registry before proceeding to the next step:
+> ```bash
+> gcloud artifacts docker images list us-docker.pkg.dev/YOUR_PROJECT_ID/gcsfuse-benchmarks
+> ```
+
+## Step 2: Run the Benchmarks
+
+Once the images are built, you can use the `npi.py` script to orchestrate the benchmark runs. The script automatically generates the correct `docker run` commands based on the desired benchmarks and configurations.
+
+### Basic Usage
+
+Run all available benchmarks:
+
+```bash
+python3 npi.py \
+    --benchmarks 'all' \
+    --bucket-name YOUR_GCS_BUCKET \
+    --project-id YOUR_PROJECT_ID \
+    --bq-dataset-id YOUR_BQ_DATASET_ID \
+    --gcsfuse-version YOUR_GCSFUSE_VERSION
+
+# Example:
+# python3 npi.py --benchmarks 'all' --bucket-name my-bucket --project-id my-project --bq-dataset-id my_dataset --gcsfuse-version v3.5.6
+```
+
+### Specifying Benchmarks
+
+You can specify a space-separated list of specific benchmarks to run. For example, to run only the `read_http1` and `write_grpc` benchmarks:
+
+```bash
+python3 npi.py \
+    --benchmarks read_http1 write_grpc \
+    --bucket-name YOUR_GCS_BUCKET \
+    --project-id YOUR_PROJECT_ID \
+    --bq-dataset-id YOUR_BQ_DATASET_ID \
+    --gcsfuse-version YOUR_GCSFUSE_VERSION
+```
+
+### Understanding the Parameters
+
+*   `--benchmarks`: `all`, or a space-separated list of specific benchmarks (e.g., `read_http1`, `orbax_read_grpc_numa0_fio_bound`).
+*   `--bucket-name`: The GCS bucket where FIO will read/write data.
+*   `--project-id`: The GCP Project ID containing your BigQuery dataset.
+*   `--bq-dataset-id`: The BigQuery dataset where results will be inserted.
+*   `--gcsfuse-version`: The version tag used when building the images (e.g., `v3.5.6`).
+*   `--iterations`: (Optional) Number of FIO test iterations. Default is 5.
+*   `--temp-dir`: (Optional) FUSE temp directory type (`boot-disk` or `memory`).
+
+### Dry Run
+
+To see exactly what `docker run` commands the script will execute without actually running them, add the `--dry-run` flag:
+
+```bash
+python3 npi.py \
+    --benchmarks read_http1 \
+    --bucket-name YOUR_GCS_BUCKET \
+    --project-id YOUR_PROJECT_ID \
+    --bq-dataset-id YOUR_BQ_DATASET_ID \
+    --gcsfuse-version YOUR_GCSFUSE_VERSION \
+    --dry-run
+```
+
+> **Troubleshooting Tip:** If `npi.py` fails with permission errors, ensure that you have run `gcloud auth application-default login` or that your VM service account has the required scopes (`cloud-platform` and `bigquery`).
+
+## Step 3: Analyze Results
+
+Once the benchmarks complete, the results are written automatically to your BigQuery dataset (each benchmark gets its own table).
+
+To extract useful performance characteristics such as throughput (MiB/s) and latency (ms) from the raw FIO JSON in BigQuery, refer to the [BigQuery Performance Analysis Queries](bq_queries.md) guide.
diff --git a/npi/gce_npi_playbook.ipynb b/npi/gce_npi_playbook.ipynb
@@ -0,0 +1,135 @@
+{
+ "cells": [
+  {
+   "id": "3cb87511",
+   "cell_type": "markdown",
+   "source": [
+    "# GCE NPI Benchmark Playbook\n",
+    "\n",
+    "This runnable playbook guides you through building benchmark images and remotely orchestrating NPI benchmarks on a Google Compute Engine (GCE) VM.\n",
+    "\n",
+    "Unlike GKE, which runs Jobs natively, GCE requires running `npi.py`, which launches Docker containers on the target machine. This notebook executes those commands remotely via `gcloud compute ssh`.\n",
+    "\n",
+    "### Instructions\n",
+    "1. Ensure your GCE VM is already created and running.\n",
+    "2. Run each cell sequentially.\n",
+    "3. If a cell fails, stop and resolve the error before continuing."
+   ],
+   "metadata": {}
+  },
+  {
+   "id": "941d6c12",
+   "cell_type": "code",
+   "source": [
+    "# --- CONFIGURATION VARIABLES ---\n",
+    "# Replace these with your actual environment details before running any other cells.\n",
+    "\n",
+    "PROJECT_ID = \"YOUR_PROJECT_ID\"\n",
+    "GCSFUSE_VERSION = \"v3.5.6\"  # GCSFuse version tag to build and test\n",
+    "BUCKET_NAME = \"YOUR_BUCKET_NAME\"\n",
+    "BQ_DATASET_ID = \"YOUR_BQ_DATASET_ID\"\n",
+    "\n",
+    "# GCE Specific Variables\n",
+    "VM_NAME = \"YOUR_VM_NAME\"\n",
+    "VM_ZONE = \"YOUR_VM_ZONE\" # e.g., us-central1-a"
+   ],
+   "metadata": {},
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "id": "807a6d10",
+   "cell_type": "markdown",
+   "source": [
+    "## Step 1: Build Benchmark Images\n",
+    "This will use Google Cloud Build to construct the Docker images required for testing and upload them to Artifact Registry."
+   ],
+   "metadata": {}
+  },
+  {
+   "id": "b21dfe48",
+   "cell_type": "code",
+   "source": [
+    "!gcloud config set project {PROJECT_ID}\n",
+    "\n",
+    "# Enable Artifact Registry API and create the repository (errors ignored if it already exists)\n",
+    "!gcloud services enable artifactregistry.googleapis.com --project={PROJECT_ID}\n",
+    "!gcloud artifacts repositories create gcsfuse-benchmarks --repository-format=docker --location=us --project={PROJECT_ID} || echo \"Repository might already exist.\"\n",
+    "\n",
+    "# Build the images using the Makefile\n",
+    "!make build PROJECT={PROJECT_ID} GCSFUSE_VERSION={GCSFUSE_VERSION}\n",
+    "\n",
+    "# Verify Images\n",
+    "!gcloud artifacts docker images list us-docker.pkg.dev/{PROJECT_ID}/gcsfuse-benchmarks"
+   ],
+   "metadata": {},
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "id": "276923c2",
+   "cell_type": "markdown",
+   "source": [
+    "## Step 2: Configure the Target GCE VM\n",
+    "Ensure the remote VM has the necessary tools (Docker, python3, lscpu) and copy the benchmarking tools over."
+   ],
+   "metadata": {}
+  },
+  {
+   "id": "79018b83",
+   "cell_type": "code",
+   "source": [
+    "# Install system dependencies\n",
+    "!gcloud compute ssh {VM_NAME} --zone={VM_ZONE} --project={PROJECT_ID} --command=\"sudo apt-get update && sudo apt-get install -y util-linux python3 docker.io\"\n",
+    "\n",
+    "# Ensure the user has permissions to run docker without sudo\n",
+    "!gcloud compute ssh {VM_NAME} --zone={VM_ZONE} --project={PROJECT_ID} --command=\"sudo usermod -aG docker \\$USER\"\n",
+    "\n",
+    "# Authenticate docker to artifact registry so the VM can pull the benchmark images\n",
+    "!gcloud compute ssh {VM_NAME} --zone={VM_ZONE} --project={PROJECT_ID} --command=\"gcloud auth configure-docker us-docker.pkg.dev --quiet\"\n",
+    "\n",
+    "# Copy the benchmarking code to the VM\n",
+    "!gcloud compute scp --recurse ../npi ../fio {VM_NAME}:~ --zone={VM_ZONE} --project={PROJECT_ID}"
+   ],
+   "metadata": {},
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "id": "d6f4cb37",
+   "cell_type": "markdown",
+   "source": [
+    "## Step 3: Run the Benchmarks\n",
+    "Execute the `npi.py` script remotely on the VM. This script automatically downloads the images and orchestrates the benchmark test matrix sequentially."
+   ],
+   "metadata": {}
+  },
+  {
+   "id": "da78aa92",
+   "cell_type": "code",
+   "source": [
+    "# We use 'sg docker -c' to ensure the user's new docker group membership is picked up without requiring them to log out and back in\n",
+    "benchmark_cmd = f\"cd npi && sg docker -c 'python3 npi.py --benchmarks all --bucket-name {BUCKET_NAME} --project-id {PROJECT_ID} --bq-dataset-id {BQ_DATASET_ID} --gcsfuse-version {GCSFUSE_VERSION}'\"\n",
+    "\n",
+    "print(f\"Executing: {benchmark_cmd} on {VM_NAME}...\")\n",
+    "\n",
+    "!gcloud compute ssh {VM_NAME} --zone={VM_ZONE} --project={PROJECT_ID} --command=\"{benchmark_cmd}\""
+   ],
+   "metadata": {},
+   "execution_count": null,
+   "outputs": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat_minor": 5,
+ "nbformat": 4
+}