Draft
16 commits
2e4ef40
docs: add GCE and GKE documentation for running NPI benchmarks
kislaykishore Apr 10, 2026
db7dd1e
docs: use roles/storage.objectUser instead of objectAdmin
kislaykishore Apr 10, 2026
017be3e
docs: remove unnecessary privileged: true from GKE pod spec
kislaykishore Apr 10, 2026
b14c79f
docs: add lscpu and python3 to prerequisites in GCE docs
kislaykishore Apr 10, 2026
57a1d6b
docs: replace hardcoded v3.5.6 with YOUR_GCSFUSE_VERSION template
kislaykishore Apr 10, 2026
83c7e0c
docs: extract GKE pod specs to gke_pod_specs directory
kislaykishore Apr 10, 2026
d312302
docs: add run_gke_benchmarks.sh to orchestrate sequential pod execution
kislaykishore Apr 10, 2026
b26ed99
refactor: use Kubernetes Jobs instead of Pods for GKE benchmarks
kislaykishore Apr 10, 2026
61a05fe
docs: add bigquery analysis scripts for extracting throughput and lat…
kislaykishore Apr 10, 2026
418c2ba
docs: add GKE cluster setup requirements (node pools and node-selectors)
kislaykishore Apr 10, 2026
3005701
docs: improve clarity and make instructions production-ready for vendors
kislaykishore Apr 10, 2026
df359cf
docs: explicitly state Workload Identity and GCS FUSE CSI driver requ…
kislaykishore Apr 10, 2026
222bc10
docs: add get-credentials step for authenticating to GKE cluster
kislaykishore Apr 10, 2026
a6e0e5c
feat: add executable Jupyter Notebook playbook for GKE benchmarks
kislaykishore Apr 10, 2026
95f07d3
feat: add executable Jupyter Notebook playbook for GCE benchmarks
kislaykishore Apr 10, 2026
e0d0447
feat: add CLI bash scripts for automated playbook execution
kislaykishore Apr 10, 2026
93 changes: 93 additions & 0 deletions npi/gce_npi.md
@@ -0,0 +1,93 @@
# Running NPI Benchmarks on GCE

This guide explains how to build the required Docker images and run the NPI (Network Performance Improvement) benchmarks on a Google Compute Engine (GCE) VM.

## Prerequisites

1. **Google Cloud Project**: You need a GCP project to host your Artifact Registry, Cloud Storage bucket, and BigQuery dataset.
2. **VM Setup**: A GCE VM with Docker installed. The VM must be created with access scopes that let its service account reach BigQuery and Cloud Storage. Use either the broad `https://www.googleapis.com/auth/cloud-platform` scope or both of the following granular scopes:
* `https://www.googleapis.com/auth/bigquery`
* `https://www.googleapis.com/auth/devstorage.read_write`
3. **Authentication**: If not using a VM service account, you must authenticate using `gcloud auth login` and `gcloud auth application-default login`.
4. **BigQuery Dataset**: Create a BigQuery dataset to store the benchmark results.
5. **Artifact Registry**: You must have an Artifact Registry repository to host the Docker images.
```bash
gcloud services enable artifactregistry.googleapis.com --project=YOUR_PROJECT_ID
gcloud artifacts repositories create gcsfuse-benchmarks \
--repository-format=docker \
--location=us \
--project=YOUR_PROJECT_ID
```
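
Before moving on, it can help to confirm that the command-line tools this guide invokes are actually installed on the VM. The following is a minimal sketch, not part of the repository: the tool list and the `missing_tools` helper are assumptions made for illustration, so adjust them to your setup.

```python
import shutil

# Tools invoked in this guide (an assumed list; extend as needed).
REQUIRED_TOOLS = ("docker", "python3", "gcloud", "make")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All prerequisite tools found on PATH.")
```

This only checks that the executables exist. It does not verify access scopes, authentication, or that the Docker daemon is running.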

## Step 1: Build the Benchmark Images

The NPI benchmarks use Docker containers to provide an isolated environment for running tests. You can build and push these images to your Artifact Registry using the provided `Makefile`.

By default, the `Makefile` builds for project `gcs-fuse-test`. You should either edit the `Makefile` or pass your project ID to the `make` command:

```bash
cd gcsfuse-tools/npi

# Option 1: Override project and version in the make command
make build PROJECT=YOUR_PROJECT_ID GCSFUSE_VERSION=v3.5.6

# Option 2: Edit Makefile directly to change the PROJECT variable, then run:
make build
```

This will trigger a Cloud Build job (`cloudbuild.yaml`) that builds the `fio-read-benchmark`, `fio-write-benchmark`, and `orbax-emulated-benchmark` images and pushes them to `us-docker.pkg.dev/YOUR_PROJECT_ID/gcsfuse-benchmarks/`.

## Step 2: Run the Benchmarks

Once the images are built, you can use the `npi.py` script to orchestrate the benchmark runs. The script automatically generates the correct `docker run` commands based on the desired benchmarks and configurations.

### Basic Usage

Run all available benchmarks:

```bash
python3 npi.py \
--benchmarks 'all' \
--bucket-name YOUR_GCS_BUCKET \
--project-id YOUR_PROJECT_ID \
--bq-dataset-id YOUR_BQ_DATASET_ID \
--gcsfuse-version v3.5.6
```

### Specifying Benchmarks

You can specify a space-separated list of specific benchmarks to run. For example, to run only the `read_http1` and `write_grpc` benchmarks:

```bash
python3 npi.py \
--benchmarks read_http1 write_grpc \
--bucket-name YOUR_GCS_BUCKET \
--project-id YOUR_PROJECT_ID \
--bq-dataset-id YOUR_BQ_DATASET_ID \
--gcsfuse-version v3.5.6
```

### Understanding the Parameters

* `--benchmarks`: `all`, or a space-separated list of specific benchmarks (e.g., `read_http1`, `orbax_read_grpc_numa0_fio_bound`).
* `--bucket-name`: The GCS bucket where FIO will read/write data.
* `--project-id`: The GCP Project ID containing your BigQuery dataset.
* `--bq-dataset-id`: The BigQuery dataset where results will be inserted.
* `--gcsfuse-version`: The version tag used when building the images (e.g., `v3.5.6`).
* `--iterations`: (Optional) Number of FIO test iterations. Default is 5.
* `--temp-dir`: (Optional) FUSE temp directory type (`boot-disk` or `memory`).

### Dry Run

To see exactly what `docker run` commands the script will execute without actually running them, add the `--dry-run` flag:

```bash
python3 npi.py \
--benchmarks read_http1 \
--bucket-name YOUR_GCS_BUCKET \
--project-id YOUR_PROJECT_ID \
--bq-dataset-id YOUR_BQ_DATASET_ID \
--gcsfuse-version v3.5.6 \
--dry-run
```
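
If you launch runs from another Python script, the flags described in this guide can be assembled programmatically. This is a hedged sketch: `npi_command` is a hypothetical helper, not something `npi.py` provides, and it only builds the argument list from the flags documented here.

```python
import shlex


def npi_command(benchmarks, bucket_name, project_id, bq_dataset_id,
                gcsfuse_version, iterations=None, temp_dir=None, dry_run=False):
    """Build the argv list for npi.py using only the flags documented above."""
    argv = ["python3", "npi.py", "--benchmarks", *benchmarks,
            "--bucket-name", bucket_name,
            "--project-id", project_id,
            "--bq-dataset-id", bq_dataset_id,
            "--gcsfuse-version", gcsfuse_version]
    if iterations is not None:
        argv += ["--iterations", str(iterations)]
    if temp_dir is not None:
        # The guide lists exactly these two values for --temp-dir.
        if temp_dir not in ("boot-disk", "memory"):
            raise ValueError("temp_dir must be 'boot-disk' or 'memory'")
        argv += ["--temp-dir", temp_dir]
    if dry_run:
        argv.append("--dry-run")
    return argv


cmd = npi_command(["read_http1", "write_grpc"], "YOUR_GCS_BUCKET",
                  "YOUR_PROJECT_ID", "YOUR_BQ_DATASET_ID", "v3.5.6",
                  dry_run=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the `npi` directory. Starting with `dry_run=True` lets you inspect the generated `docker run` commands first.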
157 changes: 157 additions & 0 deletions npi/gke_npi.md
@@ -0,0 +1,157 @@
# Running NPI Benchmarks on GKE

This guide explains how to build and run the NPI (Network Performance Improvement) benchmarks on Google Kubernetes Engine (GKE). Unlike on GCE, where `npi.py` orchestrates the Docker runs automatically, on GKE we manually define and deploy Pods that use the benchmark images and rely on the GKE GCS FUSE CSI driver to mount the bucket.

## Step 1: Build the Benchmark Images

The process for building the Docker images is identical to GCE. We use Cloud Build via the provided `Makefile`.

1. Ensure your Artifact Registry repository is created:
```bash
gcloud services enable artifactregistry.googleapis.com --project=YOUR_PROJECT_ID
gcloud artifacts repositories create gcsfuse-benchmarks \
--repository-format=docker \
--location=us \
--project=YOUR_PROJECT_ID
```

2. Build and push the images:
```bash
cd gcsfuse-tools/npi
make build PROJECT=YOUR_PROJECT_ID GCSFUSE_VERSION=v3.5.6
```
This creates images such as `us-docker.pkg.dev/YOUR_PROJECT_ID/gcsfuse-benchmarks/fio-read-benchmark-v3.5.6:latest`.
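
The image path follows a regular pattern, so it can be derived rather than typed by hand in each Pod spec. The sketch below assumes the naming shown in the example above (`IMAGE_NAME-GCSFUSE_VERSION:latest`); `image_uri` is a hypothetical helper, and you should confirm the pattern against what Cloud Build actually pushed to your repository.

```python
REGISTRY_HOST = "us-docker.pkg.dev"  # corresponds to --location=us
REPOSITORY = "gcsfuse-benchmarks"


def image_uri(project_id, image_name, gcsfuse_version, tag="latest"):
    """Compose an image URI using the naming pattern from the example above."""
    return (f"{REGISTRY_HOST}/{project_id}/{REPOSITORY}/"
            f"{image_name}-{gcsfuse_version}:{tag}")


print(image_uri("YOUR_PROJECT_ID", "fio-read-benchmark", "v3.5.6"))
```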

## Step 2: Configure Workload Identity (Permissions)

To allow your GKE Pods to access the GCS bucket and write metrics to BigQuery, you should use **Workload Identity**. This links a Kubernetes Service Account (KSA) to a Google Cloud Service Account (GSA).

1. **Create a Google Cloud Service Account (GSA):**
```bash
gcloud iam service-accounts create benchmark-gsa \
--project=YOUR_PROJECT_ID
```

2. **Grant the necessary roles to the GSA:**
The GSA needs permissions to read/write to the GCS bucket and to insert records into BigQuery.
```bash
# Grant BigQuery Data Editor
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
--member "serviceAccount:benchmark-gsa@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
--role "roles/bigquery.dataEditor"

# Grant Storage Object Admin (or a more restricted role) on your bucket
gcloud storage buckets add-iam-policy-binding gs://YOUR_BUCKET_NAME \
--member "serviceAccount:benchmark-gsa@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
--role "roles/storage.objectAdmin"
```

Review comment (Contributor, security-medium): The `roles/storage.objectAdmin` role is broader than these benchmarks need. In addition to reading, writing, and deleting objects, it allows managing object-level access control (object ACLs and IAM policies). `roles/storage.objectUser` is a more secure alternative that provides full access to objects (read, write, delete) without those administrative permissions, adhering to the principle of least privilege.

Suggested change: replace `--role "roles/storage.objectAdmin"` with `--role "roles/storage.objectUser"`.

3. **Create a Kubernetes Service Account (KSA):**
```bash
kubectl create serviceaccount benchmark-ksa \
--namespace default
```

4. **Bind the KSA to the GSA:**
```bash
gcloud iam service-accounts add-iam-policy-binding benchmark-gsa@YOUR_PROJECT_ID.iam.gserviceaccount.com \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:YOUR_PROJECT_ID.svc.id.goog[default/benchmark-ksa]"
```

5. **Annotate the KSA:**
```bash
kubectl annotate serviceaccount benchmark-ksa \
--namespace default \
iam.gke.io/gcp-service-account=benchmark-gsa@YOUR_PROJECT_ID.iam.gserviceaccount.com
```
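
The binding and annotation commands above depend on three strings being formatted exactly, and a typo in any of them typically shows up later as a permissions error. As a rough sketch, the helpers below (hypothetical names, not part of the repository) derive those strings from the project ID and the account names used in this guide.

```python
def gsa_email(project_id, gsa_name="benchmark-gsa"):
    """Email address of the Google Cloud Service Account."""
    return f"{gsa_name}@{project_id}.iam.gserviceaccount.com"


def workload_identity_member(project_id, namespace="default",
                             ksa_name="benchmark-ksa"):
    """Value for --member when granting roles/iam.workloadIdentityUser."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def ksa_annotation(project_id, gsa_name="benchmark-gsa"):
    """Annotation passed to `kubectl annotate serviceaccount`."""
    return f"iam.gke.io/gcp-service-account={gsa_email(project_id, gsa_name)}"


print(workload_identity_member("YOUR_PROJECT_ID"))
print(ksa_annotation("YOUR_PROJECT_ID"))
```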

## Step 3: Run the Benchmarks as Pods

In GKE, we don't use `npi.py`. Instead, we deploy Kubernetes Pods. The bucket is mounted by the **GKE GCS FUSE CSI driver**, and we pass the mount path to the benchmark container.

### Important Considerations for GKE

* **NUMA Binding**: NUMA binding is not currently meaningful in GKE, so exclude any NUMA-bound benchmarks (i.e., skip anything with `numa0` or `numa1` in the name).
* **gRPC Benchmarks**: When running gRPC benchmarks, you **must** change the `mountOptions` in the Pod's volume definition to include `"client-protocol=grpc"`.
* **BigQuery Parameters**: The container needs `args` specifying where to send the metrics since it no longer inherits them from `npi.py`.
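
The first two considerations can be expressed as simple rules over benchmark names. The sketch below is illustrative only: it assumes that benchmark names encode the protocol as an underscore-separated token (`http1` or `grpc`), as in the names used in this guide, and both helper functions are hypothetical.

```python
NUMA_MARKERS = ("numa0", "numa1")


def gke_eligible(benchmarks):
    """Drop NUMA-bound benchmarks, which this guide says to skip on GKE."""
    return [b for b in benchmarks
            if not any(marker in b for marker in NUMA_MARKERS)]


def mount_options(benchmark):
    """Pick the CSI mountOptions string from the protocol token in the name."""
    tokens = benchmark.split("_")
    for protocol in ("grpc", "http1"):
        if protocol in tokens:
            return f"client-protocol={protocol}"
    raise ValueError(f"cannot infer client protocol from {benchmark!r}")


names = ["read_http1", "write_grpc", "orbax_read_grpc_numa0_fio_bound"]
for name in gke_eligible(names):
    print(name, mount_options(name))
```

Raising an error for names without a recognizable protocol token is a deliberate choice here, since silently falling back to a default would make a mislabeled run hard to spot.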

### Sample Pod Specification

Here is a sample `pod.yaml` for running the `read_http1` benchmark on a TPU node:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fio-bench-read-http1
  namespace: default
  annotations:
    gke-gcsfuse/volumes: "true"
spec:
  nodeSelector:
    cloud.google.com/gke-tpu-topology: 2x2
    cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
  restartPolicy: Never
  serviceAccountName: benchmark-ksa # KSA configured with Workload Identity
  containers:
  - name: fio
    image: us-docker.pkg.dev/YOUR_PROJECT_ID/gcsfuse-benchmarks/fio-read-benchmark-v3.5.6:latest
    args:
    - "--mount-path=/data"
    - "--iterations=5"
    - "--project-id=YOUR_PROJECT_ID"
    - "--bq-dataset-id=YOUR_BQ_DATASET_ID"
    - "--bq-table-id=fio_read_http1"
    securityContext:
      privileged: true
    volumeMounts:
    - name: data-vol
      mountPath: /data
  volumes:
  - name: data-vol
    csi:
      driver: gcsfuse.csi.storage.gke.io
      volumeAttributes:
        bucketName: "YOUR_BUCKET_NAME"
        mountOptions: "client-protocol=http1"
```

Review comment (Contributor, security-medium): The `privileged: true` setting is likely unnecessary and poses a security risk. Since the GKE GCS Fuse CSI driver handles the mounting process and the benchmark container simply accesses the data through a pre-mounted volume (`--mount-path=/data`), the container does not need elevated privileges. Removing this allows the Pod to run under more restrictive security policies (e.g., the GKE baseline or restricted Pod Security Standards).

Suggested change: remove the `securityContext:` and `privileged: true` lines, leaving `volumeMounts:` unchanged.

### Running a gRPC Benchmark

To run a gRPC benchmark (e.g., `read_grpc`), update the `args` to emit to the correct table, and critically, update the `mountOptions`:

```yaml
# ... metadata and spec ...
  containers:
  - name: fio
    image: us-docker.pkg.dev/YOUR_PROJECT_ID/gcsfuse-benchmarks/fio-read-benchmark-v3.5.6:latest
    args:
    - "--mount-path=/data"
    - "--iterations=5"
    - "--project-id=YOUR_PROJECT_ID"
    - "--bq-dataset-id=YOUR_BQ_DATASET_ID"
    - "--bq-table-id=fio_read_grpc"
    # ... volumeMounts ...
  volumes:
  - name: data-vol
    csi:
      driver: gcsfuse.csi.storage.gke.io
      volumeAttributes:
        bucketName: "YOUR_BUCKET_NAME"
        mountOptions: "client-protocol=grpc"
```
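
Because the two specs differ only in the Pod name, the BigQuery table, and the `mountOptions`, one option is to generate them from a template instead of editing YAML by hand. The following is a sketch under stated assumptions: `pod_manifest` is a hypothetical helper, the `fio_<benchmark>` table name and `fio-bench-<benchmark>` Pod name are inferred from the two examples in this guide, and the `nodeSelector` and `securityContext` fields are intentionally left out, so add a `nodeSelector` if your Pods must land on a specific node pool.

```python
import json


def pod_manifest(benchmark, image, bucket_name, project_id, bq_dataset_id,
                 protocol="http1", iterations=5, namespace="default",
                 ksa="benchmark-ksa"):
    """Return a Pod manifest (as a dict) mirroring the sample spec."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": "fio-bench-" + benchmark.replace("_", "-"),
            "namespace": namespace,
            "annotations": {"gke-gcsfuse/volumes": "true"},
        },
        "spec": {
            "restartPolicy": "Never",
            "serviceAccountName": ksa,
            "containers": [{
                "name": "fio",
                "image": image,
                "args": [
                    "--mount-path=/data",
                    f"--iterations={iterations}",
                    f"--project-id={project_id}",
                    f"--bq-dataset-id={bq_dataset_id}",
                    f"--bq-table-id=fio_{benchmark}",
                ],
                "volumeMounts": [{"name": "data-vol", "mountPath": "/data"}],
            }],
            "volumes": [{
                "name": "data-vol",
                "csi": {
                    "driver": "gcsfuse.csi.storage.gke.io",
                    "volumeAttributes": {
                        "bucketName": bucket_name,
                        "mountOptions": f"client-protocol={protocol}",
                    },
                },
            }],
        },
    }


manifest = pod_manifest(
    "read_grpc",
    "us-docker.pkg.dev/YOUR_PROJECT_ID/gcsfuse-benchmarks/fio-read-benchmark-v3.5.6:latest",
    "YOUR_BUCKET_NAME", "YOUR_PROJECT_ID", "YOUR_BQ_DATASET_ID",
    protocol="grpc",
)
print(json.dumps(manifest, indent=2))
```

`kubectl apply -f` accepts JSON as well as YAML, so the printed output can be saved to a file such as `pod.json` and applied directly.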

### Executing the Benchmark

Apply the YAML to your cluster to start the benchmark:

```bash
kubectl apply -f pod.yaml
```

Monitor the logs to ensure the benchmark finishes and metrics are published to BigQuery:

```bash
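# -c fio selects the benchmark container; the CSI driver injects a gcsfuse sidecar container into the Pod.
kubectl logs -f fio-bench-read-http1 -c fio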
```
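
When running several benchmarks back to back, it is useful to wait for one Pod to finish before applying the next. The sketch below polls the Pod phase with `kubectl get pod ... -o jsonpath={.status.phase}`. The helper names, polling interval, and timeout are illustrative assumptions rather than values taken from the repository.

```python
import subprocess
import time


def pod_phase(pod, namespace="default", run=subprocess.run):
    """Return the Pod's status.phase as reported by kubectl."""
    result = run(
        ["kubectl", "get", "pod", pod, "--namespace", namespace,
         "-o", "jsonpath={.status.phase}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_completion(pod, namespace="default", poll_seconds=30,
                        timeout_seconds=3600, phase_fn=pod_phase,
                        sleep=time.sleep):
    """Poll until the Pod is Succeeded or Failed, then return that phase."""
    waited = 0
    while True:
        phase = phase_fn(pod, namespace)
        if phase in ("Succeeded", "Failed"):
            return phase
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"{pod} still in phase {phase!r} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

A call such as `wait_for_completion("fio-bench-read-http1")` returns `"Succeeded"` or `"Failed"`. A `Succeeded` phase only means the containers exited cleanly, so it is still worth checking the logs or the BigQuery table to confirm that metrics were published.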