docs: add GCE and GKE documentation for running NPI benchmarks #126
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,93 @@ | ||
| # Running NPI Benchmarks on GCE | ||
|
|
||
| This guide explains how to build the required Docker images and run the NPI (Network Performance Improvement) benchmarks on a Google Compute Engine (GCE) VM. | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| 1. **Google Cloud Project**: You need a GCP project to host your Artifact Registry, Cloud Storage bucket, and BigQuery dataset. | ||
| 2. **VM Setup**: A GCE VM with Docker installed. The VM must be created with the following access scopes for its service account: | ||
| * `https://www.googleapis.com/auth/bigquery` | ||
| * `https://www.googleapis.com/auth/devstorage.read_write` | ||
| * `https://www.googleapis.com/auth/cloud-platform` (or granular scopes) | ||
| 3. **Authentication**: If not using a VM service account, you must authenticate using `gcloud auth login` and `gcloud auth application-default login`. | ||
| 4. **BigQuery Dataset**: Create a BigQuery dataset to store the benchmark results. | ||
| 5. **Artifact Registry**: You must have an Artifact Registry repository to host the Docker images. | ||
| ```bash | ||
| gcloud services enable artifactregistry.googleapis.com --project=YOUR_PROJECT_ID | ||
| gcloud artifacts repositories create gcsfuse-benchmarks \ | ||
| --repository-format=docker \ | ||
| --location=us \ | ||
| --project=YOUR_PROJECT_ID | ||
| ``` | ||
|
|
||
| ## Step 1: Build the Benchmark Images | ||
|
|
||
| The NPI benchmarks use Docker containers to provide an isolated environment for running tests. You can build and push these images to your Artifact Registry using the provided `Makefile`. | ||
|
|
||
| By default, the `Makefile` builds for project `gcs-fuse-test`. You should either edit the `Makefile` or pass your project ID to the `make` command: | ||
|
|
||
| ```bash | ||
| cd gcsfuse-tools/npi | ||
|
|
||
| # Option 1: Override project and version in the make command | ||
| make build PROJECT=YOUR_PROJECT_ID GCSFUSE_VERSION=v3.5.6 | ||
|
|
||
| # Option 2: Edit Makefile directly to change the PROJECT variable, then run: | ||
| make build | ||
| ``` | ||
|
|
||
| This will trigger a Cloud Build job (`cloudbuild.yaml`) that builds the `fio-read-benchmark`, `fio-write-benchmark`, and `orbax-emulated-benchmark` images and pushes them to `us-docker.pkg.dev/YOUR_PROJECT_ID/gcsfuse-benchmarks/`. | ||
|
|
||
| ## Step 2: Run the Benchmarks | ||
|
|
||
| Once the images are built, you can use the `npi.py` script to orchestrate the benchmark runs. The script automatically generates the correct `docker run` commands based on the desired benchmarks and configurations. | ||
|
|
||
| ### Basic Usage | ||
|
|
||
| Run all available benchmarks: | ||
|
|
||
| ```bash | ||
| python3 npi.py \ | ||
| --benchmarks 'all' \ | ||
| --bucket-name YOUR_GCS_BUCKET \ | ||
| --project-id YOUR_PROJECT_ID \ | ||
| --bq-dataset-id YOUR_BQ_DATASET_ID \ | ||
| --gcsfuse-version v3.5.6 | ||
| ``` | ||
|
|
||
| ### Specifying Benchmarks | ||
|
|
||
| To run only a subset, pass a space-separated list of benchmark names. For example, to run only the `read_http1` and `write_grpc` benchmarks: | ||
|
|
||
| ```bash | ||
| python3 npi.py \ | ||
| --benchmarks read_http1 write_grpc \ | ||
| --bucket-name YOUR_GCS_BUCKET \ | ||
| --project-id YOUR_PROJECT_ID \ | ||
| --bq-dataset-id YOUR_BQ_DATASET_ID \ | ||
| --gcsfuse-version v3.5.6 | ||
| ``` | ||
|
|
||
| ### Understanding the Parameters | ||
|
|
||
| * `--benchmarks`: `all`, or a space-separated list of specific benchmarks (e.g., `read_http1`, `orbax_read_grpc_numa0_fio_bound`). | ||
| * `--bucket-name`: The GCS bucket where FIO will read/write data. | ||
| * `--project-id`: The GCP Project ID containing your BigQuery dataset. | ||
| * `--bq-dataset-id`: The BigQuery dataset where results will be inserted. | ||
| * `--gcsfuse-version`: The version tag used when building the images (e.g., `v3.5.6`). | ||
| * `--iterations`: (Optional) Number of FIO test iterations. Default is 5. | ||
| * `--temp-dir`: (Optional) FUSE temp directory type (`boot-disk` or `memory`). | ||
|
|
||
| ### Dry Run | ||
|
|
||
| To see exactly what `docker run` commands the script will execute without actually running them, add the `--dry-run` flag: | ||
|
|
||
| ```bash | ||
| python3 npi.py \ | ||
| --benchmarks read_http1 \ | ||
| --bucket-name YOUR_GCS_BUCKET \ | ||
| --project-id YOUR_PROJECT_ID \ | ||
| --bq-dataset-id YOUR_BQ_DATASET_ID \ | ||
| --gcsfuse-version v3.5.6 \ | ||
| --dry-run | ||
| ``` |
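The invocations above differ only in a few flags, so they can be assembled programmatically. The following is a minimal sketch, not part of the PR: `build_npi_command` is a hypothetical helper name, and it only uses the flags documented in this guide. It assumes `npi.py` is in the current directory.

```python
import shlex


def build_npi_command(benchmarks, bucket_name, project_id, bq_dataset_id,
                      gcsfuse_version, iterations=None, temp_dir=None,
                      dry_run=False):
    """Return the argv list for one npi.py invocation (hypothetical helper)."""
    cmd = [
        "python3", "npi.py",
        "--benchmarks", *benchmarks,          # ["all"] or specific names
        "--bucket-name", bucket_name,
        "--project-id", project_id,
        "--bq-dataset-id", bq_dataset_id,
        "--gcsfuse-version", gcsfuse_version,
    ]
    if iterations is not None:
        cmd += ["--iterations", str(iterations)]
    if temp_dir is not None:
        # The guide lists exactly these two temp directory types.
        if temp_dir not in ("boot-disk", "memory"):
            raise ValueError(f"unsupported temp dir type: {temp_dir}")
        cmd += ["--temp-dir", temp_dir]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


if __name__ == "__main__":
    # Print the command for review; pass the list to subprocess.run to execute it.
    print(shlex.join(build_npi_command(
        ["read_http1", "write_grpc"],
        bucket_name="YOUR_GCS_BUCKET",
        project_id="YOUR_PROJECT_ID",
        bq_dataset_id="YOUR_BQ_DATASET_ID",
        gcsfuse_version="v3.5.6",
        dry_run=True,
    )))
```

Building the command as a list rather than a string avoids shell quoting issues when it is later passed to `subprocess.run`.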
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,157 @@ | ||||||||||
| # Running NPI Benchmarks on GKE | ||||||||||
|
|
||||||||||
| This guide explains how to build and run the NPI (Network Performance Improvement) benchmarks on Google Kubernetes Engine (GKE). Unlike on GCE, where `npi.py` orchestrates the Docker runs automatically, on GKE you manually define and deploy Pods that use the benchmark images and rely on the GKE GCS Fuse CSI driver to mount the bucket. | ||||||||||
|
|
||||||||||
| ## Step 1: Build the Benchmark Images | ||||||||||
|
|
||||||||||
| Building the Docker images works exactly as it does for GCE: Cloud Build is invoked through the provided `Makefile`. | ||||||||||
|
|
||||||||||
| 1. Ensure your Artifact Registry repository is created: | ||||||||||
| ```bash | ||||||||||
| gcloud services enable artifactregistry.googleapis.com --project=YOUR_PROJECT_ID | ||||||||||
| gcloud artifacts repositories create gcsfuse-benchmarks \ | ||||||||||
| --repository-format=docker \ | ||||||||||
| --location=us \ | ||||||||||
| --project=YOUR_PROJECT_ID | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| 2. Build and push the images: | ||||||||||
| ```bash | ||||||||||
| cd gcsfuse-tools/npi | ||||||||||
| make build PROJECT=YOUR_PROJECT_ID GCSFUSE_VERSION=v3.5.6 | ||||||||||
| ``` | ||||||||||
| This creates images such as `us-docker.pkg.dev/YOUR_PROJECT_ID/gcsfuse-benchmarks/fio-read-benchmark-v3.5.6:latest`. | ||||||||||
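The image reference follows a visible pattern in the example above: registry host, project, repository, then the image name suffixed with the GCSFuse version and tagged `latest`. A small sketch of that pattern is shown here. `benchmark_image` is a hypothetical helper, and the naming rule is inferred from the single example in this guide rather than from `cloudbuild.yaml`, so treat it as an assumption.

```python
def benchmark_image(project_id, image_name, gcsfuse_version,
                    repository="gcsfuse-benchmarks",
                    host="us-docker.pkg.dev"):
    """Compose an image reference following the pattern shown in this guide.

    Assumption: every image is named <image_name>-<gcsfuse_version> and
    tagged "latest", as in the fio-read-benchmark example.
    """
    return (f"{host}/{project_id}/{repository}/"
            f"{image_name}-{gcsfuse_version}:latest")


if __name__ == "__main__":
    for name in ("fio-read-benchmark", "fio-write-benchmark",
                 "orbax-emulated-benchmark"):
        print(benchmark_image("YOUR_PROJECT_ID", name, "v3.5.6"))
```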
|
|
||||||||||
| ## Step 2: Configure Workload Identity (Permissions) | ||||||||||
|
|
||||||||||
| To allow your GKE Pods to access the GCS bucket and write metrics to BigQuery, you should use **Workload Identity**. This links a Kubernetes Service Account (KSA) to a Google Cloud Service Account (GSA). | ||||||||||
|
|
||||||||||
| 1. **Create a Google Cloud Service Account (GSA):** | ||||||||||
| ```bash | ||||||||||
| gcloud iam service-accounts create benchmark-gsa \ | ||||||||||
| --project=YOUR_PROJECT_ID | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| 2. **Grant the necessary roles to the GSA:** | ||||||||||
| The GSA needs permissions to read/write to the GCS bucket and to insert records into BigQuery. | ||||||||||
| ```bash | ||||||||||
| # Grant BigQuery Data Editor | ||||||||||
| gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \ | ||||||||||
| --member "serviceAccount:benchmark-gsa@YOUR_PROJECT_ID.iam.gserviceaccount.com" \ | ||||||||||
| --role "roles/bigquery.dataEditor" | ||||||||||
|
|
||||||||||
| # Grant Storage Object Admin (or a more restricted role) on your bucket | ||||||||||
| gcloud storage buckets add-iam-policy-binding gs://YOUR_BUCKET_NAME \ | ||||||||||
| --member "serviceAccount:benchmark-gsa@YOUR_PROJECT_ID.iam.gserviceaccount.com" \ | ||||||||||
| --role "roles/storage.objectAdmin" | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| 3. **Create a Kubernetes Service Account (KSA):** | ||||||||||
| ```bash | ||||||||||
| kubectl create serviceaccount benchmark-ksa \ | ||||||||||
| --namespace default | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| 4. **Bind the KSA to the GSA:** | ||||||||||
| ```bash | ||||||||||
| gcloud iam service-accounts add-iam-policy-binding benchmark-gsa@YOUR_PROJECT_ID.iam.gserviceaccount.com \ | ||||||||||
| --role roles/iam.workloadIdentityUser \ | ||||||||||
| --member "serviceAccount:YOUR_PROJECT_ID.svc.id.goog[default/benchmark-ksa]" | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| 5. **Annotate the KSA:** | ||||||||||
| ```bash | ||||||||||
| kubectl annotate serviceaccount benchmark-ksa \ | ||||||||||
| --namespace default \ | ||||||||||
| iam.gke.io/gcp-service-account=benchmark-gsa@YOUR_PROJECT_ID.iam.gserviceaccount.com | ||||||||||
| ``` | ||||||||||
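The Workload Identity commands above repeat three derived strings (the GSA email, the `svc.id.goog` member, and the KSA annotation), and a typo in any of them tends to surface only later as a permission error. As a sketch, they can be derived from the four inputs in one place. The function names below are hypothetical; the string formats are the ones used in the commands above.

```python
def gsa_email(project_id, gsa_name):
    """Email address of the Google Cloud Service Account."""
    return f"{gsa_name}@{project_id}.iam.gserviceaccount.com"


def workload_identity_member(project_id, namespace, ksa_name):
    """IAM member string that identifies the Kubernetes Service Account."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def ksa_annotation(project_id, gsa_name):
    """Annotation (key=value) that points the KSA at the GSA."""
    return f"iam.gke.io/gcp-service-account={gsa_email(project_id, gsa_name)}"


if __name__ == "__main__":
    project = "YOUR_PROJECT_ID"
    print(gsa_email(project, "benchmark-gsa"))
    print(workload_identity_member(project, "default", "benchmark-ksa"))
    print(ksa_annotation(project, "benchmark-gsa"))
```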
|
|
||||||||||
| ## Step 3: Run the Benchmarks as Pods | ||||||||||
|
|
||||||||||
| In GKE, we don't use `npi.py`. Instead, we deploy Kubernetes Pods. The GCSFuse mounting is handled directly by the **GKE GCS Fuse CSI driver**, and we pass the mount path to the benchmark container. | ||||||||||
|
|
||||||||||
| ### Important Considerations for GKE | ||||||||||
|
|
||||||||||
| * **NUMA Binding**: NUMA binding does not currently make sense in GKE. Exclude any NUMA-bound benchmarks (e.g., skip anything with `numa0` or `numa1` in the name). | ||||||||||
| * **gRPC Benchmarks**: When running gRPC benchmarks, you **must** change the `mountOptions` in the Pod's volume definition to include `"client-protocol=grpc"`. | ||||||||||
| * **BigQuery Parameters**: The container needs `args` specifying where to send the metrics since it no longer inherits them from `npi.py`. | ||||||||||
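The first two considerations can be applied mechanically to a list of benchmark names. The sketch below assumes that the protocol and NUMA binding are encoded in the benchmark name, as in `read_http1`, `write_grpc`, and `orbax_read_grpc_numa0_fio_bound`, and that any name without a `grpc` token uses `http1`. Both are assumptions drawn from the names in this guide, and the helper names are hypothetical.

```python
import re


def is_numa_bound(benchmark):
    """True if the name contains a NUMA token such as numa0 or numa1."""
    return re.search(r"numa\d+", benchmark) is not None


def mount_options_for(benchmark):
    """Pick the CSI mountOptions value from the benchmark name (assumed rule)."""
    protocol = "grpc" if "grpc" in benchmark.split("_") else "http1"
    return f"client-protocol={protocol}"


def plan_for_gke(benchmarks):
    """Drop NUMA-bound benchmarks and map the rest to their mountOptions."""
    return {name: mount_options_for(name)
            for name in benchmarks if not is_numa_bound(name)}


if __name__ == "__main__":
    names = ["read_http1", "write_grpc", "orbax_read_grpc_numa0_fio_bound"]
    for name, options in plan_for_gke(names).items():
        print(name, options)
```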
|
|
||||||||||
| ### Sample Pod Specification | ||||||||||
|
|
||||||||||
| Here is a sample `pod.yaml` for running the `read_http1` benchmark on a TPU node: | ||||||||||
|
|
||||||||||
| ```yaml | ||||||||||
| apiVersion: v1 | ||||||||||
| kind: Pod | ||||||||||
| metadata: | ||||||||||
|   name: fio-bench-read-http1 | ||||||||||
|   namespace: default | ||||||||||
|   annotations: | ||||||||||
|     gke-gcsfuse/volumes: "true" | ||||||||||
| spec: | ||||||||||
|   nodeSelector: | ||||||||||
|     cloud.google.com/gke-tpu-topology: 2x2 | ||||||||||
|     cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice | ||||||||||
|   restartPolicy: Never | ||||||||||
|   serviceAccountName: benchmark-ksa # KSA configured with Workload Identity | ||||||||||
|   containers: | ||||||||||
|   - name: fio | ||||||||||
|     image: us-docker.pkg.dev/YOUR_PROJECT_ID/gcsfuse-benchmarks/fio-read-benchmark-v3.5.6:latest | ||||||||||
|     args: | ||||||||||
|     - "--mount-path=/data" | ||||||||||
|     - "--iterations=5" | ||||||||||
|     - "--project-id=YOUR_PROJECT_ID" | ||||||||||
|     - "--bq-dataset-id=YOUR_BQ_DATASET_ID" | ||||||||||
|     - "--bq-table-id=fio_read_http1" | ||||||||||
|     securityContext: | ||||||||||
|       privileged: true | ||||||||||
|     volumeMounts: | ||||||||||
|     - name: data-vol | ||||||||||
|       mountPath: /data | ||||||||||
|   volumes: | ||||||||||
|   - name: data-vol | ||||||||||
|     csi: | ||||||||||
|       driver: gcsfuse.csi.storage.gke.io | ||||||||||
|       volumeAttributes: | ||||||||||
|         bucketName: "YOUR_BUCKET_NAME" | ||||||||||
|         mountOptions: "client-protocol=http1" | ||||||||||
|
|
||||||||||
| ### Running a gRPC Benchmark | ||||||||||
|
|
||||||||||
| To run a gRPC benchmark (e.g., `read_grpc`), update the `args` to emit to the correct table, and critically, update the `mountOptions`: | ||||||||||
|
|
||||||||||
| ```yaml | ||||||||||
| # ... metadata and spec ... | ||||||||||
|   containers: | ||||||||||
|   - name: fio | ||||||||||
|     image: us-docker.pkg.dev/YOUR_PROJECT_ID/gcsfuse-benchmarks/fio-read-benchmark-v3.5.6:latest | ||||||||||
|     args: | ||||||||||
|     - "--mount-path=/data" | ||||||||||
|     - "--iterations=5" | ||||||||||
|     - "--project-id=YOUR_PROJECT_ID" | ||||||||||
|     - "--bq-dataset-id=YOUR_BQ_DATASET_ID" | ||||||||||
|     - "--bq-table-id=fio_read_grpc" | ||||||||||
|     # ... volumeMounts ... | ||||||||||
|   volumes: | ||||||||||
|   - name: data-vol | ||||||||||
|     csi: | ||||||||||
|       driver: gcsfuse.csi.storage.gke.io | ||||||||||
|       volumeAttributes: | ||||||||||
|         bucketName: "YOUR_BUCKET_NAME" | ||||||||||
|         mountOptions: "client-protocol=grpc" | ||||||||||
| ``` | ||||||||||
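Because the HTTP/1 and gRPC Pod specs differ only in the Pod name, the BigQuery table, and `mountOptions`, the manifest can be generated instead of hand-edited. The following sketch writes a JSON manifest, which `kubectl apply -f` accepts alongside YAML. `build_pod` is a hypothetical helper. The `fio_<benchmark>` table name and the `fio-bench-<benchmark>` Pod name are patterns inferred from the two examples above, and the TPU `nodeSelector` is left out, so add it back if the Pod must land on a TPU node.

```python
import json


def build_pod(benchmark, image, project_id, bq_dataset_id, bucket_name,
              iterations=5, mount_path="/data"):
    """Build a Pod manifest (as a dict) for one benchmark run.

    Assumptions: the protocol is encoded in the benchmark name, and the
    BigQuery table is named fio_<benchmark>, as in the samples above.
    """
    protocol = "grpc" if "grpc" in benchmark.split("_") else "http1"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": "fio-bench-" + benchmark.replace("_", "-"),
            "namespace": "default",
            "annotations": {"gke-gcsfuse/volumes": "true"},
        },
        "spec": {
            "restartPolicy": "Never",
            "serviceAccountName": "benchmark-ksa",
            "containers": [{
                "name": "fio",
                "image": image,
                "args": [
                    f"--mount-path={mount_path}",
                    f"--iterations={iterations}",
                    f"--project-id={project_id}",
                    f"--bq-dataset-id={bq_dataset_id}",
                    f"--bq-table-id=fio_{benchmark}",
                ],
                "securityContext": {"privileged": True},
                "volumeMounts": [{"name": "data-vol",
                                  "mountPath": mount_path}],
            }],
            "volumes": [{
                "name": "data-vol",
                "csi": {
                    "driver": "gcsfuse.csi.storage.gke.io",
                    "volumeAttributes": {
                        "bucketName": bucket_name,
                        "mountOptions": f"client-protocol={protocol}",
                    },
                },
            }],
        },
    }


if __name__ == "__main__":
    pod = build_pod(
        "read_grpc",
        image=("us-docker.pkg.dev/YOUR_PROJECT_ID/gcsfuse-benchmarks/"
               "fio-read-benchmark-v3.5.6:latest"),
        project_id="YOUR_PROJECT_ID",
        bq_dataset_id="YOUR_BQ_DATASET_ID",
        bucket_name="YOUR_BUCKET_NAME",
    )
    # Redirect to a file (e.g. pod.json) and run: kubectl apply -f pod.json
    print(json.dumps(pod, indent=2))
```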
|
|
||||||||||
| ### Executing the Benchmark | ||||||||||
|
|
||||||||||
| Apply the YAML to your cluster to start the benchmark: | ||||||||||
|
|
||||||||||
| ```bash | ||||||||||
| kubectl apply -f pod.yaml | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| Monitor the logs to ensure the benchmark finishes and metrics are published to BigQuery: | ||||||||||
|
|
||||||||||
| ```bash | ||||||||||
| kubectl logs -f fio-bench-read-http1 | ||||||||||
| ``` | ||||||||||
The `roles/storage.objectAdmin` role is overly permissive as it allows management of the bucket itself (e.g., deleting the bucket or changing its IAM policy). For benchmarking, `roles/storage.objectUser` is a more secure alternative that provides full access to objects (read, write, delete) without bucket-level administrative permissions, adhering to the principle of least privilege.