Basic cluster and single machine TensorFlow script ready. by greg-ogs · Pull Request #17 · greg-ogs/PySpark-TF-GKE

greg-ogs · 2025-10-24T01:26:58Z

The basic architecture of the TensorFlow script is in place and ready for basic CNN and ANN models.

The Spark deployment is ready, provided the data is in a MySQL server.
The Spark deployment is basic and is ready for GCP.
The TensorFlow deployment is ready for local servers (GCP and AWS deployments are pending budget approval).
The image-based model is ready and working (the dataset split for the parameter server cluster is still missing).

- Increased MySQL replica count from 3 to 5.
- Changed base image to `gregogs/percona-xtrabackup:8.4.0` and updated user to 999.
- Replaced `microdnf install` with `whoami` command and added MySQL root user for replication commands (consider using credentials for a user other than root to address the remaining security concerns).
- Migrated Dockerfile to example-specific directory and included essential utilities.

- Introduced `mysql-external` LoadBalancer service for external database access.
- Updated `requirements.txt` to include `mysql-connector-python` and adjusted dependencies.
- Extended MySQL example README with instructions for loading CSV data and port forwarding setup in Python scripts.

- Introduced `load_csv.py` to import data from `health.csv` into MySQL.
- Included functionality to create database and table if they do not exist.
- Added batch processing for efficient data insertion.
- Created `manege.sql` for basic SQL transaction commands.
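The batch processing mentioned above can be sketched as follows. This is a minimal illustration and not the code in `load_csv.py`: the helper name `iter_batches` and the batch sizes are assumptions, and the actual insert call is only hinted at in a comment.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_batches(rows: Iterable, batch_size: int = 1000) -> Iterator[List]:
    """Yield successive lists of at most `batch_size` rows (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Assumed usage with a DB-API cursor (not executed here):
# for batch in iter_batches(csv_rows, batch_size=500):
#     cursor.executemany(insert_sql, batch)
#     connection.commit()
```

Sending rows in groups keeps the number of round trips to the database low without holding the whole CSV in a single statement.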

- Restructured GCP and Spark-related Terraform files into `infra/terraform/GCP`.
- Moved Spark deployment files to `infra/spark` directory.
- Reorganized MySQL example files into `infra/mysql-database`.
- Added detailed `README.md` files for GCP Terraform and MySQL database setup.

- Moved Spark-related scripts from `examples/spark_workloads` to `spark_checks/python_checks/`.
- Updated and expanded `README.md` with detailed instructions for GKE cluster access, Spark manifest deployments, and MySQL usage.
- Introduced MySQL StatefulSet setup with preliminary configurations and testing commands.
- Adjusted `examples/dockerfiles/requirements.txt` path to `spark_checks/dockerfiles/requirements.txt`.

- Add custom Spark image with JDBC connector and updated dependencies.
- Configure Spark master and worker deployments with the new image and adjust replicas (workers: 4, MySQL: 3).
- Include `requirements.txt` for Spark dependencies.
- Introduce resources for Spark workers and JDBC jar placement.
- Refine MySQL statefulset for reduced replica count.

- Replace `python_checks/Dockerfile` and `requirements.txt` with a new centralized `bastion.Dockerfile`.
- Introduce Kubernetes resources: `bastion_pod.yaml` and `spark-workload-service.yaml`.
- Enhance `spark_workload_to_local_k8s.py` for Kubernetes compatibility with environment variable overrides and better DNS resolution.
- Consolidate infrastructure under `infra/general_spark`.

- Refine README.md for local Kubernetes Spark deployment with streamlined instructions.
- Replace verbose driver and executor configuration with concise environment variable examples.
- Add clear steps for building Docker images, applying manifests, and submitting Spark jobs.
- Enhance clarity on using Spark workload pod and MySQL database setup.

- Consolidate and relocate infrastructure files under `infra/cloud` and `infra/local` directories.
- Rename and streamline Spark setup scripts for Kubernetes compatibility.
- Introduce new Python scripts for MySQL data retrieval using PySpark.
- Update and modularize Dockerfile and manifest paths for Spark workloads.
- Enhance MySQL and Spark deployment clarity with improved structure.

- Remove `spark-master-ingress.yaml` in favor of direct port-forwarding for Spark and MySQL services.
- Add `README.md` under `infra/local/external_workloads` with detailed steps for running Spark driver outside Kubernetes.
- Introduce `docker-compose.yml` for external Spark driver configuration.
- Add `spark_retrieve_data_outside.py` to demonstrate MySQL data retrieval using external Spark driver.
- Update `spark-master-service.yaml` to use LoadBalancer type for Spark master service.

…s integration
- Integrated K-Means clustering into `spark_retrieve_data_outside.py`.
- Enhanced `spark_workload_to_local_k8s.py` with partitioning options for efficient MySQL reads.
- Updated `spark-worker-deployment.yaml` to scale worker replicas from 4 to 16.
- Adjusted `bastion.Dockerfile` and paths for requirements and JDBC jar placement.
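As a rough illustration of partitioned MySQL reads, the sketch below builds the standard Spark JDBC options `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions`. The helper name, the example URL, and the table and column names are assumptions for illustration; the options actually used in `spark_workload_to_local_k8s.py` are not shown in this PR.

```python
from typing import Dict


def jdbc_partition_options(url: str, table: str, column: str,
                           lower: int, upper: int, num_partitions: int) -> Dict[str, str]:
    """Build Spark JDBC reader options for a range-partitioned read (hypothetical helper)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")
    if lower >= upper:
        raise ValueError("lower must be smaller than upper")
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": column,   # must be a numeric, date, or timestamp column
        "lowerBound": str(lower),    # bounds only decide the partition stride,
        "upperBound": str(upper),    # they do not filter rows
        "numPartitions": str(num_partitions),
    }


# Assumed usage with an existing SparkSession (not executed here):
# opts = jdbc_partition_options("jdbc:mysql://mysql:3306/health", "records", "id", 1, 100000, 8)
# df = spark.read.format("jdbc").options(**opts).load()
```

With these options Spark issues one query per partition in parallel instead of pulling the whole table through a single connection.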

…data_outside.py`
- Reduce Spark worker replicas from 16 to 8 in `spark-worker-deployment.yaml`.
- Adjust resource requests: decrease CPU to `250m`, add memory request of `512Mi`.
- Enhance logging configuration in `spark_retrieve_data_outside.py` for better readability and error handling.
- Suppress unnecessary Spark logs by setting log level to ERROR.
- Limit displayed rows from 50 to 5 when showing data and improve logging for K-Means stages.
- Add `--conf spark.ui.showConsoleProgress=false` to reduce Spark UI log verbosity in README example.

…g, single-row inference, and weighted feature engineering
- Introduce support for in-memory K-Means and Pipeline models, replacing on-disk model saving.
- Add single-row inference functionality with hardcoded inputs mimicking feature schema.
- Implement dynamic weighting for `measure_name_vec` in feature vector based on the `MEASURE_NAME_WEIGHT` environment variable.
- Update K-Means configuration: increase `k` to 25 and `maxIter` to 1000 for better clustering.
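The environment-driven weighting can be pictured with the plain-Python sketch below. It only illustrates the idea of scaling one block of the feature vector before clustering. The function names, the default weight of 1.0, and the use of Python lists instead of Spark ML vectors are all assumptions; the PR does not show how the workload applies the weight.

```python
import os
from typing import List, Mapping, Sequence


def read_weight(env: Mapping[str, str] = os.environ,
                name: str = "MEASURE_NAME_WEIGHT",
                default: float = 1.0) -> float:
    """Read a non-negative weight from the environment (default value is assumed)."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    weight = float(raw)
    if weight < 0:
        raise ValueError(f"{name} must be non-negative")
    return weight


def weighted_features(measure_name_vec: Sequence[float],
                      other_features: Sequence[float],
                      weight: float) -> List[float]:
    """Scale the measure-name block, then append the remaining features unchanged."""
    return [weight * value for value in measure_name_vec] + list(other_features)
```

Because K-Means relies on Euclidean distance, a weight above 1 makes differences in the scaled block count for more when points are assigned to clusters.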

- Add `KMeansWorkload` class in `k_means.py` for clustering, feature engineering, and in-memory model handling.
- Refactor code to split responsibilities:
  - `google_health_SQL.py` now focuses on MySQL data retrieval.
  - `spark_session.py` handles Spark session creation outside Kubernetes.
- Remove `main.py` and integrate functionality into the updated modules.
- Reduce Spark worker replicas from 8 to 4 in `spark-worker-deployment.yaml`.

…or modularity and streamline MySQL data retrieval logic
- Remove unused methods and redundant functionality from `spark_retrieve_data_outside.py`.
- Move core MySQL data retrieval functionality to newly created `google_health_SQL.py`.
- Simplify class design by injecting `logger`, `DB`, and `spark` as parameters.

- Introduce MetalLB manifests, address pool, and configuration script in `infra/local/tf`.
- Add TensorFlow-based `Dockerfile` in `infra/local/tf` for GPU-enabled environment.
- Update `.gitignore` to exclude `config-kind-in-container`.
- Adjust `docker-compose.yml` to include `tf-bastion` service with necessary mounts and configurations.

Add support for chief (coordinator) node in TensorFlow distributed training.

The current `train_tf_ps.py` is only a placeholder for development testing and must be refined and properly commented before it is used for a real workload.

- Extend `train_tf_ps.py` to include a chief address and port in cluster configuration.
- Update `make_parameter_server_strategy` to validate and configure the chief node, including `TF_CONFIG` setup.
- Modify `run_tf_training_from_bastion.sh` to detect and validate a routable IPv4 address for the chief node.
- Adjust `.gitignore` to exclude `output/`.
- Add gRPC port mapping for the chief node in `docker-compose.yml`.
- Update model saving path in `train_tf_ps.py` to include `.keras` extension.
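For orientation, a `TF_CONFIG` value for a cluster with a chief, workers, and parameter servers has the shape sketched below. This is a minimal illustration, not the logic in `make_parameter_server_strategy`: the helper name `build_tf_config` and every address are hypothetical.

```python
import json
from typing import Sequence


def build_tf_config(chief: str, workers: Sequence[str], ps: Sequence[str],
                    task_type: str, task_index: int = 0) -> str:
    """Return a TF_CONFIG JSON string for one task of the cluster (hypothetical helper)."""
    cluster = {"chief": [chief], "worker": list(workers), "ps": list(ps)}
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task_index is out of range for this task type")
    return json.dumps({"cluster": cluster,
                       "task": {"type": task_type, "index": task_index}})


# Assumed usage on the coordinator before creating the strategy (not executed here):
# import os
# os.environ["TF_CONFIG"] = build_tf_config(
#     "10.0.0.5:2222", ["worker-0:2222", "worker-1:2222"], ["ps-0:2222"], "chief")
```

Every task receives the same `cluster` section and differs only in its `task` entry, which is why the chief address has to be routable from the workers and parameter servers.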

- Rename `local_cluster_workloads` to `raw-tf` for improved clarity and alignment.
- Adjust scripts, YAMLs, Dockerfile, and paths in `infra/local` and `workloads` to reflect the new directory structure.
- Update `docker-compose.yml` mounts and configurations for TensorFlow coordinator service.
- Revise documentation to include instructions for TensorFlow distributed training outside Kubernetes.

…ion.
- Rename `local_cluster_workloads` to `raw-spark` for consistency.
- Adjust file paths for Spark and MySQL-related scripts, checks, and datasets.
- Update `README.md` with clarified setup instructions and typo corrections.
- Add new `infra/cloud/README.md` documenting GKE Spark-TensorFlow cluster setup.
- Refactor `load_csv.py` to adapt to external lb IP from kind subnet and updated dataset location.

…docstrings, improved data handling, and logging cleanup.
- Add detailed docstrings for functions and workflow.
- Enhance docs for `_open_text` and `load_health_csv`.
- Adjust the default dataset path for improved portability.

…efinements.
- Add detailed docstrings for key functions and training process.
- Introduce an additional dense layer in the model architecture.
- Adjust optimizer learning rate and improve dataset handling for distributed setup.

…training pipeline.
- Introduce functions for creating and handling image datasets (`_list_image_classes`, `_count_images`, `_make_image_dataset`).
- Add a compact CNN model `build_cnn_model` for training on image datasets.
- Extend training logic to support image data (`run_image_training`).
- Update argument parsing to include image-specific options.
- Adjust `.gitignore` to exclude `image-datasets/`.

…_tf_ps.py`.
- Generalize CSV loading utilities for broader datasets and add image dataset management functions replacing `_` prefixed counterparts (`open_text`, `list_image_classes`, `count_images`, `make_image_dataset`).
- Consolidate training logic by renaming key entry points (`run_training` -> `run_deep_training`) and models (`build_sequential_model` -> `build_deep_model`).
- Improve docstrings for clarity, update default dataset paths, and set default epochs to 10.

- Change activation function to PReLU.
- Disable CNN dataset shuffling.
- Deprecate folder-per-class structure; implement `labels.jsonl` support for flat image directories.
- Replace classification logic with CNN regressor for predicting pixel coordinates `(x_px, y_px)`.
- Disable pixel normalization for the rescaling.
- Revise dataset loading to handle `labels.jsonl` and ensure targets align with resized image dimensions.
- Update training pipeline with regression-specific metrics, optimizer, and model design.
- Streamline docstrings, argument parsing, and default settings.
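Keeping the regression targets aligned with resized images means scaling each coordinate by the same factor as its image axis. The sketch below shows that arithmetic under stated assumptions: only `x_px` and `y_px` come from the commit message, while the `file` field, the helper names, and the example sizes are hypothetical.

```python
import json
from typing import Tuple


def parse_label_line(line: str) -> Tuple[str, float, float]:
    """Parse one JSON Lines record; the `file` field name is an assumption."""
    record = json.loads(line)
    return record["file"], float(record["x_px"]), float(record["y_px"])


def rescale_target(x_px: float, y_px: float,
                   orig_size: Tuple[int, int],
                   new_size: Tuple[int, int]) -> Tuple[float, float]:
    """Map a pixel coordinate from the original (width, height) to the resized one."""
    orig_w, orig_h = orig_size
    new_w, new_h = new_size
    return x_px * new_w / orig_w, y_px * new_h / orig_h
```

For example, a point at `(100, 50)` in a 640x512 image maps to `(50.0, 25.0)` after resizing to 320x256.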

…chitecture. Note that the flatten layer drastically increases the model size.
- Add `flat_layer` argument for configurable flattening and dense architecture.
- Update `build_cnn_model` to support dynamic layer selection.
- Revise training workflow and adapt argument parsing for `flat_layer`.
- Improve docstrings for newly introduced functionality.

…esizing.
- Update Conv2D kernel sizes to 5 for improved feature extraction.
- Increase epochs to 100 and adjust batch size to 32 for better training.
- Add TensorBoard callback for training visualization (commented and in progress).
- Modify default image resizing dimensions to 256x320.
- Implement GPU memory growth configuration to prevent OOM errors.
- Plot Mean Absolute Error (MAE) during training for insights.
- Expand `.gitignore` to exclude Python artifacts, logs, and additional dataset files.

…sing TensorFlow model.
- Create a `ManualImageChecker` class for image prediction and plotting.
- Load pre-saved TensorFlow model for coordinate prediction.
- Implement image preprocessing and dynamic directory scanning for `.png` files.
- Include plot visualization for predicted coordinates on images.

…ls.jsonl` references.
- Rename dataset label file references from `labels.jsonl` to `clean_labels.jsonl`.
- Adjust error messages and docstrings to reflect the change.
- Remove TensorBoard callback from `model.fit()` invocation.
- Add pre-saved model structure file `100-320-by-256-A1-model.txt`.
- Include corresponding model image file `100-320-by-256-A1-model.png`.

…taset loaders, and update model references.
- Reduce dataset shuffle buffer size from `10000` to `5000` for memory efficiency.
- Increase `--epochs` default from `100` to `150` for extended training.
- Switch dataset loaders to enable shuffling for better data randomness.
- Update image resizing dimensions in `test-model.py` and adjust model reference to `100-320-by-256-A1-model`.
- Add pre-saved model structure file `150-320-by-256-A1-model.txt` and corresponding model image file.

…series-models).
- Reduce dataset shuffle buffer size from `5000` to `3000` for improved memory management.
- Replace prefetch value `AUTOTUNE` with `1` to prevent memory overflow.
- Adjust Conv2D layers and dense units: add smaller filters (e.g., 8, 16), refine dense layer to `2048` or `128` neurons depending on flattening.
- Save training history as `history.json` for reproducibility.
- Set default `flat_layer` argument to `True`.
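Saving the training history can be as small as the sketch below. It assumes the history is a dict mapping metric names to per-epoch values, which is the shape of the `history` attribute on the object returned by Keras `model.fit()`. The helper name is hypothetical and the script's own serialization is not shown in this PR.

```python
import json
from typing import Dict, Sequence


def history_to_json(history: Dict[str, Sequence[float]]) -> str:
    """Serialize per-epoch metrics, casting values to plain floats for JSON."""
    serializable = {name: [float(value) for value in values]
                    for name, values in history.items()}
    return json.dumps(serializable, indent=2)


# Assumed usage after training (not executed here):
# result = model.fit(train_ds, validation_data=val_ds, epochs=150)
# with open("history.json", "w", encoding="utf-8") as handle:
#     handle.write(history_to_json(result.history))
```

The explicit `float` cast matters because metric values may be NumPy scalars, which the standard `json` module does not serialize directly.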

…functionality.
- Switch model reference to `150-320-by-256-B1-model.keras`.
- Enhance visualization logic with dynamic figure clearing and saving plots to `./tf-model/plots/`.
- Add `tqdm` for progress tracking during directory scanning.
- Introduce GPU memory usage info to `150-320-by-256-B1-model.txt`.
- Expand `.gitignore` to exclude saved plots.

…aset handling in a single host and for csv and image.
- Add `validation_split`, `subset`, `seed`, and `repeat` arguments for deterministic train/validation splits and flexible dataset behavior.
- Shuffle image datasets with configurable seed for consistent results.
- Split datasets into training/validation sets with 80/20 split by default.
- Adjust `make_image_dataset` to prefetch batches and optionally repeat datasets.
- Update training workflows to incorporate validation datasets.
- Revise docstrings to reflect new functionality and dataset handling.
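A deterministic split of the kind described above can be sketched in plain Python. The argument names mirror those in the commit message, but the function name, the default seed, and the list-based approach are assumptions rather than the implementation in `make_image_dataset`.

```python
import random
from typing import Iterable, List


def split_items(items: Iterable, validation_split: float = 0.2,
                subset: str = "training", seed: int = 42) -> List:
    """Shuffle with a fixed seed, then return the requested part of the split."""
    if not 0.0 < validation_split < 1.0:
        raise ValueError("validation_split must be between 0 and 1")
    if subset not in ("training", "validation"):
        raise ValueError("subset must be 'training' or 'validation'")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order
    n_val = int(len(shuffled) * validation_split)
    return shuffled[n_val:] if subset == "training" else shuffled[:n_val]
```

Calling the function twice with the same seed, once per subset, yields two disjoint lists that together cover the input, so the validation set never leaks into training.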

alex-ogs

Ready

greg-ogs added 30 commits July 22, 2025 12:52

Add README for MySQL database example.

5356adb

Update README with MySQL database setup and usage.

d50df42

Add huggingface_hub to Dockerfile requirements

fb392c5

- Updated `examples/dockerfiles/requirements.txt` to include `huggingface_hub` package.

Merge remote-tracking branch 'origin/spark-examples-tests' into spark-examples-tests

f646c83

Refactor infra/spark to infra/gcp_spark.

430df84

More specific name for the directories.

Testing dataset and readme for local cluster testings added.

a959250

Deployment for local cluster of spark is working. An example similar to the gcp cluster is in dev

edd76e6

Retrieve information from the mysql database using containerized pyspark.

c2f055d

Rename spark_retrieve_data.py to pod_google_health_SQL.py for improved clarity and alignment with workload purpose.

333a889

Adjust bastion setup for local Spark workloads

1770b07

- Add `requirements.txt` with dependencies for local Spark execution.
- Update `bastion.Dockerfile` to reference correct paths for requirements and JDBC jar.
- Rename `bastion_pod.yaml` to `bastion_as_pod.yaml` for clarity.

First TF try

49810b0

Second TF try, no gpu for kind cluster (containers of the kind cluster require cuda, and I don't want to build it and is not required for testing)

8ee19ce

Third tf try, the first workload

1bac559

Switch train_tf_ps.py to ClusterCoordinator custom training.

063acab

Problems with the callback of chief

greg-ogs added 22 commits September 8, 2025 19:51

Metrics server added to check the correct distribution of the tf workload.

93e830a

Increase default epochs to 10 in the TensorFlow training script and update README with metrics-server setup instructions.

Add doc to sh starting script for the model training "run_tf_training_from_bastion.sh"

0ae5cf4

Refactor train_tf_ps.py by renaming load_health_csv to load_csv for broader usability.

cfd8eb1

- Update function usage and docstrings accordingly.
- Fix minor typo in comments for data source resolution.

Minor docstrings changes in model info files

2b86bec

New model architecture and new history json file

1c0a5a4

Callbacks and prefetch info are added

46343a4

greg-ogs requested a review from alex-ogs October 24, 2025 01:26

greg-ogs assigned greg-ogs and alex-ogs Oct 24, 2025

greg-ogs added the bug, documentation, and enhancement labels Oct 24, 2025

alex-ogs approved these changes Oct 24, 2025

alex-ogs merged commit 4b9a236 into main Oct 24, 2025
5 checks passed

alex-ogs merged 52 commits into main from spark-examples-tests
