Basic cluster and single machine TensorFlow script ready. #17

Merged: alex-ogs merged 52 commits into main from spark-examples-tests on Oct 24, 2025

Conversation

@greg-ogs

Owner

The basic architecture of the TensorFlow script is in place and ready for basic CNN and ANN models.

  • The Spark deployment is ready, provided the data lives in a MySQL server.
  • The Spark deployment is basic but ready for GCP.
  • The TensorFlow deployment is ready for local servers (GCP and AWS are pending budget approval).
  • The image-based model is ready and working (the dataset split for the parameter server cluster is still missing).

greg-ogs added 30 commits July 22, 2025 12:52
- Increased MySQL replica count from 3 to 5.
- Changed base image to `gregogs/percona-xtrabackup:8.4.0` and updated user to 999.
- Replaced `microdnf install` with `whoami` command and added MySQL root user for replication commands (still to do: use a non-root user's credentials to address the remaining security concerns).
- Migrated Dockerfile to example-specific directory and included essential utilities.
- Introduced `mysql-external` LoadBalancer service for external database access.
- Updated `requirements.txt` to include `mysql-connector-python` and adjusted dependencies.
- Extended MySQL example README with instructions for loading CSV data and port forwarding setup in Python scripts.
- Introduced `load_csv.py` to import data from `health.csv` into MySQL.
- Included functionality to create database and table if they do not exist.
- Added batch processing for efficient data insertion.
- Created `manege.sql` for basic SQL transaction commands.
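
As a rough illustration of the batch insertion mentioned for `load_csv.py`, the sketch below splits rows into fixed-size chunks so that each chunk can go to the database in one call. The helper name `iter_batches` and the default batch size are assumptions made for this example, not code taken from the PR.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[str, ...]


def iter_batches(rows: Iterable[Row], batch_size: int = 1000) -> Iterator[List[Row]]:
    """Yield consecutive lists of at most `batch_size` rows (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        # islice consumes up to batch_size items without loading the whole input.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Assumed usage with an already open mysql-connector-python cursor:
#   for batch in iter_batches(csv_rows, 500):
#       cursor.executemany(insert_sql, batch)
#   connection.commit()
```

Sending one `executemany` call per chunk keeps memory bounded for large CSV files while avoiding a round trip per row.
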
- Updated `examples/dockerfiles/requirements.txt` to include `huggingface_hub` package.
- Restructured GCP and Spark-related Terraform files into `infra/terraform/GCP`.
- Moved Spark deployment files to `infra/spark` directory.
- Reorganized MySQL example files into `infra/mysql-database`.
- Added detailed `README.md` files for GCP Terraform and MySQL database setup.
- Moved Spark-related scripts from `examples/spark_workloads` to `spark_checks/python_checks/`.
- Updated and expanded `README.md` with detailed instructions for GKE cluster access, Spark manifest deployments, and MySQL usage.
- Introduced MySQL StatefulSet setup with preliminary configurations and testing commands.
- Adjusted `examples/dockerfiles/requirements.txt` path to `spark_checks/dockerfiles/requirements.txt`.
More specific names for the directories.
- Add custom Spark image with JDBC connector and updated dependencies.
- Configure Spark master and worker deployments with the new image and adjust replicas (workers: 4, MySQL: 3).
- Include `requirements.txt` for Spark dependencies.
- Introduce resources for Spark workers and JDBC jar placement.
- Refine MySQL statefulset for reduced replica count.
- Replace `python_checks/Dockerfile` and `requirements.txt` with a new centralized `bastion.Dockerfile`.
- Introduce Kubernetes resources: `bastion_pod.yaml` and `spark-workload-service.yaml`.
- Enhance `spark_workload_to_local_k8s.py` for Kubernetes compatibility with environment variable overrides and better DNS resolution.
- Consolidate infrastructure under `infra/general_spark`.
- Refine README.md for local Kubernetes Spark deployment with streamlined instructions.
- Replace verbose driver and executor configuration with concise environment variable examples.
- Add clear steps for building Docker images, applying manifests, and submitting Spark jobs.
- Enhance clarity on using Spark workload pod and MySQL database setup.
- Consolidate and relocate infrastructure files under `infra/cloud` and `infra/local` directories.
- Rename and streamline Spark setup scripts for Kubernetes compatibility.
- Introduce new Python scripts for MySQL data retrieval using PySpark.
- Update and modularize Dockerfile and manifest paths for Spark workloads.
- Enhance MySQL and Spark deployment clarity with improved structure.
- Remove `spark-master-ingress.yaml` in favor of direct port-forwarding for Spark and MySQL services.
- Add `README.md` under `infra/local/external_workloads` with detailed steps for running Spark driver outside Kubernetes.
- Introduce `docker-compose.yml` for external Spark driver configuration.
- Add `spark_retrieve_data_outside.py` to demonstrate MySQL data retrieval using external Spark driver.
- Update `spark-master-service.yaml` to use LoadBalancer type for Spark master service.
…s integration

- Integrated K-Means clustering into `spark_retrieve_data_outside.py`.
- Enhanced `spark_workload_to_local_k8s.py` with partitioning options for efficient MySQL reads.
- Updated `spark-worker-deployment.yaml` to scale worker replicas from 4 to 16.
- Adjusted `bastion.Dockerfile` and paths for requirements and JDBC jar placement.
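
The partitioning options mentioned for `spark_workload_to_local_k8s.py` most likely correspond to Spark's JDBC partitioning settings (`partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`). A minimal sketch of how such options could be assembled is shown below; the function name and its arguments are hypothetical, and the script in this PR may expose them differently (for instance through environment variables).

```python
from typing import Dict


def jdbc_partition_options(url: str, table: str, user: str, password: str,
                           column: str, lower: int, upper: int,
                           num_partitions: int) -> Dict[str, str]:
    """Build Spark JDBC reader options for a range-partitioned read (illustrative)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower >= upper:
        raise ValueError("lower must be smaller than upper")
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "com.mysql.cj.jdbc.Driver",
        # Spark splits [lower, upper) on `column` into num_partitions strides,
        # so each executor issues its own range query against MySQL.
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }


# Assumed usage with an existing SparkSession named `spark`:
#   opts = jdbc_partition_options(url, "health", user, pw, "id", 1, 100000, 8)
#   df = spark.read.format("jdbc").options(**opts).load()
```

The partition column has to be a numeric, date, or timestamp column; the bounds only decide the stride, so rows outside them are still read by the first and last partitions.
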
…data_outside.py`

- Reduce Spark worker replicas from 16 to 8 in `spark-worker-deployment.yaml`.
- Adjust resource requests: decrease CPU to `250m`, add memory request of `512Mi`.
- Enhance logging configuration in `spark_retrieve_data_outside.py` for better readability and error handling.
- Suppress unnecessary Spark logs by setting log level to ERROR.
- Limit displayed rows from 50 to 5 when showing data and improve logging for K-Means stages.
- Add `--conf spark.ui.showConsoleProgress=false` to reduce Spark UI log verbosity in README example.
…g, single-row inference, and weighted feature engineering

- Introduce support for in-memory K-Means and Pipeline models, replacing on-disk model saving.
- Add single-row inference functionality with hardcoded inputs mimicking feature schema.
- Implement dynamic weighting for `measure_name_vec` in feature vector based on the `MEASURE_NAME_WEIGHT` environment variable.
- Update K-Means configuration: increase `k` to 25 and `maxIter` to 1000 for better clustering.
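
The `MEASURE_NAME_WEIGHT` mechanism can be pictured as reading a float from the environment and scaling the `measure_name_vec` slice of the feature vector by it. The sketch below uses plain Python lists; the function names and the default of `1.0` are assumptions, and how the PR applies the weight inside the Spark pipeline is not shown on this page.

```python
import os
from typing import List, Mapping


def read_measure_name_weight(env: Mapping[str, str] = os.environ,
                             default: float = 1.0) -> float:
    """Return the weight from MEASURE_NAME_WEIGHT, or `default` when unset or empty."""
    raw = env.get("MEASURE_NAME_WEIGHT")
    if raw is None or raw.strip() == "":
        return default
    return float(raw)


def apply_weight(measure_name_vec: List[float], weight: float) -> List[float]:
    """Scale every component of the encoded measure-name vector by `weight`."""
    return [weight * value for value in measure_name_vec]


# Assumed usage: a one-hot encoded measure name weighted before clustering.
#   weighted = apply_weight([0.0, 1.0, 0.0], read_measure_name_weight())
```

Because K-Means relies on Euclidean distance, a weight above 1 makes differences in the measure name count for more when points are assigned to clusters.
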
…roved clarity and alignment with workload purpose.
- Add `KMeansWorkload` class in `k_means.py` for clustering, feature engineering, and in-memory model handling.
- Refactor code to split responsibilities:
  - `google_health_SQL.py` now focuses on MySQL data retrieval.
  - `spark_session.py` handles Spark session creation outside Kubernetes.
- Remove `main.py` and integrate functionality into the updated modules.
- Reduce Spark worker replicas from 8 to 4 in `spark-worker-deployment.yaml`.
…or modularity and streamline MySQL data retrieval logic

- Remove unused methods and redundant functionality from `spark_retrieve_data_outside.py`.
- Move core MySQL data retrieval functionality to newly created `google_health_SQL.py`.
- Simplify class design by injecting `logger`, `DB`, and `spark` as parameters.
- Add `requirements.txt` with dependencies for local Spark execution.
- Update `bastion.Dockerfile` to reference correct paths for requirements and JDBC jar.
- Rename `bastion_pod.yaml` to `bastion_as_pod.yaml` for clarity.
…r require cuda, and I don't want to build it and is not required for testing)
- Introduce MetalLB manifests, address pool, and configuration script in `infra/local/tf`.
- Add TensorFlow-based `Dockerfile` in `infra/local/tf` for GPU-enabled environment.
- Update `.gitignore` to exclude `config-kind-in-container`.
- Adjust `docker-compose.yml` to include `tf-bastion` service with necessary mounts and configurations.
Add support for chief (coordinator) node in TensorFlow distributed training

The current `train_tf_ps.py` is only a placeholder for development testing and must be refined and properly commented before it is used for a real workload.

- Extend `train_tf_ps.py` to include a chief address and port in cluster configuration.
- Update `make_parameter_server_strategy` to validate and configure the chief node, including `TF_CONFIG` setup.
- Modify `run_tf_training_from_bastion.sh` to detect and validate a routable IPv4 address for the chief node.
- Adjust `.gitignore` to exclude `output/`.
- Add gRPC port mapping for the chief node in `docker-compose.yml`.
- Update model saving path in `train_tf_ps.py` to include `.keras` extension.
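
For readers unfamiliar with `TF_CONFIG`, the sketch below shows the general shape of a cluster specification that includes a chief alongside workers and parameter servers. The function name, the validation rules, and the example addresses are assumptions; `make_parameter_server_strategy` in this PR may build the value differently.

```python
import json
from typing import Dict, List


def build_tf_config(chief: str, workers: List[str], ps: List[str],
                    task_type: str, task_index: int) -> str:
    """Return a TF_CONFIG JSON string for one task of the cluster (illustrative)."""
    cluster: Dict[str, List[str]] = {"chief": [chief], "worker": workers, "ps": ps}
    # Every address is expected in host:port form.
    for addresses in cluster.values():
        for address in addresses:
            host, _, port = address.rpartition(":")
            if not host or not port.isdigit():
                raise ValueError(f"expected host:port, got {address!r}")
    if task_type not in cluster:
        raise ValueError(f"unknown task type {task_type!r}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task_index out of range for this task type")
    return json.dumps({"cluster": cluster,
                       "task": {"type": task_type, "index": task_index}})


# Assumed usage on the coordinator, before any strategy is created:
#   os.environ["TF_CONFIG"] = build_tf_config(
#       "192.168.1.10:2222", ["w0:2222", "w1:2222"], ["ps0:2222"], "chief", 0)
#   resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
```

The chief address must be reachable from the workers and parameter servers, which is why the bastion script in this commit checks for a routable IPv4 address.
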
- Rename `local_cluster_workloads` to `raw-tf` for improved clarity and alignment.
- Adjust scripts, YAMLs, Dockerfile, and paths in `infra/local` and `workloads` to reflect the new directory structure.
- Update `docker-compose.yml` mounts and configurations for TensorFlow coordinator service.
- Revise documentation to include instructions for TensorFlow distributed training outside Kubernetes.
…load.

Increase default epochs to 10 in the TensorFlow training script and update README with metrics-server setup instructions.
…ion.

- Rename `local_cluster_workloads` to `raw-spark` for consistency.
- Adjust file paths for Spark and MySQL-related scripts, checks, and datasets.
- Update `README.md` with clarified setup instructions and typo corrections.
- Add new `infra/cloud/README.md` documenting GKE Spark-TensorFlow cluster setup.
- Refactor `load_csv.py` to use the external LoadBalancer IP from the kind subnet and the updated dataset location.
…docstrings, improved data handling, and logging cleanup.

- Add detailed docstrings for functions and workflow.
- Enhance the docstrings of `_open_text` and `load_health_csv`.
- Adjust a default dataset path for improved portability.
… for broader usability.

- Update function usage and docstrings accordingly.
- Fix minor typo in comments for data source resolution.
…efinements.

- Add detailed docstrings for key functions and training process.
- Introduce an additional dense layer in the model architecture.
- Adjust optimizer learning rate and improve dataset handling for distributed setup.
…training pipeline.

- Introduce functions for creating and handling image datasets (`_list_image_classes`, `_count_images`, `_make_image_dataset`).
- Add a compact CNN model `build_cnn_model` for training on image datasets.
- Extend training logic to support image data (`run_image_training`).
- Update argument parsing to include image-specific options.
- Adjust `.gitignore` to exclude `image-datasets/`.
…_tf_ps.py`.

- Generalize CSV loading utilities for broader datasets and add image dataset management functions replacing `_` prefixed counterparts (`open_text`, `list_image_classes`, `count_images`, `make_image_dataset`).
- Consolidate training logic by renaming key entry points (`run_training` -> `run_deep_training`) and models (`build_sequential_model` -> `build_deep_model`).
- Improve docstrings for clarity, update default dataset paths, and set default epochs to 10.
- Change activation function to PReLU.
- Disable CNN dataset shuffle.
- Deprecate folder-per-class structure; implement `labels.jsonl` support for flat image directories.
- Replace classification logic with CNN regressor for predicting pixel coordinates `(x_px, y_px)`.
- Disable pixel normalization in the rescaling step.
- Revise dataset loading to handle `labels.jsonl` and ensure targets align with resized image dimensions.
- Update training pipeline with regression-specific metrics, optimizer, and model design.
- Streamline docstrings, argument parsing, and default settings.
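
Keeping targets aligned with resized images means scaling each `(x_px, y_px)` pair by the same factors applied to the image. The sketch below parses JSONL records and rescales the coordinates; the field names `file`, `width`, and `height` are assumptions about the label schema, and only `x_px` and `y_px` come from the commit message.

```python
import json
from typing import Iterable, List, Tuple


def scale_targets(lines: Iterable[str], target_w: int,
                  target_h: int) -> List[Tuple[str, float, float]]:
    """Parse JSONL label lines and rescale coordinates to the resized image."""
    scaled = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the label file
        record = json.loads(line)
        # Scale factors from the original image size to the training size.
        sx = target_w / record["width"]
        sy = target_h / record["height"]
        scaled.append((record["file"], record["x_px"] * sx, record["y_px"] * sy))
    return scaled


# Assumed usage:
#   with open("labels.jsonl", encoding="utf-8") as handle:
#       targets = scale_targets(handle, target_w=320, target_h=256)
```

If the original image size is not stored in the label file, it would have to be read from the image itself before computing the scale factors.
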
…chitecture.

Note that the flatten layer drastically increases the model size.

- Add `flat_layer` argument for configurable flattening and dense architecture.
- Update `build_cnn_model` to support dynamic layer selection.
- Revise training workflow and adapt argument parsing for `flat_layer`.
- Improve docstrings for newly introduced functionality.
…esizing.

- Update Conv2D kernel sizes to 5 for improved feature extraction.
- Increase epochs to 100 and adjust batch size to 32 for better training.
- Add TensorBoard callback for training visualization (commented and in progress).
- Modify default image resizing dimensions to 256x320.
- Implement GPU memory growth configuration to prevent OOM errors.
- Plot Mean Absolute Error (MAE) during training for insights.
- Expand `.gitignore` to exclude Python artifacts, logs, and additional dataset files.
…sing TensorFlow model.

- Create a `ManualImageChecker` class for image prediction and plotting.
- Load pre-saved TensorFlow model for coordinate prediction.
- Implement image preprocessing and dynamic directory scanning for `.png` files.
- Include plot visualization for predicted coordinates on images.
…ls.jsonl` references.

- Rename dataset label file references from `labels.jsonl` to `clean_labels.jsonl`.
- Adjust error messages and docstrings to reflect the change.
- Remove TensorBoard callback from `model.fit()` invocation.
- Add pre-saved model structure file `100-320-by-256-A1-model.txt`.
- Include corresponding model image file `100-320-by-256-A1-model.png`.
…taset loaders, and update model references.

- Reduce dataset shuffle buffer size from `10000` to `5000` for memory efficiency.
- Increase `--epochs` default from `100` to `150` for extended training.
- Switch dataset loaders to enable shuffling for better data randomness.
- Update image resizing dimensions in `test-model.py` and adjust model reference to `100-320-by-256-A1-model`.
- Add pre-saved model structure file `150-320-by-256-A1-model.txt` and corresponding model image file.
…series-models).

- Reduce dataset shuffle buffer size from `5000` to `3000` for improved memory management.
- Replace prefetch value `AUTOTUNE` with `1` to prevent memory overflow.
- Adjust Conv2D layers and dense units: add smaller filters (e.g., 8, 16), refine dense layer to `2048` or `128` neurons depending on flattening.
- Save training history as `history.json` for reproducibility.
- Set default `flat_layer` argument to `True`.
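
Saving the training history usually needs one conversion step, since the values Keras reports are often NumPy scalars that `json` cannot serialize directly. The helper below is a minimal sketch under that assumption; its name is hypothetical, and the PR may write `history.json` in another way.

```python
import json
from typing import Mapping, Sequence


def history_to_json(history: Mapping[str, Sequence[float]]) -> str:
    """Serialize a Keras-style history dict to JSON, coercing values to float."""
    serializable = {
        metric: [float(value) for value in values]
        for metric, values in history.items()
    }
    return json.dumps(serializable, indent=2)


# Assumed usage after training, where `fit_result` is the object
# returned by model.fit():
#   with open("history.json", "w", encoding="utf-8") as handle:
#       handle.write(history_to_json(fit_result.history))
```
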
…functionality.

- Switch model reference to `150-320-by-256-B1-model.keras`.
- Enhance visualization logic with dynamic figure clearing and saving plots to `./tf-model/plots/`.
- Add `tqdm` for progress tracking during directory scanning.
- Introduce GPU memory usage info to `150-320-by-256-B1-model.txt`.
- Expand `.gitignore` to exclude saved plots.
…aset handling in a single host and for csv and image.

- Add `validation_split`, `subset`, `seed`, and `repeat` arguments for deterministic train/validation splits and flexible dataset behavior.
- Shuffle image datasets with configurable seed for consistent results.
- Split datasets into training/validation sets with 80/20 split by default.
- Adjust `make_image_dataset` to prefetch batches and optionally repeat datasets.
- Update training workflows to incorporate validation datasets.
- Revise docstrings to reflect new functionality and dataset handling.
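
A deterministic train/validation split can be obtained by shuffling with a fixed seed and slicing at the requested fraction, so that two calls with the same seed always return complementary subsets. The sketch below mirrors the `validation_split`, `subset`, and `seed` arguments named in the commit; the function name and the exact slicing rule are assumptions.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_subset(items: Sequence[T], validation_split: float = 0.2,
                 subset: str = "training", seed: int = 42) -> List[T]:
    """Return the training or validation part of a seeded shuffle of `items`."""
    if not 0.0 < validation_split < 1.0:
        raise ValueError("validation_split must be between 0 and 1")
    if subset not in ("training", "validation"):
        raise ValueError("subset must be 'training' or 'validation'")
    shuffled = list(items)
    # A private Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_split)
    return shuffled[:n_val] if subset == "validation" else shuffled[n_val:]


# Assumed usage: the same seed must be passed to both calls.
#   train_files = split_subset(files, 0.2, "training", seed=7)
#   val_files = split_subset(files, 0.2, "validation", seed=7)
```
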
@greg-ogs greg-ogs requested a review from alex-ogs October 24, 2025 01:26
@greg-ogs greg-ogs added the bug, documentation, and enhancement labels Oct 24, 2025

@alex-ogs alex-ogs left a comment

Ready

@alex-ogs alex-ogs merged commit 4b9a236 into main Oct 24, 2025
5 checks passed

2 participants