A Python ETL repository that ingests Kaggle datasets, runs local extract/transform/load workflows, and optionally uploads raw and processed data to AWS S3.
- Local Kaggle ingestion scripts for e-commerce, healthcare, finance, sports, and climate domains under ingest/
- A modular ETL pipeline under src/ and etl/
- Root environment configuration templates in .env.example and reusable settings under config/
- AWS helper code for S3 upload and optional Redshift load under src/cloud/
- E-commerce analytics SQL in analytics/ecommerce_queries.sql
- Warehouse schema DDL in warehouse/schemas/*.sql
- Monitoring examples in monitoring/
- A pytest-based test suite in tests/
Note: terraform/ contains a Terraform root configuration and AWS provider file, but the referenced Terraform module sources are not included in this repository. The supported workflow is local development with optional AWS helper support.
- Local ETL and data ingestion are implemented in Python.
- AWS S3 upload and optional Redshift helper methods exist, but full multi-service cloud provisioning is not available in this checkout.
- dags/ and k8s/ provide deployment skeletons rather than a complete cloud production stack.
- .env is a local configuration file that should not be committed.
- Data directories under data/ are excluded from version control and should be created locally.
- This repository is best used for local pipeline development, testing, and Kaggle ingestion.
- Python 3.10+
- Git
- pip
- Kaggle account + API credentials
- Optional: AWS CLI and AWS credentials for S3 upload
git clone https://github.com/Victor-Kipruto-Rop/cloud-etl-pipeline.git
cd cloud-etl-pipeline
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
mkdir -p data/raw data/processed data/analytics
python -m ingest.kaggle_ingest --domain ecommerce
The ingest/kaggle_ingest.py script downloads Kaggle dataset files into data/raw/ and can optionally trigger local processing workflows.
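The download step described above could be sketched roughly as follows. This is not the code in ingest/kaggle_ingest.py: the function names `ensure_data_dirs` and `download_dataset` are hypothetical, and the sketch assumes the official `kaggle` package is installed with API credentials configured. The Kaggle import is deferred because that package authenticates on import, so the directory helper still works without credentials.

```python
from pathlib import Path


def ensure_data_dirs(base: Path = Path("data")) -> list[Path]:
    """Create the local working directories (raw, processed, analytics)."""
    dirs = [base / name for name in ("raw", "processed", "analytics")]
    for directory in dirs:
        directory.mkdir(parents=True, exist_ok=True)
    return dirs


def download_dataset(dataset: str, raw_dir: Path, force: bool = False) -> None:
    """Download and unzip a Kaggle dataset into raw_dir.

    Assumption: the official `kaggle` package is used. It is imported here,
    not at module level, because importing it requires valid credentials.
    """
    import kaggle  # authenticates using ~/.kaggle/kaggle.json or env vars

    kaggle.api.dataset_download_files(
        dataset, path=str(raw_dir), force=force, quiet=False, unzip=True
    )
```

Under these assumptions, calling `ensure_data_dirs()` followed by `download_dataset("olistbr/brazilian-ecommerce", Path("data/raw"))` would place the unzipped CSV files in data/raw/.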
AWS_S3_BUCKET=your-bucket \
AWS_REGION=us-west-1 \
KAGGLE_DATASET=olistbr/brazilian-ecommerce \
KAGGLE_DOWNLOAD=true \
KAGGLE_FORCE_DOWNLOAD=true \
.venv/bin/python3 -m src.cloud.aws_etl
This command downloads the specified Kaggle dataset, processes CSV files, writes Parquet outputs to data/processed/, uploads raw CSV files to S3, and optionally uploads processed Parquet files.
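To make the flow above more concrete, here is a minimal sketch of how the environment variables from the command might be read and how files might be converted and uploaded. The variable names come from the command itself; everything else (`AwsEtlSettings`, `load_settings`, `s3_key_for`, `csv_to_parquet`, `upload_files`, the boolean defaults, and the S3 key layout) is a hypothetical illustration and not the actual src/cloud/aws_etl.py. It assumes `pandas` with a Parquet engine and `boto3` are installed; both are imported lazily.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings such as 'true' or '1'."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class AwsEtlSettings:
    bucket: str
    region: str
    kaggle_dataset: str
    kaggle_download: bool
    kaggle_force_download: bool


def load_settings(env: Mapping[str, str] = os.environ) -> AwsEtlSettings:
    """Build settings from environment variables (flags default to False here)."""
    return AwsEtlSettings(
        bucket=env["AWS_S3_BUCKET"],
        region=env["AWS_REGION"],
        kaggle_dataset=env["KAGGLE_DATASET"],
        kaggle_download=_as_bool(env.get("KAGGLE_DOWNLOAD", "false")),
        kaggle_force_download=_as_bool(env.get("KAGGLE_FORCE_DOWNLOAD", "false")),
    )


def s3_key_for(local_path: Path, data_root: Path = Path("data")) -> str:
    """Map data/raw/orders.csv to the key raw/orders.csv (assumed layout)."""
    return local_path.relative_to(data_root).as_posix()


def csv_to_parquet(csv_path: Path, processed_dir: Path) -> Path:
    """Write a Parquet copy of a CSV file; needs pyarrow or fastparquet."""
    import pandas as pd

    out_path = processed_dir / f"{csv_path.stem}.parquet"
    pd.read_csv(csv_path).to_parquet(out_path, index=False)
    return out_path


def upload_files(paths: Iterable[Path], settings: AwsEtlSettings) -> list[str]:
    """Upload files to the configured bucket and return the keys used."""
    import boto3

    s3 = boto3.client("s3", region_name=settings.region)
    keys = []
    for path in paths:
        key = s3_key_for(path)
        s3.upload_file(str(path), settings.bucket, key)
        keys.append(key)
    return keys
```

Keeping configuration parsing separate from the upload call means the settings and key mapping can be unit-tested without AWS credentials.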
.venv/bin/python3 -m pytest -q
- ingest/: dataset download and ingestion orchestration
- src/pipeline.py: local ETL orchestration
- src/extract/, src/transform/, src/load/: pipeline stages
- src/cloud/aws_etl.py: AWS helper orchestration
- src/cloud/aws_s3.py: S3 upload utilities
- src/cloud/aws_redshift.py: Redshift load helper
- dags/: Airflow DAG skeleton for ecommerce ETL orchestration
- k8s/: Kubernetes ETL job manifest skeleton
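The extract, transform, and load stages listed above could be composed along these lines. This is only an illustrative sketch using the standard library: the function names mirror the stage directories, but their signatures, the cleaning rule, and `run_pipeline` are assumptions and do not reflect the actual contents of src/pipeline.py.

```python
import csv
from pathlib import Path
from typing import Iterable, Iterator

Row = dict[str, str]


def extract(csv_path: Path) -> Iterator[Row]:
    """Yield each CSV record as a dict keyed by the header row."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)


def transform(rows: Iterable[Row]) -> Iterator[Row]:
    """Strip whitespace from values and drop rows whose fields are all empty."""
    for row in rows:
        cleaned = {
            key: (value or "").strip()
            for key, value in row.items()
            if key is not None  # ignore surplus columns beyond the header
        }
        if any(cleaned.values()):
            yield cleaned


def load(rows: Iterable[Row], out_path: Path) -> int:
    """Write rows to out_path as CSV and return the number of rows written."""
    materialized = list(rows)
    if not materialized:
        return 0
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(materialized[0].keys()))
        writer.writeheader()
        writer.writerows(materialized)
    return len(materialized)


def run_pipeline(raw_path: Path, processed_path: Path) -> int:
    """Chain the three stages; generators keep extract and transform lazy."""
    return load(transform(extract(raw_path)), processed_path)
```

Because each stage takes and returns plain iterables, a pytest test can feed a small temporary CSV through `run_pipeline` and assert on the output file without touching Kaggle or AWS.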
cloud-etl-pipeline/
├── README.md
├── ARCHITECTURE.md
├── DEPLOYMENT.md
├── IMPLEMENTATION_SUMMARY.md
├── RELEASE_NOTES.md
├── CREDIBILITY_AUDIT_FIXES.md
├── TROUBLESHOOTING.md
├── PROJECT_STRUCTURE.md
├── requirements.txt
├── pyproject.toml
├── .env.example
├── config/
│   ├── aws_config.yaml
│   └── domains.yaml
├── .github/
│   └── workflows/
│       ├── ci.yml
│       ├── lint.yml
│       ├── python-app.yml
│       ├── run_pipeline.yml
│       ├── deploy_glue.yml
│       └── deploy_terraform.yml
├── ingest/
│   ├── kaggle_ingest.py
│   ├── ecommerce_ingest.py
│   ├── healthcare_ingest.py
│   ├── finance_ingest.py
│   ├── sports_ingest.py
│   ├── climate_ingest.py
│   └── config.py
├── src/
│   ├── api.py
│   ├── config.py
│   ├── dashboard.py
│   ├── health.py
│   ├── logging_config.py
│   ├── migrations.py
│   ├── pipeline.py
│   ├── validation.py
│   └── cloud/
│       ├── aws_etl.py
│       ├── aws_redshift.py
│       ├── aws_s3.py
│       └── __init__.py
├── etl/
│   ├── __init__.py
│   └── ecommerce_transform.py
├── analytics/
│   └── ecommerce_queries.sql
├── warehouse/
│   └── schemas/
│       ├── ecommerce.sql
│       ├── healthcare.sql
│       ├── finance.sql
│       ├── sports.sql
│       └── climate.sql
├── monitoring/
│   ├── alert_rules.yml
│   ├── docker-compose.monitoring.yml
│   ├── grafana-dashboard.json
│   ├── playbook.md
│   ├── prometheus.yml
│   └── README.md
├── diagrams/
│   └── system_diagrams.md
├── data/
│   ├── raw/
│   ├── processed/
│   └── analytics/
├── tests/
├── terraform/
│   ├── main.tf
│   ├── outputs.tf
│   └── variables.tf
└── infra/
    └── aws/
        └── provider.tf
- data/raw/ and data/processed/ are local working directories.
- terraform/ and infra/aws/ provide AWS configuration skeletons, but the repository is not a complete, runnable cloud deployment package on its own.
- Use the local pipeline path for development and testing.