A small, production-style CLI for validating tabular feature datasets against a contract before training or scoring jobs run.
It checks schema presence, row count, null thresholds, type parsing, duplicate keys, and freshness, then emits a JSON report and CI-friendly exit codes.
```mermaid
flowchart LR
A["Feature dataset CSV"] --> C["feature-pipeline-quality validate"]
B["Contract JSON"] --> C
D["As-of date"] --> C
C --> E["Structured JSON report"]
C --> F["Exit code for CI or orchestration"]
F --> G{"Passed?"}
G -->|Yes| H["Training or scoring job proceeds"]
G -->|No| I["Pipeline stops with explicit failure"]
```
Feature pipelines often fail silently until model training or scoring breaks downstream. This repo demonstrates a simple contract-check pattern that can run in scheduled jobs, CI, or orchestration tasks.
- Required column checks
- Row count minimums
- Column-level null ratio thresholds
- Type parsing checks (`str`, `int`, `float`, `bool`, `date`, `datetime`)
- Duplicate detection on composite keys
- Freshness checks on a date/datetime column
- Structured JSON output + non-zero exit code on failure
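The repo's own implementation of these checks lives in the package; as a minimal illustrative sketch (not the actual code), a column-level null ratio check over a CSV can be written with the standard library like this:

```python
import csv
from io import StringIO

def null_ratio(rows, column):
    """Fraction of rows where `column` is empty or missing."""
    total = 0
    nulls = 0
    for row in rows:
        total += 1
        value = row.get(column)
        if value is None or value.strip() == "":
            nulls += 1
    return nulls / total if total else 0.0

# Inline fixture with hypothetical column names, for demonstration only.
CSV_TEXT = "user_id,score\n1,0.5\n2,\n3,0.7\n"
rows = list(csv.DictReader(StringIO(CSV_TEXT)))
ratio = null_ratio(rows, "score")  # 1 of 3 rows has an empty score
```

A contract check then compares `ratio` against the column's configured threshold and records a failure when it is exceeded.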
- `examples/contract.json` shows the validation contract expected by the CLI.
- `examples/features_good.csv` and `examples/features_bad.csv` give passing and failing fixtures.
- `tests/test_validator.py` covers the core contract checks and exit behavior.
- `pyproject.toml` exposes the package metadata and console entry point.
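The authoritative contract format is the one in `examples/contract.json`. Purely as an illustration of the checks listed above, a contract of this kind might look like the following (every field and column name here is hypothetical, not the CLI's actual schema):

```json
{
  "min_rows": 100,
  "columns": {
    "user_id": {"type": "int", "max_null_ratio": 0.0},
    "score": {"type": "float", "max_null_ratio": 0.05},
    "event_date": {"type": "date", "max_null_ratio": 0.0}
  },
  "key_columns": ["user_id", "event_date"],
  "freshness": {"column": "event_date", "max_age_days": 3}
}
```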
```shell
python -m feature_pipeline_quality validate \
  --contract examples/contract.json \
  --data examples/features_good.csv \
  --as-of 2026-02-25
```

Failing example:
```shell
python -m feature_pipeline_quality validate \
  --contract examples/contract.json \
  --data examples/features_bad.csv \
  --as-of 2026-02-25
```

Write a JSON report:
```shell
python -m feature_pipeline_quality validate \
  --contract examples/contract.json \
  --data examples/features_good.csv \
  --report /tmp/feature-quality-report.json
```

Exit codes:

- `0` = dataset passed contract checks
- `2` = one or more checks failed
- `1` = invalid input / runtime error
Use this pattern ahead of model training, batch scoring, or feature publication when you want explicit data contracts instead of silent downstream breakage. It pairs naturally with policy gates such as ml-release-gates and the broader operating patterns in ml-platform-reference.
```shell
python -m unittest discover -s tests -v
```