clinops: A Python Toolkit for Clinical ML Data Pipelines #287

Submitting Author: Chaitanya Kasaraneni (@chaitanyakasaraneni)
Package Name: clinops
One-Line Description of Package: Production-grade Python toolkit for clinical ML data pipelines — patient-safe splitting, physiological outlier clipping, unit normalization, and ICD harmonization for MIMIC-IV and FHIR R4
Repository Link (if existing): https://github.com/chaitanyakasaraneni/clinops
EiC: TBD


Code of Conduct & Commitment to Maintain Package

Description

  • Include a brief paragraph describing what your package does:

clinops is a Python library providing validated, reusable abstractions for the data engineering steps that precede model training in clinical ML workflows. It addresses four failure modes endemic to healthcare AI pipelines: patient leakage in train/test splits (the most common source of inflated performance metrics in published ICU prediction models), unit heterogeneity in multi-site data (e.g., glucose in mg/dL vs. mmol/L), ICD-9 to ICD-10 version discontinuities introduced by the October 2015 US coding transition, and physiological outliers that standard statistical methods misclassify as noise. The library ships with four modules: clinops.ingest (MIMIC-IV, FHIR R4, and flat file loaders), clinops.preprocess (outlier clipping, unit normalization, ICD mapping), clinops.temporal (gap-aware sliding windows, lag features, cohort alignment, fit/transform imputation), and clinops.split (temporal, patient-level, and stratified patient splitting). It exposes a scikit-learn-compatible API and is covered by a suite of 118 tests (85% coverage).
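To make the patient-leakage failure mode concrete, here is a minimal standard-library sketch of a patient-level split. It is not the clinops implementation or API: the function name `patient_level_split`, the `subject_id` key, and the toy rows are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def patient_level_split(rows, patient_key="subject_id", test_fraction=0.25, seed=0):
    """Split rows so that every patient lands entirely in train or in test.

    Hypothetical helper for illustration; not part of clinops.
    """
    # Group all rows (e.g. ICU stays) belonging to the same patient.
    by_patient = defaultdict(list)
    for row in rows:
        by_patient[row[patient_key]].append(row)

    # Shuffle patient IDs, not rows, so one patient never straddles the split.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))

    test = [r for pid in patients[:n_test] for r in by_patient[pid]]
    train = [r for pid in patients[n_test:] for r in by_patient[pid]]
    return train, test


# Toy data: 4 patients with 3 stays each (made-up values).
rows = [{"subject_id": pid, "stay": s} for pid in range(4) for s in range(3)]
train, test = patient_level_split(rows)

train_ids = {r["subject_id"] for r in train}
test_ids = {r["subject_id"] for r in test}
print(train_ids & test_ids)  # empty set: no patient appears on both sides
```

A row-level shuffle of the same data would very likely place stays from one patient on both sides, which is the leakage described above.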

Community Partnerships

We partner with communities to support peer review with an additional layer of
checks that satisfy community requirements. If your package fits into an
existing community please check below:

Scope

  • Please indicate which category or categories this package falls under:

    • Data retrieval
    • Data extraction
    • Data processing/munging
    • Data deposition
    • Data validation and testing
    • Data visualization
    • Workflow automation
    • Citation management and bibliometrics
    • Scientific software wrappers
    • Database interoperability

Domain Specific

  • Geospatial
  • Education

  • Explain how and why the package falls under these categories (briefly, 1-2 sentences). For community partnerships, check also their specific guidelines as documented in the links above. Please note any areas you are unsure of:

clinops falls under data retrieval (structured loaders for MIMIC-IV and FHIR R4), data extraction (temporal windowing and lag feature generation from clinical event streams), data processing/munging (unit normalization, ICD harmonization, outlier clipping), and data validation and testing (physiological bound enforcement and schema validation on load). It is domain-specific to clinical/EHR data but does not fit the Geospatial or Education community partnership categories.
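As a rough illustration of the unit normalization and physiological bound enforcement mentioned above, the following sketch converts glucose readings to mmol/L and clips them to a plausibility range. The function names and the bounds are assumptions chosen for the example, not values or APIs taken from clinops; real bounds would need clinical review.

```python
# Glucose has a molar mass of about 180.16 g/mol, so 1 mmol/L is about 18.016 mg/dL.
GLUCOSE_MG_DL_PER_MMOL_L = 18.016

# Hypothetical plausibility bounds in mmol/L (illustrative only).
GLUCOSE_BOUNDS_MMOL_L = (0.5, 60.0)


def glucose_to_mmol_l(value, unit):
    """Convert a glucose reading to mmol/L from a known source unit."""
    if unit == "mmol/L":
        return value
    if unit == "mg/dL":
        return value / GLUCOSE_MG_DL_PER_MMOL_L
    raise ValueError(f"unsupported glucose unit: {unit!r}")


def clip_to_bounds(value, low, high):
    """Clamp a value into [low, high]."""
    return min(max(value, low), high)


# Mixed-unit readings as they might arrive from two sites (made-up values).
readings = [(90.0, "mg/dL"), (5.5, "mmol/L"), (9000.0, "mg/dL")]
normalized = [
    clip_to_bounds(glucose_to_mmol_l(value, unit), *GLUCOSE_BOUNDS_MMOL_L)
    for value, unit in readings
]
print(normalized)
```

Fixed physiological bounds differ from statistical rules such as z-score cutoffs in that they do not depend on the distribution of the cohort, so a genuinely extreme but plausible reading is kept rather than treated as noise.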

  • Who is the target audience and what are the scientific applications of this package?
    The primary audience is clinical ML researchers and healthcare data engineers who build predictive models on electronic health record data — particularly MIMIC-IV, PhysioNet datasets, and institutional EHR exports. Scientific applications include ICU outcome prediction (mortality, length of stay, organ failure), cohort construction for clinical studies, and pipelines that ingest heterogeneous, multi-site EHR data. The library is also relevant to engineers building production ML pipelines in healthcare settings where data quality and leakage prevention are regulatory requirements.

  • Are there other Python packages that accomplish similar things? If so, how does yours differ?
    No existing Python package addresses the full combination of problems clinops targets. The closest adjacent tools are:

    • pandas / scikit-learn — general-purpose; do not know about physiological bounds, patient-level splitting, ICD codes, or clinical unit conversions
    • MIMIC-Extract — extracts features from MIMIC-III specifically; not designed as a reusable library, not maintained for MIMIC-IV, no splitting or preprocessing abstractions
    • medcat / spaCy clinical models — NLP-focused; operate on clinical text rather than structured EHR tables
    • lifelines — survival analysis; does not address data ingestion, splitting, or preprocessing

clinops is the first library to package the specific set of clinical data engineering safeguards (patient-safe splitting, physiological outlier bounds, unit normalization, ICD harmonization) together in a scikit-learn-compatible API designed for reuse across projects.
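For readers unfamiliar with why a scikit-learn-style fit/transform contract matters here, the sketch below shows the pattern with a toy median imputer: statistics are learned from training data only and then reused on held-out data. `MedianImputer` is a hypothetical class written for this example, not a clinops or scikit-learn class.

```python
from statistics import median


class MedianImputer:
    """Toy fit/transform imputer (hypothetical; for illustration only)."""

    def fit(self, values):
        # Learn the fill value from observed training values only.
        observed = [v for v in values if v is not None]
        if not observed:
            raise ValueError("cannot fit on a column with no observed values")
        self.fill_value_ = median(observed)
        return self

    def transform(self, values):
        # Reuse the training statistic; never recompute it on new data.
        return [self.fill_value_ if v is None else v for v in values]


# Made-up heart-rate columns with missing entries.
train_hr = [72, None, 88, 95]
test_hr = [None, 140]

imputer = MedianImputer().fit(train_hr)
train_filled = imputer.transform(train_hr)
test_filled = imputer.transform(test_hr)
print(train_filled, test_filled)
```

Because the test column is filled with the training median, no information from held-out patients flows back into preprocessing.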

  • Any other questions or issues we should be aware of:
    JOSS opt-in: I would like to opt into the pyOpenSci–JOSS partnership. A paper.md following JOSS requirements is already in the repository. I previously submitted directly to JOSS (issue #10111), and the submission was rejected at pre-review solely for insufficient public development history (the repository was 2 days old at submission). I understand that pyOpenSci does not have this requirement, and that pyOpenSci acceptance routes to JOSS without a second review.

