This project successfully implements an end-to-end machine learning pipeline for automated chest X-ray disease detection, demonstrating the complete journey from data collection through model training to production deployment. The system analyzes 112,120 medical images across 14 disease classes, achieving clinically useful performance for multiple conditions while providing interpretable predictions through Grad-CAM visualizations.
- Complete ML Pipeline: Data collection → EDA → Preprocessing → Training → Evaluation → Deployment
- Production System: Live dashboard deployed at nihxrays.streamlit.app with automatic CI/CD
- Transfer Learning Success: Evaluated 3 architectures (ResNet50, DenseNet121, EfficientNetB3) on Google Colab Pro+ A100 GPU
- Best Model Performance: DenseNet121 achieved AUC 0.753 overall, with 4 diseases >0.7 AUC (clinically useful threshold)
- Professional Practices: MLflow experiment tracking, automated testing (1,363 LOC), 47 documentation guides, Jupytext notebook synchronization
- Platform Evolution: Successfully migrated from Kaggle (P100) to Colab Pro+ (A100) after encountering session limits, demonstrating adaptability
- Statistical Rigor: 5 hypothesis tests validated, patient-level data splits (no leakage), expert-validated test sets
- Initial Goal: AUC >0.8 across all diseases
- Achieved: DenseNet121 test AUC 0.753 (best model), average 0.606 across 14 diseases
Strong Performance (AUC >0.7):
- Effusion: 0.755
- Consolidation: 0.736
- Cardiomegaly: 0.714
- Atelectasis: 0.707
Challenges (AUC <0.5, i.e. worse than random ranking):
- Pneumonia: 0.372
- Hernia: 0.453 (only 47 test cases - severe class imbalance)
- Edema: 0.495
Honest Assessment: While the >0.8 goal wasn't achieved, the results demonstrate a working system with clinically relevant performance for common conditions. The challenges identified (class imbalance, spurious correlations in Grad-CAM) provide clear directions for improvement.
High-Priority Improvements (Based on Analysis):
- PyTorch Migration - Align with medical ML research standard, access MONAI and TorchXRayVision libraries
- Higher Resolution (448x448) - Preserve fine-grained medical image details (+5-10% AUC expected)
- Grad-CAM Investigation - Address model focus on non-lung regions through attention mechanisms and lung segmentation
- Class Imbalance Mitigation - Focal loss, SMOTE in embedding space, transfer learning from CheXpert
Research Extensions:
- Multi-modal fusion (X-ray + CT + EHR)
- External validation (CheXpert, MIMIC-CXR, PadChest)
- Vision Transformers and attention mechanisms
- Federated learning for multi-hospital deployment
This project demonstrates mastery of the complete ML engineering workflow, from data engineering challenges (47GB dataset, cloud GPU migration) to production deployment (Streamlit Cloud CI/CD). The honest assessment of model limitations, comprehensive documentation (47 guides), and clear improvement roadmap showcase professional data science practices suitable for real-world healthcare applications.
For Assessors: All 11 learning objectives met with documented evidence. See docs/LEARNING_OBJECTIVES_VERIFICATION.md for complete mapping.
This project leverages deep learning and computer vision techniques to detect and classify thoracic diseases from chest X-ray images. Using the NIH Chest X-Ray dataset, the analysis develops automated diagnostic support tools to assist radiologists and healthcare providers in identifying multiple pathological conditions, improving diagnostic accuracy and patient care efficiency.
- Dataset: NIH Chest X-Rays (112,000+ images from 30,000+ patients)
- Domain: Healthcare - Medical Imaging and Diagnostic Support
- Target Audience: Radiologists, healthcare administrators, medical AI researchers, hospital decision-makers
🚀 View Live Dashboard | 📂 GitHub Repository
- Executive Summary
- Project Overview
- Dataset Description
- Business Requirements
- Project Hypothesis
- Training Architecture Evolution
- Machine Learning Pipeline
- Model Training Results
- Dashboard Design
- Project Structure
- Technologies Used
- Installation and Setup
- Development Process
- Testing and Validation
- Deployment
- Future Enhancements
- Credits and Acknowledgments
- Platform: Kaggle / NIH Clinical Center
- Dataset: NIH Chest X-Ray Dataset
- Images: 112,120 frontal-view X-ray images
- Patients: 30,805 unique patients
- Resolution: 1024 x 1024 pixels
- Format: PNG grayscale images
The dataset includes 14 thoracic pathology labels plus "No Finding":
- Atelectasis: Partial lung collapse
- Cardiomegaly: Enlarged heart
- Effusion: Fluid around lungs
- Infiltration: Abnormal substances in lungs
- Mass: Abnormal tissue growth
- Nodule: Small rounded growth
- Pneumonia: Lung infection
- Pneumothorax: Collapsed lung
- Consolidation: Lung tissue solidification
- Edema: Fluid buildup in lungs
- Emphysema: Damaged air sacs
- Fibrosis: Lung scarring
- Pleural Thickening: Thickened lung lining
- Hernia: Organ displacement
- No Finding: Healthy/normal
- Multi-label: Images can have multiple disease labels (co-morbidities)
- Class Imbalance: "No Finding" (60,361 images) vs rare diseases
- Metadata: Patient age, gender, view position, image dimensions
- Clinical Annotations: Text-mined from radiological reports (90%+ accuracy)
Important: The original NIH labels were automatically extracted from radiological reports using NLP, which can introduce labeling noise. To improve model quality, this project incorporates enhanced expert labels provided by Google Cloud Healthcare:
- Source: Google Cloud Public Dataset - NIH Chest X-Ray Labels
- Quality: Expert radiologist annotations via adjudicated review process
- Coverage: Two label sets provided:
1. Four Findings Expert Labels (4,374 images)
- Publication: Majkowska et al., Radiology, 2019
- Paper: Chest Radiograph Interpretation with Deep Learning Models
- Findings: Airspace opacity, pneumothorax, nodule/mass, fracture
- Process: Adjudicated review by 3 radiologists from cohort of 11+ board-certified radiologists
- Sets: Validation (2,412 images) + Test (1,962 images)
2. All Findings Expert Labels (810 images)
- Publication: Nabulsi et al., Scientific Reports (Nature Portfolio), 2021
- Paper: Deep Learning for Distinguishing Normal vs Abnormal Chest Radiographs
- Findings: All 14 pathologies + normal/abnormal classification
- Process: Independent review by 5 board-certified radiologists, majority vote for ground truth
- Set: Test set only (PA views)
- Format: CSV files with image IDs and adjudicated labels per finding
- Provider: Google LLC / Google Health AI
- License: Same as NIH dataset (CC0: Public Domain)
Why Expert Labels Matter: The original NIH labels have known accuracy limitations (~90%) due to automated extraction. Expert-validated labels provide ground truth for:
- Training more accurate models on high-quality annotations
- Validating model performance against radiologist consensus
- Reducing false positive/negative rates in critical diagnoses
- Benchmarking against published results from Radiology and Scientific Reports
- Automated Disease Detection: Develop AI-powered models to accurately identify thoracic pathologies from chest X-rays
- Multi-Disease Classification: Build systems capable of detecting multiple co-existing conditions in single images
- Diagnostic Support Tool: Create an interactive application to assist radiologists in preliminary screening and triage
- Clinical Decision Support: Provide confidence scores and visual explanations to support clinical decision-making
- Healthcare Efficiency: Reduce diagnostic time and improve early detection rates for critical conditions
- Achieve model AUC-ROC >0.80 for primary disease categories
- Successfully detect multi-label cases (images with 2+ diseases)
- Deliver interpretable predictions with visual heatmaps (Grad-CAM)
- Create user-friendly dashboard for both technical (radiologists) and non-technical (administrators) audiences
- Demonstrate statistical significance in disease pattern analysis
- Address ethical considerations (bias, privacy, clinical validation)
- Early Detection: Identify subtle disease patterns humans might miss
- Triage Support: Prioritize urgent cases in high-volume settings
- Second Opinion: Provide supplementary diagnostic confirmation
- Resource Optimization: Allocate radiologist time to complex cases
- Rural Healthcare: Support under-resourced medical facilities with limited specialists
Statement: Patient age significantly correlates with the prevalence of specific thoracic diseases, with conditions like cardiomegaly and emphysema more common in older patients, while pneumonia shows a more uniform age distribution.
Validation Method:
- Age distribution analysis across disease categories
- Statistical significance testing (ANOVA, t-tests)
- Correlation coefficients between age and disease presence
- Visualization: Box plots and violin plots of age vs disease
Statement: Certain thoracic pathologies co-occur more frequently than random chance (e.g., effusion with cardiomegaly, infiltration with pneumonia), indicating underlying clinical relationships.
Validation Method:
- Co-occurrence matrix analysis
- Chi-square test for independence between disease pairs
- Association rule mining (support, confidence, lift metrics)
- Heatmap visualization of disease co-occurrence rates
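The association-rule metrics named above (support, confidence, lift) can be illustrated on toy data. This is a minimal sketch rather than the project's notebook code: the function name `association_metrics` and the sample label sets are hypothetical.

```python
def association_metrics(label_sets, a, b):
    """Return (support, confidence, lift) for the rule a -> b.

    label_sets: list of sets, one set of disease names per image.
    """
    n = len(label_sets)
    count_a = sum(1 for labels in label_sets if a in labels)
    count_b = sum(1 for labels in label_sets if b in labels)
    count_ab = sum(1 for labels in label_sets if a in labels and b in labels)

    support = count_ab / n                                  # P(a and b)
    confidence = count_ab / count_a if count_a else 0.0     # P(b | a)
    prob_b = count_b / n
    lift = confidence / prob_b if prob_b else 0.0           # P(b | a) / P(b)
    return support, confidence, lift


# Toy data (invented for illustration, not taken from the NIH dataset)
images = [
    {"Effusion", "Cardiomegaly"},
    {"Effusion", "Cardiomegaly"},
    {"Effusion"},
    {"Infiltration"},
    set(),  # an image labelled "No Finding"
    set(),
    set(),
    {"Cardiomegaly"},
]
support, confidence, lift = association_metrics(images, "Cardiomegaly", "Effusion")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the pair co-occurs more often than independence would predict, which is the pattern this hypothesis looks for.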
Statement: Deep learning models trained on the imbalanced dataset will show significantly better performance on common conditions (e.g., "No Finding", Infiltration) compared to rare diseases (e.g., Hernia, Pneumothorax) without specialized balancing techniques.
Validation Method:
- Per-class AUC-ROC and F1-score comparison
- Baseline model vs balanced model (SMOTE, class weights) performance
- Precision-recall curves for rare vs common diseases
- Statistical significance of performance differences
Statement: Pre-trained convolutional neural networks (ResNet, DenseNet, EfficientNet) fine-tuned on chest X-rays will significantly outperform models trained from scratch, achieving higher AUC-ROC scores with fewer training epochs.
Validation Method:
- Compare baseline CNN vs transfer learning models
- Training efficiency: epochs to convergence, training time
- Performance metrics: AUC-ROC, sensitivity, specificity per disease
- Feature visualization: t-SNE plots of learned representations
Statement: Certain thoracic diseases show statistically significant gender differences in prevalence rates within the dataset.
Validation Method:
- Stratified disease prevalence analysis by gender
- Chi-square tests for gender-disease associations
- Odds ratios with confidence intervals
- Visualization: Grouped bar charts of disease rates by gender
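As an illustration of the odds-ratio step, the sketch below computes an odds ratio with a Wald confidence interval from a 2x2 table. The function name `odds_ratio_ci` and the counts are made up for the example; the project's notebooks may use a different interval method.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

                 disease   no disease
    group 1         a          b
    group 2         c          d

    z=1.96 gives an approximate 95% interval.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper


# Toy counts (invented): 20 of 100 images positive in one group, 10 of 100 in the other
or_value, ci_low, ci_high = odds_ratio_ci(20, 80, 10, 90)
print(f"OR={or_value:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

An interval that contains 1 indicates that the difference between groups is not significant at the chosen level.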
This project evolved through multiple cloud GPU platforms to achieve optimal model training:
- Platform: Local MacBook M2 Pro
- Purpose: Data collection, EDA, preprocessing, hypothesis testing
- Tools: Jupyter notebooks, VS Code
- Outcomes: Data pipeline, statistical analysis, train/val/test splits
- Platform: Kaggle Notebooks (P100 GPU)
- Challenge: Session time limits (9-12 hours), internet connectivity issues
- Tools: `nbpush` CLI tool for automated notebook deployment
- Notebooks: Baseline models, CNN development, initial transfer learning attempts
- Innovation: Headless training with automated result download
- Platform: Google Colab Pro+ (A100 GPU, 40GB VRAM)
- Advantages: 24-hour sessions, faster training, better reliability
- Storage: Google Cloud Storage (GCS) for data and model artifacts
- Tools: GCS integration, OAuth authentication, MLflow tracking
- Success: Completed DenseNet121, ResNet50, EfficientNetB3 training
- Results: DenseNet121 achieved best AUC of 0.753
- Platform: Local environment
- Model: DenseNet121 (37MB, 7.6M parameters)
- Test Results: Average AUC 0.606 across 14 diseases
- Deployment: Streamlit Cloud (https://nihxrays.streamlit.app)
| Tool | Purpose | Notebooks |
|---|---|---|
| MLflow | Experiment tracking, model versioning | 07, 08 |
| Jupytext | Notebook/script synchronization (.ipynb ↔ .py) | All |
| Papermill | Parameterized notebook execution for testing | 03, 07 |
| nbpush | CLI tool for pushing notebooks to Kaggle/Colab | 05-07 |
| pytest + nbmake | Automated notebook testing in CI/CD | 02-04 |
- Download NIH Chest X-Ray dataset from Kaggle
- Load image metadata (patient ID, age, gender, disease labels)
- Explore dataset structure and label distribution
- Sample image visualization and quality assessment
- Analyze class imbalance and multi-label statistics
- Image Loading: Read PNG files, convert to arrays
- Resizing: Standardize to 224x224 pixels
- Normalization: Scale pixel values to [0,1] or standardize
- Label Encoding: Convert multi-label disease annotations to binary vectors
- Train/Validation/Test Split: 70/15/15 stratified split
- Data Augmentation (training only):
- Random rotation (±15 degrees)
- Horizontal flip (common for chest X-rays, although the anatomy is not truly symmetric: the heart sits left of midline)
- Brightness/contrast adjustment
- Zoom and crop variations
- Gaussian noise addition
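The label-encoding step above can be sketched as follows. It assumes the metadata stores findings as a pipe-separated string (for example `Cardiomegaly|Effusion`) and spells the class `Pleural_Thickening`; both are assumptions about the CSV, and `encode_labels` is a hypothetical helper.

```python
DISEASES = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
    "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
    "Emphysema", "Fibrosis", "Pleural_Thickening", "Hernia",
]


def encode_labels(finding_labels, classes=DISEASES, sep="|"):
    """Turn a label string such as 'Effusion|Cardiomegaly' into a 0/1 vector."""
    present = {name.strip() for name in finding_labels.split(sep)}
    return [1 if name in present else 0 for name in classes]


vector = encode_labels("Cardiomegaly|Effusion")   # two positions set to 1
no_finding = encode_labels("No Finding")          # all zeros
print(vector)
print(no_finding)
```

Encoding "No Finding" as the all-zero vector keeps the output at 14 units, one per pathology.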
- Metadata Analysis: Age distribution, gender ratio, view positions
- Label Distribution: Disease frequency analysis, class imbalance quantification
- Co-occurrence Analysis: Disease correlation heatmap
- Image Statistics: Pixel intensity distributions, contrast patterns
- Statistical Hypothesis Testing: Age-disease correlation, gender differences
- Visualization: Sample images per disease category
- Traditional Features (for baseline models):
- Histogram of Oriented Gradients (HOG)
- Edge detection features
- Texture descriptors (GLCM)
- Deep Learning Features:
- Pre-trained CNN feature extraction (ImageNet weights)
- Transfer learning embeddings (ResNet, DenseNet, EfficientNet)
- Metadata Features:
- Age bins (categorical)
- Gender encoding
- Image quality metrics
- Logistic Regression: On extracted HOG/texture features
- Random Forest: Multi-output classifier for multi-label prediction
- XGBoost: Gradient boosting with class weight balancing
5a. Custom CNN Architecture
- Convolutional layers with batch normalization
- Max pooling and dropout for regularization
- Dense layers for multi-label classification
- Sigmoid activation (multi-label output)
5b. Transfer Learning Models
- ResNet50: Deep residual network pre-trained on ImageNet
- DenseNet121: Densely connected architecture (commonly used for medical imaging)
- EfficientNetB3: Efficient scaling of CNN architecture
- Fine-tuning strategy: Freeze early layers, train final layers
5c. Ensemble Methods
- Weighted average of multiple model predictions
- Stacking classifier combining CNN outputs
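A weighted average of model outputs could look like the sketch below. The probabilities and weights are invented, and `weighted_average` is a hypothetical helper rather than code from this repository.

```python
def weighted_average(predictions, weights):
    """Combine per-model probability vectors with a weighted mean.

    predictions: list of probability vectors, one per model, same length.
    weights: one non-negative weight per model (normalised internally).
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(predictions[0])
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_classes)
    ]


# Hypothetical probabilities for three classes from two models
densenet_probs = [0.80, 0.10, 0.30]
resnet_probs = [0.60, 0.30, 0.10]
ensemble = weighted_average([densenet_probs, resnet_probs], weights=[3, 1])
print(ensemble)
```

One option is to set the weights in proportion to each model's validation AUC, so that stronger models contribute more.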
- Binary Relevance: Separate binary classifier per disease
- Classifier Chains: Sequential classifiers capturing label dependencies
- Problem Transformation: Multi-label to multi-class conversion
- Class Weights: Assign higher weights to rare diseases
- Focal Loss: Focus on hard-to-classify examples
- Oversampling: SMOTE for minority classes (traditional ML)
- Undersampling: Reduce majority class samples
- Threshold Optimization: Adjust decision thresholds per class
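To make the focal loss item concrete, here is a framework-free sketch of the binary form applied to one label column. The defaults `gamma=2.0` and `alpha=0.25` are commonly used values, not settings confirmed for this project.

```python
import math


def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over label/probability pairs.

    FL = -alpha_t * (1 - p_t) ** gamma * log(p_t), where p_t is the
    probability the model assigned to the true label.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)


# A confident, correct prediction contributes far less than a badly wrong one
easy = binary_focal_loss([1], [0.95])
hard = binary_focal_loss([1], [0.10])
print(f"easy={easy:.5f} hard={hard:.5f}")
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted binary cross-entropy.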
- AUC-ROC: Area under ROC curve (primary metric)
- AUC-PR: Precision-recall AUC (for imbalanced classes)
- Sensitivity/Recall: True positive rate (critical for medical screening)
- Specificity: True negative rate
- F1-Score: Harmonic mean of precision and recall
- Hamming Loss: Multi-label classification error
- Subset Accuracy: Exact match of all labels
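Two of the metrics above are easy to state in code. The sketch below computes AUC-ROC by pairwise comparison and Hamming loss on toy inputs. It is meant to show the definitions only; for real evaluation a library routine such as scikit-learn's `roc_auc_score` is the usual choice.

```python
def roc_auc(y_true, scores):
    """AUC-ROC as the probability that a random positive outranks a random
    negative (ties count as one half)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots predicted incorrectly."""
    wrong = sum(t != p for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    return wrong / (len(y_true) * len(y_true[0]))


labels = [0, 0, 1, 1]
auc = roc_auc(labels, [0.1, 0.4, 0.35, 0.8])
inverted = roc_auc(labels, [0.9, 0.6, 0.65, 0.2])
print(auc, inverted)
```

An AUC below 0.5 means the scores rank negatives above positives more often than not, so such a result points to an inverted or uninformative signal rather than a merely weak one.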
- K-Fold Cross-Validation: 5-fold stratified CV
- Patient-Level Split: Ensure no patient data leakage between sets
- Temporal Validation: If timestamp data available
- Per-disease confusion matrices
- Multi-label confusion visualization
- Grad-CAM (Gradient-weighted Class Activation Mapping): Visual heatmaps showing which image regions influence predictions
- Saliency Maps: Highlight important pixels for classification
- Feature Importance: For traditional ML models
- Error Analysis: Study false positives and false negatives
- Clinical Validation: Compare predictions with radiologist annotations
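The core of Grad-CAM is a small calculation: average each channel's gradient to get a weight, take the weighted sum of the feature maps, and apply ReLU. The sketch below shows that step on plain nested lists. It assumes the activations and gradients have already been extracted from the network, and `grad_cam` is a hypothetical helper rather than the dashboard's implementation.

```python
def grad_cam(activations, gradients):
    """Combine feature maps into a Grad-CAM heatmap.

    activations, gradients: nested lists shaped [channels][height][width],
    holding the last conv layer's outputs and the gradient of the class
    score with respect to them.
    """
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global average of each channel's gradient map
    weights = [
        sum(sum(row) for row in grad_map) / (height * width)
        for grad_map in gradients
    ]

    # Weighted sum of feature maps followed by ReLU
    heatmap = [[0.0] * width for _ in range(height)]
    for weight, feature_map in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * feature_map[i][j]
    heatmap = [[max(value, 0.0) for value in row] for row in heatmap]

    # Scale to [0, 1] for overlaying on the X-ray
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[value / peak for value in row] for row in heatmap]
    return heatmap


# Two 2x2 channels with made-up values
acts = [[[1.0, 2.0], [0.0, 1.0]], [[0.0, 1.0], [3.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
cam = grad_cam(acts, grads)
print(cam)
```

In practice the low-resolution map is upsampled to the input size before it is overlaid on the image.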
- Hyperparameter Tuning: Learning rate, batch size, dropout rate
- Architecture Search: Layer depth, filter sizes
- Regularization: L2 weight decay, dropout tuning
- Early Stopping: Prevent overfitting using validation loss
- Learning Rate Scheduling: Reduce LR on plateau
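Early stopping can be sketched without any framework. The class below is a hypothetical illustration with invented loss values. Keras provides `EarlyStopping` and `ReduceLROnPlateau` callbacks for the same purpose, and the training notebooks may simply rely on those.

```python
class EarlyStopping:
    """Signal a stop once validation loss stops improving."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: improvement stalls after epoch 2
val_losses = [0.30, 0.25, 0.22, 0.23, 0.24, 0.22, 0.26]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break
print(f"stopped at epoch {stopped_at}, best epoch {stopper.best_epoch}")
```

Restoring the weights from `best_epoch` after stopping avoids keeping a model that has already started to overfit.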
Three pre-trained architectures were evaluated on the NIH Chest X-Ray dataset:
| Model | Parameters | Test AUC | Test Loss | Test Accuracy | Training Platform |
|---|---|---|---|---|---|
| DenseNet121 | 7.6M | 0.753 | 0.174 | 18.0% | Google Colab Pro+ A100 |
| ResNet50 | 23.6M | 0.681 | 0.199 | 23.6% | Google Colab Pro+ A100 |
| EfficientNetB3 | 10.7M | 0.535 | 0.199 | 10.1% | Google Colab Pro+ A100 |
Winner: DenseNet121 - Best balance of performance and model size
Evaluation on 16,890 test images across 14 disease classes:
| Disease | AUC-ROC | Positive Cases | Performance |
|---|---|---|---|
| Effusion | 0.755 | 2,064 (12.2%) | Excellent |
| Consolidation | 0.736 | 1,015 (6.0%) | Good |
| Cardiomegaly | 0.714 | 505 (3.0%) | Good |
| Atelectasis | 0.707 | 1,698 (10.1%) | Good |
| Fibrosis | 0.685 | 251 (1.5%) | Moderate |
| Infiltration | 0.655 | 3,179 (18.8%) | Moderate |
| Pneumothorax | 0.619 | 794 (4.7%) | Moderate |
| Mass | 0.611 | 899 (5.3%) | Moderate |
| Nodule | 0.598 | 979 (5.8%) | Fair |
| Pleural Thickening | 0.567 | 468 (2.8%) | Fair |
| Emphysema | 0.518 | 361 (2.1%) | Fair |
| Edema | 0.495 | 626 (3.7%) | Poor |
| Hernia | 0.453 | 47 (0.3%) | Poor |
| Pneumonia | 0.372 | 495 (2.9%) | Poor |
Average AUC: 0.606 across all diseases
- Best Performance: Effusion (0.755), Consolidation (0.736), Cardiomegaly (0.714)
- Poor Performance: Pneumonia (0.372), Hernia (0.453), Edema (0.495)
- Class Imbalance Impact: Rare diseases (Hernia: 47 cases) show lower performance
- Medical Significance: AUC >0.7 considered clinically useful for 4 diseases
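The summary figures can be reproduced directly from the per-disease table. The short script below uses only the AUC values reported above and computes the unweighted (macro) average along with the two threshold groups.

```python
# Per-disease test AUCs copied from the table above
test_auc = {
    "Effusion": 0.755, "Consolidation": 0.736, "Cardiomegaly": 0.714,
    "Atelectasis": 0.707, "Fibrosis": 0.685, "Infiltration": 0.655,
    "Pneumothorax": 0.619, "Mass": 0.611, "Nodule": 0.598,
    "Pleural Thickening": 0.567, "Emphysema": 0.518, "Edema": 0.495,
    "Hernia": 0.453, "Pneumonia": 0.372,
}

macro_auc = sum(test_auc.values()) / len(test_auc)
useful = sorted(d for d, auc in test_auc.items() if auc > 0.7)
below_chance = sorted(d for d, auc in test_auc.items() if auc < 0.5)

print(f"Macro-average AUC: {macro_auc:.3f}")
print("AUC > 0.7:", ", ".join(useful))
print("AUC < 0.5:", ", ".join(below_chance))
```

Because the average is unweighted, a rare class such as Hernia (47 cases) counts as much as Infiltration (3,179 cases).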
Model attention analysis revealed:
- ✅ Positive: Model focuses on lung regions for most diseases
- ⚠️ Concern: Some predictions focus on areas outside lungs (mediastinum, diaphragm)
- 📌 Implication: Suggests model may be learning spurious correlations
- 🔬 Action Needed: Further investigation and potential architectural improvements
- Healthcare Challenge: Current radiologist workload and diagnostic accuracy
- AI Solution: Automated screening and triage support
- Dataset Overview: NIH Chest X-Ray statistics (112K images, 15 conditions)
- Key Findings Summary: Model performance highlights, clinical insights
- Navigation Guide: Dashboard structure and user instructions
- Ethical Disclaimer: Tool is for research/educational purposes, not clinical use
- Patient Demographics:
- Age distribution histogram with disease overlays
- Gender distribution pie chart
- Interactive filters by age groups and gender
- Disease Distribution:
- Bar chart: Disease frequency (logarithmic scale for imbalance)
- Multi-label statistics: Co-occurrence heatmap
- Pie chart: Single vs multi-disease cases
- Sample Image Gallery:
- Grid display: One example per disease category
- Image viewer with zoom capability
- Healthy vs diseased comparison
- Statistical Summaries:
- Dataset size, patient count, image resolution
- Class imbalance metrics (Gini coefficient, imbalance ratio)
- Hypothesis 1 - Age Correlation:
- Box plots: Age distribution per disease
- Statistical test results (ANOVA p-values)
- Interpretation: Age-specific disease patterns
- Hypothesis 2 - Disease Co-occurrence:
- Interactive heatmap: Disease pair correlations
- Association rules table (support, confidence, lift)
- Clinical significance of findings
- Hypothesis 3 - Class Imbalance:
- Performance comparison: Balanced vs imbalanced models
- Per-class F1-score visualization
- Impact analysis
- Hypothesis 4 - Transfer Learning:
- Training curves: Accuracy and loss over epochs
- Model comparison table (AUC-ROC scores)
- Training time efficiency chart
- Hypothesis 5 - Gender Differences:
- Grouped bar charts: Disease prevalence by gender
- Chi-square test results
- Odds ratios with confidence intervals
- Overall Performance Metrics:
- Model comparison table (Logistic, Random Forest, CNN, Transfer Learning)
- Best model highlight with key metrics
- Per-Disease Performance:
- Interactive table: AUC-ROC, Sensitivity, Specificity, F1 per disease
- Sort and filter capabilities
- ROC Curves:
- Per-disease ROC plot (one curve for each of the 14 diseases)
- Interactive legend to toggle disease curves
- Precision-Recall Curves:
- Especially important for imbalanced classes
- Confusion Matrices:
- Dropdown selector for disease category
- Heatmap visualization
- Training History:
- Loss and accuracy curves (train vs validation)
- Early stopping indicator
- Image Upload Interface:
- Drag-and-drop or file browser
- Image preview display
- Prediction Results:
- Top 5 predicted diseases with confidence scores
- Probability bars for all 15 conditions
- Multi-label predictions highlighted
- Visual Explanation:
- Grad-CAM heatmap overlay on X-ray
- Regions of interest highlighted
- Toggle original vs heatmap view
- Clinical Context:
- Brief description of detected conditions
- Typical symptoms and severity indicators
- Disclaimer: Prominent note about non-clinical use
- Key Disease Patterns:
- Most common conditions
- Frequently co-occurring diseases
- Age and gender risk factors
- Model Strengths and Limitations:
- Diseases with highest accuracy
- Challenges with rare conditions
- Error analysis: Common misclassifications
- Clinical Applications:
- Triage workflow integration
- Second-opinion support
- Rural/under-resourced healthcare settings
- Future Improvements:
- Larger dataset requirements
- External validation needs
- Integration with PACS systems
- Ethical Considerations:
- Bias in dataset (population representation)
- Privacy and HIPAA compliance
- Human-in-the-loop necessity
- Regulatory approval requirements (FDA, CE marking)
- Medical-Grade Interface: Clean, professional, clinical aesthetic
- Color Scheme: Healthcare-appropriate (blues, whites, minimal red for alerts)
- Accessibility: WCAG 2.1 AA compliance, screen reader support
- Responsive Design: Desktop focus (radiologist workstations), mobile-friendly
- Clear Labels: Medical terminology with tooltips for explanations
- Performance: Fast loading for large images (lazy loading, caching)
- Privacy: No data retention, local processing only
CapStone/
│
├── .venv/ # Virtual environment (not tracked by Git)
├── data/
│ ├── raw/ # Original dataset from Kaggle
│ └── processed/ # Cleaned and transformed data
│
├── jupyter_notebooks/
│ ├── 01_data_collection.ipynb # Data import and initial exploration
│ ├── 02_data_cleaning.ipynb # Data cleaning and preprocessing
│ ├── 03_eda.ipynb # Exploratory data analysis
│ ├── 04_feature_engineering.ipynb # Feature creation and selection
│ ├── 05_modeling.ipynb # Model training and evaluation
│ └── 06_model_evaluation.ipynb # Final model assessment
│
├── src/
│ ├── data/
│ │ └── data_loader.py # Data loading utilities
│ ├── preprocessing/
│ │ ├── cleaning.py # Data cleaning functions
│ │ └── feature_engineering.py # Feature engineering functions
│ ├── modeling/
│ │ ├── train.py # Model training scripts
│ │ └── evaluate.py # Model evaluation scripts
│ └── visualization/
│ └── plots.py # Plotting functions
│
├── app/
│ ├── streamlit_app.py # Main Streamlit dashboard
│ └── pages/
│ ├── 1_summary.py # Project summary page
│ ├── 2_exploration.py # Data exploration page
│ ├── 3_hypothesis.py # Hypothesis validation page
│ ├── 4_prediction.py # Disease prediction page
│ └── 5_insights.py # Business insights page
│
├── kaggle/
│ ├── kernels/ # Kaggle kernel configurations
│ ├── datasets/ # Kaggle dataset uploads
│ ├── results/ # Downloaded Kaggle outputs
│ ├── scripts/ # Kaggle-specific scripts
│ ├── config/ # Test parameters and configs
│ └── legacy/ # Archived legacy files
│
├── colab/
│ └── *.ipynb # Google Colab notebooks
│
├── models/
│ └── saved_models/ # Trained model artifacts
│
├── docs/
│ ├── README.md # **📚 Complete Documentation Index (34 guides)**
│ ├── Assessment_Handbook.md # Project requirements
│ ├── PLATFORM_ORGANIZATION.md # Directory structure & platform separation
│ ├── NBPUSH_CLI.md # CLI tool for pushing notebooks to cloud GPUs
│ ├── KAGGLE_GUIDE.md # Complete Kaggle workflow
│ ├── COLAB_GUIDE.md # Complete Colab workflow
│ ├── MLFLOW_QUICKSTART.md # Experiment tracking quick start
│ └── ... (31 more guides - see docs/README.md for full index)
│
├── tests/
│ └── test_data_processing.py # Unit tests
│
├── .gitignore
├── .python-version
├── requirements.txt
├── Makefile # Automation scripts
├── README.md
└── LICENSE
- Python 3.12.8: Primary language for data analysis and application development
- TensorFlow 2.x: Deep learning framework for CNN development
- Keras: High-level neural network API (integrated with TensorFlow)
- OpenCV (cv2): Image loading, preprocessing, and manipulation
- Pillow (PIL): Image file handling
- scikit-image: Image processing algorithms
- albumentations: Advanced image augmentation library
- scikit-learn: Traditional ML algorithms (Logistic Regression, Random Forest) and evaluation metrics
- xgboost: Gradient boosting framework for baseline models
- imbalanced-learn: Resampling for class imbalance (SMOTE, over- and under-sampling)
- scipy: Statistical functions and hypothesis testing
- pandas: Metadata manipulation and label management
- numpy: Numerical computing and array operations
- matplotlib: Static plotting library (training curves, distributions)
- seaborn: Statistical data visualization (heatmaps, box plots)
- plotly: Interactive visualizations for dashboard
- tf-keras-vis: Grad-CAM and visualization for TensorFlow/Keras
- keras-gradcam: Alternative Grad-CAM implementation
- shap: Model explainability (for traditional ML models)
- Streamlit: Interactive web dashboard framework
- streamlit-extras: Additional Streamlit components
- streamlit-drawable-canvas: Image annotation (if needed)
- jupyter: Interactive notebook environment
- nbstripout: Strip output from Jupyter notebooks for version control
- nbdime: Diff and merge for notebooks
- pytest: Unit testing framework
- black: Code formatting
- flake8: Code linting
- Git: Version control system
- GitHub: Repository hosting and collaboration
- GitHub Actions: CI/CD automation
- Python 3.9+ (3.12.8 recommended)
- Git installed on your machine
- Make utility (see installation instructions below)
- VS Code (recommended) or other IDE
- 50+ GB disk space for the full dataset (the download is about 47GB)
git clone <your-repository-url>
cd CapStone
macOS:
# Install Xcode Command Line Tools (includes make)
xcode-select --install
# Or via Homebrew
brew install make
Linux (Ubuntu/Debian):
sudo apt-get update
sudo apt-get install build-essential
Windows:
# Option 1: Install via Chocolatey (recommended)
choco install make
# Option 2: Install via winget
winget install GnuWin32.Make
# Option 3: Use WSL (Windows Subsystem for Linux)
# Then follow Linux instructions above
Verify make installation:
make --version
# Should show: GNU Make 4.x or similar
In VS Code:
- Open Command Palette (Ctrl+Shift+P or Cmd+Shift+P)
- Type "Python: Create Environment"
- Select "Venv"
- Choose Python 3.12.8 (or 3.9+)
- Do NOT select requirements.txt yet
Or via terminal:
python -m venv .venv
Windows:
.venv\Scripts\activate
Mac/Linux:
source .venv/bin/activate
# Install all dependencies
make install
# OR manually:
pip install --upgrade pip
pip install -r requirements.txt
pip install -e . # Install project in editable mode
Why `pip install -e .`?
- Installs the `src/` directory as a Python package
- Enables clean imports: `from preprocessing import ...`
- No need for `sys.path` manipulation in notebooks
- Required for VS Code Pylance to recognize custom modules
# Copy the example environment file
cp .env.example .env
# On Windows:
copy .env.example .env
What's in .env?
# Adds src/ to Python path for VS Code Pylance
PYTHONPATH=${PYTHONPATH}:${workspaceFolder}/src
Quick verification:
python --version
jupyter --version
streamlit --version
make --version
Comprehensive diagnostic:
make check-pylance
This will verify:
- ✓ Python version and executable
- ✓ Source directory exists
- ✓ Python path includes src/
- ✓ Preprocessing module imports successfully
- ✓ Package is installed
- ✓ All configuration files exist
Expected output:
✅ SUCCESS: preprocessing module imported successfully!
✅ Package 'chest-xray-detection' is installed
✓ All configuration files exist
View available commands:
make help
Output:
Usage:
make install - install dev tools and project deps
make lint - run Ruff lint only
make format - run Black format only
make typecheck - run Pyright type checker
make check-pylance - diagnose Pylance/import issues
make pre-commit - run lint and format
make app - run Streamlit dashboard
make clean - remove caches and temp files
Run pre-commit checks (before committing code):
make pre-commit
Check for type errors:
make typecheck
jupyter notebook
# Or use VS Code's built-in notebook support (recommended)
# File → Open File → jupyter_notebooks/01_data_collection_and_setup.ipynb
make app
# OR manually:
streamlit run app/streamlit_app.py
Issue: VS Code shows "Cannot find module 'preprocessing'" in notebooks
Solution:
# 1. Ensure package is installed in editable mode
pip install -e .
# 2. Reload VS Code window
# Press Cmd+Shift+P → "Developer: Reload Window"
# 3. Run diagnostic
make check-pylance
# 4. See full guide
cat docs/PYLANCE_SETUP.md
Issue: `make: command not found`
Solution: Follow Step 2 to install make for your operating system
Issue: Import errors when running notebooks
Solution:
# Ensure you're in the correct directory
pwd # Should show: .../CapStone
# Activate virtual environment
source .venv/bin/activate # Mac/Linux
.venv\Scripts\activate # Windows
# Reinstall package
pip install -e .
Issue: Jupyter kernel not found
Solution:
# Install ipykernel
pip install ipykernel
# Add environment to Jupyter
python -m ipykernel install --user --name=capstone --display-name="Python (Capstone)"
# In Jupyter/VS Code: Select kernel → Python (Capstone)
For more detailed troubleshooting, see:
- Import issues: `docs/PYLANCE_SETUP.md`
- Data download issues: Check Notebook 01
- Environment setup: `docs/`
- Download dataset from Kaggle
- Initial data inspection and quality assessment
- Document dataset characteristics
- Set up project structure and version control
- Handle missing values
- Remove duplicates and irrelevant features
- Outlier detection and treatment
- Create data quality report
- Univariate, bivariate, and multivariate analysis
- Statistical hypothesis testing
- Initial insights documentation
- Visualization development
- Create derived features
- Encoding categorical variables
- Feature scaling and normalization
- Feature selection and dimensionality reduction
- Baseline model creation (Logistic Regression)
- Advanced models (Random Forest, XGBoost)
- Custom Convolutional Neural Network
- Transfer Learning using open source models
- Model comparison and selection
- K-fold cross-validation
- Hyperparameter tuning
- Final model selection
- Streamlit app structure creation
- Page-by-page implementation
- Interactive visualization integration
- UX/UI refinement
- Unit testing for data processing functions
- Integration testing for dashboard
- README and code documentation completion
- Peer review and feedback incorporation
- Final testing in production environment
- Documentation review
- Project submission preparation
All Jupyter notebooks are automatically tested using pytest, nbmake, and Jupytext to ensure:
- ✅ Notebooks execute without errors (via `nbmake`)
- ✅ Deterministic results (fixed random seeds)
- ✅ No broken imports or dependencies
- ✅ Consistent execution in CI/CD
- ✅ Jupytext flat file (`.py`) synchronization
- ✅ Cross-platform compatibility (Kaggle, Colab, local)
Run tests locally:
# Run fast tests (default - recommended)
make test
# Run only fast notebooks (2 & 4)
make test-fast
# Run ALL notebooks including slow ones
make test-all
# Prepare notebooks for testing (sync Jupytext .py files)
make test-notebooks
Jupytext Integration:
- Notebooks maintained as both `.ipynb` (outputs) and `.py` (version control)
- `.py` files auto-sync with `.ipynb` on save
- Tests run against both formats to ensure consistency
- Enables code review and diffs in Git
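As a rough illustration of what a pairing check between the two formats could look like, the sketch below compares the code cells of a notebook with those of its flat file. It assumes the .py files use Jupytext's percent format (cells delimited by `# %%` lines), which this README does not state; the function names are hypothetical and are not part of the project or of Jupytext. Cells containing IPython magics would need extra handling, because Jupytext comments them out in the flat file.

```python
import json
from pathlib import Path


def code_cells_from_ipynb(path):
    """Return the stripped source of every code cell in an .ipynb file."""
    notebook = json.loads(Path(path).read_text(encoding="utf-8"))
    cells = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat stores source as a list of lines
            source = "".join(source)
        cells.append(source.strip())
    return cells


def code_cells_from_percent_script(path):
    """Return the stripped source of every code cell in a percent-format .py file."""
    cells, current, kind = [], [], None
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.startswith("# %%"):
            if kind == "code":
                cells.append("\n".join(current).strip())
            is_text_cell = "[markdown]" in line or "[raw]" in line
            kind = "text" if is_text_cell else "code"
            current = []
        elif kind is not None:  # lines before the first marker are the header
            current.append(line)
    if kind == "code":
        cells.append("\n".join(current).strip())
    return cells


def notebooks_in_sync(ipynb_path, py_path):
    """True when both files contain the same sequence of code cells."""
    return code_cells_from_ipynb(ipynb_path) == code_cells_from_percent_script(py_path)
```

A check like this only compares code; outputs and markdown cells are ignored on purpose, since outputs live only in the .ipynb file.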
Test categories:
- Fast tests: Notebooks 02 (EDA) and 04 (Hypothesis Testing)
- Slow tests: Notebook 03 (Image Preprocessing)
- Skipped in CI: Notebook 01 (Data Download - 47GB)
CI/CD Integration:
- Notebooks automatically tested on every push/PR
- Tests run in parallel across Python 3.9-3.12
- Slow tests only run on main branch
- Fast tests complete in ~5 minutes
Configuration:
- pytest.ini: Test configuration and markers
- .github/workflows/notebook-tests.yml: CI/CD workflow
- scripts/prepare_notebooks_for_testing.py: Notebook preparation
- Missing value checks
- Duplicate detection
- Data type validation
- Range and constraint validation
- Cross-Validation: 5-fold stratified k-fold for classification
- Train-Test Split: 70/15/15 split (train/val/test) with patient-level stratification
- Baseline Comparison: Compare against naive baselines
- Overfitting Check: Compare train vs validation metrics
- Expert Label Validation: Test on Google Cloud expert-validated labels
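The patient-level split mentioned above can be sketched as follows. This is a minimal NumPy illustration, not the project's actual splitting code: the function name and the `patient_ids` input are assumptions, and the sketch keeps each patient in exactly one partition but does not stratify by disease label.

```python
import numpy as np


def patient_level_split(patient_ids, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split image indices into train/val/test so that no patient spans two sets.

    patient_ids: one entry per image, giving the patient that image belongs to.
    Returns a dict mapping split name to an array of image indices.
    """
    patient_ids = np.asarray(patient_ids)
    unique_patients = np.unique(patient_ids)

    # Shuffle patients (not images) so the split is made at the patient level.
    rng = np.random.default_rng(seed)
    rng.shuffle(unique_patients)

    n_train = int(round(fractions[0] * len(unique_patients)))
    n_val = int(round(fractions[1] * len(unique_patients)))
    patients_by_split = {
        "train": unique_patients[:n_train],
        "val": unique_patients[n_train:n_train + n_val],
        "test": unique_patients[n_train + n_val:],
    }

    # Map each patient group back to the indices of that group's images.
    return {
        name: np.flatnonzero(np.isin(patient_ids, ids))
        for name, ids in patients_by_split.items()
    }


if __name__ == "__main__":
    # Toy example: 10 patients with 3 images each.
    ids = np.repeat(np.arange(10), 3)
    for name, idx in patient_level_split(ids).items():
        print(name, sorted(set(ids[idx].tolist())))
```

Because the fractions apply to patients rather than images, the image-level proportions will only approximate 70/15/15 when patients contribute different numbers of images.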
Located in tests/ directory - Specialized tests that validate notebook execution with dummy data:
# Run notebook-specific validation tests
pytest tests/
# Test specific notebook (e.g., notebook 08)
python tests/test_notebook_08.py
Implemented Tests:
- ✅ test_notebook_06.py - CNN development validation (320 lines)
- ✅ test_notebook_07.py - Transfer learning validation (414 lines)
- ✅ test_notebook_08.py - Model evaluation validation (303 lines)
- ✅ test_colab_notebook.py - Colab notebook compatibility (326 lines)
- ✅ Total: 47 test functions, 1,363 lines of test code
What These Tests Validate:
- Notebook structure and cell order
- Import statements and dependencies
- Model file paths and loading
- Data preprocessing pipelines
- Inference with dummy/sample data
- Output generation and validation
- Jupytext .py ↔ .ipynb synchronization
- Page load functionality
- Interactive component responsiveness
- Data filter functionality
- Visualization rendering
- Error handling
# Run all pre-commit checks
make pre-commit
# Format code
make format # or: black .
# Lint code
make lint # or: ruff check .
# Type checking
make typecheck # or: pyright src/
# Check Pylance configuration
make check-pylance
What's Tested:
- ✅ Notebook execution (all cells run top-to-bottom via nbmake)
- ✅ Import statements (no missing dependencies)
- ✅ Data loading and path resolution
- ✅ Preprocessing pipeline with dummy data
- ✅ Statistical analysis reproducibility
- ✅ Visualization generation
- ✅ Model loading and inference (notebooks 06-08)
- ✅ Jupytext synchronization (.ipynb ↔ .py flat files)
- ✅ Cross-platform compatibility (Kaggle, Colab, local)
Test Types:
- Integration Tests: Full notebook execution with pytest-nbmake (notebooks 02, 04)
- Validation Tests: Specialized tests with mock/dummy data (notebooks 06-08, Colab)
- CI/CD Tests: Automated on every push/PR via GitHub Actions
What's NOT Tested in CI:
- ❌ Large data downloads (Notebook 01 - 47GB dataset)
- ❌ Full image preprocessing (Notebook 03 - marked as slow)
- ❌ Deep learning model training (7+ hour GPU jobs)
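All notebooks include deterministic random seeds:
import os
import random
import numpy as np
import tensorflow as tf  # only needed in notebooks that use TensorFlow
RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
# PYTHONHASHSEED is read at interpreter startup, so setting it here only
# affects subprocesses launched afterwards, not the current process.
os.environ['PYTHONHASHSEED'] = str(RANDOM_SEED)
# TensorFlow (if used)
tf.random.set_seed(RANDOM_SEED)
This ensures consistent results across test runs.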
The dashboard is deployed on Streamlit Community Cloud (Free Tier) with automatic continuous deployment.
Live Dashboard: https://nihxrays.streamlit.app
- Platform: Streamlit Community Cloud (free tier)
- Deployment: Automatic on every git push to main branch
- Repository: Connected directly to GitHub repository
- Build: Streamlit Cloud automatically installs dependencies from requirements.txt
- Configuration: Settings in .streamlit/config.toml and secrets.toml.example
- app.py - Main Streamlit application entry point
- requirements.txt - Python dependencies (Streamlit auto-installs)
- .streamlit/config.toml - Dashboard theme and settings
- .gitignore - Excludes large model files and data (models loaded from GitHub releases or cloud storage)
Note: Large model files (>100MB) are excluded from Git and loaded at runtime from alternative storage or smaller quantized versions for demo purposes.
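One way the runtime loading described in the note could work is a download-once-then-cache helper, sketched below with the standard library only. The function name, the URL, and the cache path are all hypothetical; the README does not specify the storage location or the mechanism the dashboard actually uses.

```python
import shutil
import urllib.request
from pathlib import Path


def ensure_model_file(url, dest):
    """Download a model file to `dest` unless it is already cached there.

    The download is written to a temporary ".part" file first and renamed at
    the end, so an interrupted download never leaves a truncated model behind.
    """
    dest = Path(dest)
    if dest.exists():
        return dest  # already cached from an earlier run

    dest.parent.mkdir(parents=True, exist_ok=True)
    partial = dest.with_suffix(dest.suffix + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(dest)
    return dest


# Hypothetical usage; the URL below is a placeholder, not a real release asset:
# model_path = ensure_model_file(
#     "https://example.com/releases/densenet121.keras",
#     "models/densenet121.keras",
# )
```

In a Streamlit app, the function that loads the model from this path would typically be wrapped in `st.cache_resource` so the model is loaded once per server process rather than on every rerun.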
- Migration to PyTorch ⭐
  - Rationale: PyTorch is dominant in medical ML research and clinical deployments
  - Benefits: Better community support, more medical imaging libraries (MONAI, TorchXRayVision)
  - Effort: Medium (re-implement training pipeline, model architectures remain similar)
  - Impact: Improved maintainability, easier integration with state-of-the-art methods
- Higher Resolution Training (448x448) ⭐
  - Current: 224x224 images (ImageNet standard)
  - Proposed: 448x448 or 512x512 images
  - Rationale: Medical images contain fine-grained details lost at lower resolutions
  - Expected Improvement: +5-10% AUC, especially for nodules, masses, pneumothorax
  - Challenge: 4x memory usage, longer training times
  - Solution: Gradient accumulation, mixed-precision training (FP16)
- Grad-CAM Investigation & Improvement 🔬
  - Issue Identified: Model sometimes focuses on non-lung regions (mediastinum, diaphragm, image borders)
  - Hypotheses:
    - Spurious correlations (e.g., cardiomegaly correlated with heart silhouette position)
    - Dataset bias (certain diseases more common with specific image characteristics)
    - Insufficient lung segmentation during preprocessing
  - Proposed Solutions:
    - Add lung segmentation masks to focus model attention
    - Implement spatial attention mechanisms
    - Use Guided Grad-CAM or Integrated Gradients for better localization
    - Create synthetic negative examples to reduce spurious correlations
- Class Imbalance Mitigation
  - Current Issue: Hernia (47 cases), Pneumonia (495 cases) have AUC <0.5
  - Strategies:
    - Focal loss with disease-specific gamma parameters
    - SMOTE-like oversampling in embedding space
    - Class-balanced loss weighting
    - Transfer learning from CheXpert dataset (larger, more balanced)
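To make the first class-imbalance strategy concrete, here is a minimal NumPy sketch of a multi-label focal loss with one gamma per disease. It illustrates the idea only: the function name and the gamma values in the usage example are assumptions, and a training version would be written in the framework's tensor operations rather than NumPy.

```python
import numpy as np


def multilabel_focal_loss(y_true, y_prob, gamma, eps=1e-7):
    """Per-class focal loss for multi-label classification.

    y_true: array of shape (n_samples, n_classes) with 0/1 labels.
    y_prob: predicted probabilities with the same shape.
    gamma:  array of shape (n_classes,), the focusing parameter per class.
            gamma = 0 reduces to ordinary binary cross-entropy.
    Returns an array of shape (n_classes,) with the mean loss per class.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    gamma = np.asarray(gamma, dtype=float)  # broadcasts across samples

    # p_t is the probability the model assigned to the correct label.
    p_t = np.where(y_true == 1, y_prob, 1.0 - y_prob)

    # (1 - p_t)^gamma down-weights examples the model already gets right.
    loss = -((1.0 - p_t) ** gamma) * np.log(p_t)
    return loss.mean(axis=0)


if __name__ == "__main__":
    # Two classes, three samples; the gamma values are illustrative only.
    labels = np.array([[1, 0], [0, 1], [1, 1]])
    probs = np.array([[0.9, 0.2], [0.3, 0.6], [0.8, 0.4]])
    print(multilabel_focal_loss(labels, probs, gamma=[1.0, 3.0]))
```

A larger gamma concentrates the loss on hard, misclassified examples, which is why rare classes such as Hernia might be given a higher value than common ones.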
- REST API Deployment: Develop RESTful API for X-ray image upload and real-time disease prediction
- Bounding Box Detection: Implement object detection to localize disease regions (using BBox annotations)
- Model Ensemble: Combine predictions from multiple architectures for improved accuracy
- Additional Augmentation: Experiment with CutMix, MixUp, and other advanced augmentation techniques
- Uncertainty Quantification: Implement Monte Carlo Dropout or Bayesian networks to provide prediction confidence intervals
- Multi-View Integration: Combine frontal and lateral X-ray views for improved diagnosis
- Temporal Analysis: Track disease progression over time for individual patients with longitudinal data
- Attention Mechanisms: Implement attention-based architectures (Vision Transformers, Swin Transformers) for better interpretability
- External Validation: Test model on independent datasets:
  - CheXpert (Stanford, 224K images)
  - MIMIC-CXR (MIT, 377K images)
  - PadChest (Spain, 160K images)
  - COVID-19 datasets for generalization testing
- Federated Learning: Enable privacy-preserving model training across multiple hospitals
- Report Generation: Automatic radiological report generation from X-ray images (image captioning)
- Clinical Deployment: Integration with hospital PACS (Picture Archiving and Communication Systems)
- FDA/CE Approval: Pursue regulatory approval for clinical decision support tool (Class II medical device)
- Multi-Modal Fusion: Combine X-rays with CT scans, MRI, and patient electronic health records (EHR)
- Real-Time Triage System: Automated prioritization of urgent cases in emergency departments
- Mobile Diagnostic Tool: Point-of-care diagnostic app for resource-limited settings
- Continuous Learning: MLOps pipeline with automated retraining as new annotated data becomes available
- 3D Reconstruction: Generate 3D chest models from 2D X-rays using deep learning
- Treatment Recommendation: Integrate with clinical guidelines to suggest treatment protocols
Primary Dataset: NIH Chest X-Ray Dataset on Kaggle
-
Original Publication: Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thoracic Diseases. IEEE CVPR 2017
-
Citation:
@inproceedings{wang2017chestx, title={Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thoracic diseases}, author={Wang, Xiaosong and Peng, Yifan and Lu, Le and Lu, Zhiyong and Bagheri, Mohammadhadi and Summers, Ronald M}, booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition}, pages={2097--2106}, year={2017} } -
License: CC0: Public Domain
-
Acknowledgment: National Institutes of Health Clinical Center
-
Enhanced Expert Labels: Google Cloud Healthcare - NIH Chest X-Ray Additional Labels
-
Provider: Google LLC / Google Health AI
-
Storage: Google Cloud Storage public bucket
gs://gcs-public-data--healthcare-nih-chest-xray-labels -
Quality: Expert radiologist annotations via adjudicated review, higher quality than text-mined labels
-
License: Same as NIH dataset (CC0: Public Domain)
-
Access Method: Google Cloud Storage (
gsutil) or HTTP download -
Four Findings Expert Labels Citation:
Majkowska A, Mittal S, Steiner DF, et al. Chest Radiograph Interpretation with Deep Learning Models: Assessment with Radiologist-adjudicated Reference Standards and Population-adjusted Evaluation. Radiology. 2020;294(2):421-431. doi:10.1148/radiol.2019191293 -
All Findings Expert Labels Citation:
Nabulsi Z, Sellergren A, Jamshy S, et al. Deep Learning for Distinguishing Normal versus Abnormal Chest Radiographs and Generalization to Two Unseen Diseases Tuberculosis and COVID-19. Sci Rep. 2021;11:15523. doi:10.1038/s41598-021-93967-2 -
Acknowledgment: Google Cloud Healthcare team and radiologist co-authors for curating and providing expert-validated labels
- Code Institute: Data Analytics & AI Bootcamp curriculum and support
- TensorFlow/Keras Documentation: Deep learning implementation guidance
- Scikit-learn Documentation: Machine learning and metrics
- Streamlit Documentation: Dashboard development resources
- Stanford CS231n: Convolutional Neural Networks for Visual Recognition course materials
- Papers with Code: Medical imaging benchmarks and state-of-the-art models
All external code snippets and inspirations are documented inline with appropriate attribution:
- Grad-CAM implementation: Adapted from keras-vis and tf-keras-vis documentation
- Data augmentation: Based on albumentations library examples
- Multi-label classification: Scikit-learn multi-label approaches
- Transfer learning: Keras Applications pre-trained models
- Medical imaging preprocessing: Techniques from radiology AI literature
Dataset Papers:
- Wang X, Peng Y, Lu L, et al. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thoracic Diseases. IEEE CVPR, 2017.
- Majkowska A, Mittal S, Steiner DF, et al. Chest Radiograph Interpretation with Deep Learning Models: Assessment with Radiologist-adjudicated Reference Standards and Population-adjusted Evaluation. Radiology, 2020;294(2):421-431.
- Nabulsi Z, Sellergren A, Jamshy S, et al. Deep Learning for Distinguishing Normal versus Abnormal Chest Radiographs and Generalization to Two Unseen Diseases Tuberculosis and COVID-19. Sci Rep, 2021;11:15523.
Deep Learning Methods:
- He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. IEEE CVPR, 2016.
- Huang G, Liu Z, van der Maaten L, Weinberger KQ. Densely Connected Convolutional Networks. IEEE CVPR, 2017.
- Selvaraju RR, Cogswell M, Das A, et al. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. ICCV, 2017.
Grateful acknowledgment to the open-source community for the excellent tools used in this project (see Technologies Used section).
- Code Institute instructors and mentors
- Peer reviewers from cohort
- Stack Overflow and Kaggle community for troubleshooting support
- Medical imaging research community for best practices
This project is created for educational purposes as part of the Code Institute Data Analytics & AI Bootcamp capstone project.
For questions or feedback regarding this project, please open an issue in the GitHub repository or contact the project maintainer.
Last Updated: November 2025