Anton-Atef/fraud-detection-random-forest

💳 Credit Card Fraud Detection using Random Forest + SMOTE


📌 Overview

🚀 This project detects fraudulent credit card transactions using a Random Forest classifier trained on data rebalanced with SMOTE (Synthetic Minority Over-sampling Technique).
📊 Built on the Kaggle Credit Card Fraud Detection dataset (ULB), it covers preprocessing, resampling, model training, threshold tuning, evaluation, and feature importance analysis.


πŸ“ Dataset Description

πŸ“¦ Records: 284,807 transactions
🧬 Features:

  • V1–V28: PCA-anonymized features
  • Amount: Transaction amount (scaled)
  • Time: Seconds since first transaction (dropped)
  • Class: Target (0 = Legit βœ…, 1 = Fraud ❌)

⚠️ Class Imbalance (fraud is about 0.17% of all transactions):

  • Legit: 284,315 🟢
  • Fraud: 492 🔴
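
The counts above explain why plain accuracy is a weak metric for this dataset. As a quick arithmetic sketch (the variable names are illustrative and not taken from the notebook), a classifier that labels every transaction as legit already scores very high accuracy while catching no fraud:

```python
# Class counts as reported in the dataset description above.
n_legit = 284_315
n_fraud = 492
n_total = n_legit + n_fraud

# Share of fraudulent transactions in the whole dataset.
fraud_rate = n_fraud / n_total
# Accuracy of a trivial classifier that labels every transaction as legit.
baseline_accuracy = n_legit / n_total

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"All-legit baseline accuracy: {baseline_accuracy:.4%}")
```

This is why the fraud-class precision and recall are the figures worth watching.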

📊 Exploratory Data Analysis (EDA)

📉 Visualized:

  • Class distribution
  • Amount distribution by class
  • Correlation heatmaps (features vs Class)
  • Boxplots (e.g., V14 vs Class)
  • Hourly frequency of transactions
  • 2D PCA scatter plot
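
As one example of the plots listed above, the hourly frequency can be derived from the Time column. The following is a minimal sketch, assuming Time holds seconds since the first transaction; the helper name `hourly_counts` and the tiny demo frame are hypothetical stand-ins for the notebook's code and the real data. Because Time is measured from the first transaction, the resulting hour is relative to that starting point rather than a confirmed clock hour.

```python
import pandas as pd

def hourly_counts(df: pd.DataFrame) -> pd.Series:
    """Count transactions per hour bucket derived from the `Time` column (seconds)."""
    hour = (df["Time"] // 3600 % 24).astype(int)
    return hour.value_counts().sort_index()

# Tiny made-up frame standing in for the real dataset.
demo = pd.DataFrame({"Time": [0, 1800, 3600, 7300, 90000], "Class": [0, 0, 1, 0, 0]})
counts = hourly_counts(demo)
print(counts)
```

Passing `df[df["Class"] == 1]` instead of the full frame would give the same breakdown for fraud cases only.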

βš™οΈ Data Preprocessing & SMOTE

πŸ”„ Preprocessing

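from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Drop the target and the unused Time column
X = df.drop(['Time', 'Class'], axis=1)
y = df['Class']

# Stratify so both splits keep the same fraud ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Fit the scaler on the training split only to avoid leaking test statistics
X_train, X_test = X_train.copy(), X_test.copy()
scaler = StandardScaler()
X_train['Amount'] = scaler.fit_transform(X_train[['Amount']]).ravel()
X_test['Amount'] = scaler.transform(X_test[['Amount']]).ravel()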

🧪 Apply SMOTE

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = sm.fit_resample(X_train, y_train)
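
To illustrate what the resampling step does, here is a simplified NumPy sketch of the core SMOTE idea: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbors. This is not imbalanced-learn's implementation; the function name `smote_like_sample` and its parameters are hypothetical, chosen only for illustration.

```python
import numpy as np

def smote_like_sample(X_minority, k=5, n_new=10, seed=42):
    """Generate synthetic points by interpolating between minority-class neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbors than other samples
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                      # random starting samples
    picked = neighbors[base, rng.integers(0, k, size=n_new)]   # one random neighbor each
    gap = rng.random((n_new, 1))                               # interpolation factor in [0, 1)
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Four made-up minority points on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(X_min, k=2, n_new=6)
print(synthetic.shape)  # (6, 2)
```

Because the synthetic points are built from training samples, resampling is applied to the training split only, as in the snippet above; the test set keeps its original class ratio.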

🌲 Model: Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
rfc.fit(X_train_resampled, y_train_resampled)

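🧠 Trained with class_weight='balanced'; since SMOTE has already equalized the classes in the training data, this leaves both class weights at 1, so the resampling is what actually handles the imbalance
🌐 100 decision trees (n_estimators=100)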


🎯 Threshold Tuning for Better Recall

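from sklearn.metrics import precision_recall_curve

# Predicted probability of the fraud class for each test transaction
y_proba = rfc.predict_proba(X_test)[:, 1]
# Precision and recall at every candidate threshold, for inspecting the trade-off
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)

# Fixed cut-off below the 0.5 default; flags more transactions as fraud
optimal_threshold = 0.3
y_pred_adjusted = (y_proba > optimal_threshold).astype(int)

🎚️ Lowered the threshold from the default 0.5 to 0.3 to improve fraud recall, at the cost of some precision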
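
The threshold can also be derived from the precision-recall curve instead of being fixed by hand. The sketch below is one possible approach, not the notebook's method: it picks the threshold with the highest recall that still meets a precision floor, and it does so on a validation split so the test set stays untouched. The helper `pick_threshold`, the `min_precision` value, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

def pick_threshold(y_true, y_score, min_precision=0.8):
    """Return the threshold with the highest recall among those meeting min_precision."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision/recall have one more entry than thresholds; drop the final (1, 0) point.
    precision, recall = precision[:-1], recall[:-1]
    ok = precision >= min_precision
    if not ok.any():
        return 0.5  # fall back to the default cut-off
    best = np.argmax(np.where(ok, recall, -1.0))
    return float(thresholds[best])

# Synthetic imbalanced data standing in for the real transactions.
X, y = make_classification(n_samples=3000, weights=[0.95], random_state=42)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_fit, y_fit)
val_scores = model.predict_proba(X_val)[:, 1]
threshold = pick_threshold(y_val, val_scores, min_precision=0.8)
print(f"Chosen threshold: {threshold:.2f}")
```

Note that `precision_recall_curve` treats a sample as positive when its score is greater than or equal to the threshold, so predictions made with a threshold from this helper should use `>=`.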


🧪 Evaluation Metrics

| 🔍 Metric | 📈 Value |
|---|---|
| Accuracy | 99.75% |
| ROC AUC Score | 0.99 |
| Precision (Fraud) | ~0.87 |
| Recall (Fraud) | ~0.78 |
| F1-Score (Fraud) | ~0.82 |

🧾 Confusion Matrix (rows: actual class, columns: predicted class):

[[85268    27]
 [   32   116]]
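
The fraud-class metrics can be recomputed directly from this matrix, which makes a useful sanity check. The sketch below assumes the usual scikit-learn layout (rows are actual classes, columns are predicted), which matches the 148 fraud cases expected in a 30% stratified test split. The values it yields for precision, F1, and accuracy differ somewhat from the reported table, so the table and the matrix may come from different runs or thresholds.

```python
# Confusion matrix from above: rows are actual classes, columns are predicted.
tn, fp = 85268, 27
fn, tp = 32, 116

precision = tp / (tp + fp)   # share of flagged transactions that are truly fraud
recall = tp / (tp + fn)      # share of actual fraud that was flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.4f}")
# precision=0.811 recall=0.784 f1=0.797 accuracy=0.9993
```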

📋 Classification Report:

  • Class 0: Precision = 1.00, Recall = 1.00
  • Class 1: Precision = 0.87, Recall = 0.78

🌟 Feature Importance

📌 Most important features by Random Forest:

  • V17, V14, V12, V10

📊 Visualized with a horizontal bar plot
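
A ranking like the one above can be read from the fitted forest's `feature_importances_` attribute. This is a minimal sketch on synthetic data; the helper `top_importances` and the six made-up feature names are hypothetical, and the notebook's own plotting code may differ.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def top_importances(model, feature_names, k=4):
    """Return the k largest impurity-based importances as a sorted Series."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False).head(k)

# Synthetic data standing in for the real feature matrix.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=42)
names = [f"V{i}" for i in range(1, 7)]
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

top = top_importances(model, names, k=4)
print(top)
```

With matplotlib installed, `top.sort_values().plot.barh()` draws the horizontal bar plot with the largest importance at the top. Impurity-based importances can favor high-variance features, so they are best read as a rough ranking.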


📂 Project Structure

creditcard-fraud-detection/
│
├── notebook.ipynb              # 🔍 Full implementation and analysis
├── README.md                   # 📘 Project overview
├── images/                     # 🖼️ Visuals and plots
└── requirements.txt            # 📦 Python dependencies

✅ Key Takeaways

✔️ Random Forest trained on SMOTE-resampled data reaches roughly 0.78 fraud recall at roughly 0.87 precision on the held-out test split
📈 Lowering the decision threshold below 0.5 improves recall for fraud cases
📊 Features V14, V17, V12, and V10 are the most informative
💡 Interpretable through feature importances, scalable, and reproducible thanks to fixed random seeds


πŸ› οΈ Future Improvements

πŸ“Œ Try alternative models:

  • XGBoost 🌲
  • LightGBM ⚑
  • Logistic Regression πŸ“ˆ

πŸ§ͺ Add:

  • GridSearchCV for hyperparameter tuning
  • Real-time deployment using Flask / Gradio / Streamlit
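
For the hyperparameter tuning item, a search could look roughly like the sketch below. It is an assumption about how this might be set up, not part of the project: the grid values, the synthetic data, and the choice of `average_precision` as the scoring metric are all illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the training split.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)

# Deliberately small grid so the example runs quickly.
param_grid = {"n_estimators": [25, 50], "max_depth": [4, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="average_precision",  # area under the precision-recall curve
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

If SMOTE is part of the search, wrapping it together with the classifier in an imbalanced-learn `Pipeline` keeps the resampling inside each training fold, so validation folds are scored on untouched data.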

📦 Dependencies

numpy
pandas
matplotlib
seaborn
scikit-learn
imbalanced-learn

📜 License

MIT License © 2025 Anton Atef


🤝 Contributions

👨‍💻 Feel free to fork, clone, and submit pull requests!
📬 Suggestions and issues are welcome anytime!


📬 Contact

📧 Email: tony.atef.954@gmail.com


About

πŸ“Credit card fraud detection using Random Forest with full EDA, preprocessing, evaluation, and visualizations. Based on Kaggle ULB dataset.
