Jupyter Notebook Complete Walkthrough
A comprehensive guide to the Credit_Risk_Dataset.ipynb notebook that trains our Machine Learning model
📓 View on Google Colab
Access the complete notebook with interactive code cells and run it yourself!
Overview
The Credit_Risk_Dataset.ipynb notebook is where we train our Random Forest Classifier model.
This notebook follows a complete Machine Learning workflow from data loading to model saving.
📊 Dataset Information
- Dataset: Credit Risk Dataset from Kaggle
- Total Records: 32,581 loan applications
- Features: 12 original columns (11 input features plus the target) + 1 engineered feature (interest_burden)
- Target: loan_status (0 = Safe, 1 = Risky)
- Class Distribution: 78% Safe, 22% Risky (imbalanced dataset)
1️⃣ Import Libraries
We start by importing all the necessary libraries for data processing, visualization, and machine learning.
# 1. Environment Setup | إعداد البيئة
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import kagglehub
import os
# Machine Learning Libraries | مكتبات تعلم الآلة
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
What Each Library Does:
- pandas & numpy: Data manipulation and numerical operations
- matplotlib & seaborn: Data visualization and plotting
- kagglehub: Download datasets directly from Kaggle
- sklearn: Machine learning tools (preprocessing, models, evaluation)
2️⃣ Load Data & Feature Engineering
We download the dataset from Kaggle and create a new feature called interest_burden.
Step 1: Download Dataset
# Download dataset | تحميل البيانات
path = kagglehub.dataset_download("laotse/credit-risk-dataset")
print("Path:", path)
What This Does: Downloads the Credit Risk Dataset from Kaggle using the kagglehub library. The dataset is automatically cached for faster access in future runs.
Step 2: Load CSV File
# Load the CSV file from the folder returned by kagglehub
df = pd.read_csv(os.path.join(path, 'credit_risk_dataset.csv'))
print(f"Shape: {df.shape}")
df.head()
Result: Dataset loaded with 32,581 rows and 12 columns.
Step 3: Feature Engineering ⭐
# --- Feature Engineering Step ---
# Create 'Interest Burden' = (Loan Amount * Interest Rate) / Income
# إنشاء ميزة جديدة: نسبة عبء الفائدة إلى الدخل
df['interest_burden'] = (df['loan_amnt'] * (df['loan_int_rate'] / 100)) / df['person_income']
df['interest_burden'] = df['interest_burden'].replace([np.inf, -np.inf], np.nan).fillna(0) # A missing rate gives NaN and zero income gives inf; set both to 0
print(f"Data Shape: {df.shape}")
print("✅ Feature Engineering Complete")
🔑 Why Feature Engineering?
interest_burden is a calculated feature that shows the share of a person's income that goes toward loan interest. It is expressed as a fraction rather than a percentage, so 0.025 means 2.5%.
Formula: interest_burden = (loan_amnt × (loan_int_rate / 100)) / person_income
Why It Matters:
- A lower interest burden means the person can more easily afford loan payments
- This single number captures an important financial relationship
- Helps the model make better predictions about default risk
Result: Dataset now has 13 columns (12 original + 1 new feature).
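To make the formula concrete, here is a minimal sketch on a three-row toy table. The values are invented for illustration and are not taken from the dataset. It also shows which edge cases produce NaN and which produce inf, and one possible way to clean up both.

```python
import numpy as np
import pandas as pd

# Invented rows: a normal loan, a loan with a missing rate, and a zero-income applicant.
toy = pd.DataFrame({
    "loan_amnt": [10000, 5000, 8000],
    "loan_int_rate": [12.5, np.nan, 10.0],
    "person_income": [50000, 40000, 0],
})

# Same formula as the notebook: (loan amount * rate as a fraction) / income.
raw = (toy["loan_amnt"] * (toy["loan_int_rate"] / 100)) / toy["person_income"]

# Row 0: 10000 * 0.125 / 50000 = 0.025
# Row 1: missing rate propagates as NaN
# Row 2: division by zero income yields inf, which fillna alone does not catch
cleaned = raw.replace([np.inf, -np.inf], np.nan).fillna(0)

print(raw.tolist())
print(cleaned.tolist())
```

Replacing both cases with 0 is a simplification chosen here for the sketch; whether 0 is the right fill value for a real model is a separate modelling decision.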
3️⃣ Data Cleaning & Sanity Checks
We remove outliers and check data quality to ensure our model trains on clean, realistic data.
# 3. Sanity Checks (Cleaning) | التحقق المنطقي والتنظيف
# Remove impossible ages (Outliers)
# حذف الأعمار المستحيلة (أكبر من 100 سنة)
df_clean = df[df['person_age'] <= 100].copy()
# Check Imbalance
# التحقق من توازن البيانات
print("\nTarget Distribution (Loan Status):")
print(df_clean['loan_status'].value_counts(normalize=True))
What This Does:
- Removes Outliers: Filters out records where age > 100 (impossible values)
- Checks Class Distribution: Shows the proportion of Safe (0) vs Risky (1) loans
Class Distribution Result:
- Safe (0): 78.18%
- Risky (1): 21.82%
⚠️ This is an imbalanced dataset, so we'll use stratified splitting to keep the same class ratio in the training and test sets.
Data Exploration
The notebook also includes:
- df.info(): Shows data types and missing values
- df.describe(): Shows statistical summary (mean, min, max, etc.)
- df.head(): Displays first few rows
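As a complement to these exploration calls, the sketch below counts missing values per column and lists the rows that an age filter would remove. The mini-table is hypothetical, and the column name person_emp_length is an assumption used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample; in the notebook the same calls would run on `df`.
toy = pd.DataFrame({
    "person_age": [22, 35, 144, 41],
    "person_emp_length": [1.0, np.nan, 4.0, 7.0],
    "loan_int_rate": [11.5, np.nan, np.nan, 9.9],
})

# Number of NaN values in each column.
missing_per_column = toy.isna().sum()

# Rows that a `person_age <= 100` filter would drop.
suspicious_ages = toy[toy["person_age"] > 100]

print(missing_per_column)
print(suspicious_ages)
```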
4️⃣ Data Splitting
We split the data into training and testing sets using stratified splitting so that both sets keep the original class balance.
# 4. Stratified Splitting | التقسيم الطبقي
X = df_clean.drop('loan_status', axis=1)
y = df_clean['loan_status']
# Split 80% Train, 20% Test with Stratify
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"\nTraining Set: {X_train.shape}")
print(f"Test Set: {X_test.shape}")
💡 Why Stratified Splitting?
stratify=y ensures that both training and test sets have the same proportion of Safe vs Risky loans as the original dataset.
Why This Matters:
- Prevents the test set from having too many or too few risky loans
- Ensures fair evaluation of the model
- Critical for imbalanced datasets like ours (78% safe, 22% risky)
Result:
- Training Set: 26,060 samples (80%)
- Test Set: 6,516 samples (20%)
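The effect of stratify=y can be checked on synthetic labels. This is a small sketch that assumes a 78/22 class ratio similar to the one reported above; the feature column is only a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 780 safe (0) and 220 risky (1).
y = np.array([0] * 780 + [1] * 220)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratification, both splits keep the 22% risky share.
train_risky_share = y_tr.mean()
test_risky_share = y_te.mean()
print(f"train: {train_risky_share:.3f}, test: {test_risky_share:.3f}")
```

Without stratify, the risky share in the test set would vary from one random split to another.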
5️⃣ Preprocessing Pipeline
We build a Pipeline to automatically handle missing values, encode categorical data, and scale numerical features.
# 5. Preprocessing Pipeline | خط المعالجة
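# Assumption: column lists are derived from dtypes; the notebook may define them explicitly
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns
# Define transformers for numeric and categorical columns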
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(drop='first', sparse_output=False))
])
# Combine transformers
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
]
)
Pipeline Steps Explained:
For Numerical Features:
- SimpleImputer: Fills missing values with the mean (average) of that column
- StandardScaler: Standardizes values to have mean=0 and std=1 (puts everything on the same scale)
For Categorical Features:
- SimpleImputer: Fills missing values with the most frequent (mode) value
- OneHotEncoder: Converts text categories (like "RENT", "OWN") into 0/1 indicator columns; drop='first' omits the first category of each feature to avoid a redundant column
🔧 Why Use a Pipeline?
- Consistency: Same preprocessing applied during training AND prediction
- Organization: All preprocessing steps in one place
- Easier Deployment: Save the entire pipeline, not just the model
- Prevents Bugs: Can't forget a preprocessing step!
6️⃣ Model Training
We combine the preprocessing pipeline with the Random Forest Classifier and train the model.
# 6. Model Training | تدريب النموذج
# Define the full pipeline with the model
model = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(
n_estimators=100,
random_state=42,
n_jobs=-1
))
])
print("\n🚀 Training Model...")
model.fit(X_train, y_train)
print("✅ Training Complete!")
🎯 Random Forest Parameters:
- n_estimators=100: Builds 100 decision trees (more trees usually give more stable predictions with diminishing returns, at the cost of slower training)
- random_state=42: Ensures reproducible results (same random seed every time)
- n_jobs=-1: Uses all CPU cores for faster training
What Happens During Training:
- The preprocessor transforms the training data (impute, scale, encode)
- Random Forest builds 100 decision trees
- Each tree learns patterns from the data
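- The trees' predictions are combined into one answer (scikit-learn averages each tree's predicted class probabilities rather than counting hard votes)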
Training Time: Usually takes a few minutes depending on your computer's speed.
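How the trees combine their predictions can be verified on synthetic data. The sketch below uses make_classification as a stand-in for the credit data and a smaller forest than the notebook's, then compares the forest's probabilities with the plain average over its individual trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in for the real dataset.
X, y = make_classification(
    n_samples=300, n_features=6, weights=[0.78, 0.22], random_state=42
)

forest = RandomForestClassifier(n_estimators=25, random_state=42)
forest.fit(X, y)

# Collect each tree's class probabilities: shape (n_trees, n_samples, n_classes).
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Averaging over trees reproduces the forest's own predict_proba output.
averaged = per_tree.mean(axis=0)
matches_forest = np.allclose(averaged, forest.predict_proba(X))
print(matches_forest)
```

The predicted class is then the one with the highest averaged probability.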
7️⃣ Loss Curve Visualization
We plot a training loss curve from a small Neural Network fitted on the same preprocessed data as a diagnostic. The curve describes that network, not the Random Forest.
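# We use a separate MLP model just to visualize the loss curve
# The MLP needs numeric input, so transform the training data with the preprocessor first
X_train_transformed = preprocessor.fit_transform(X_train)
diag_model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=50, random_state=42)
diag_model.fit(X_train_transformed, y_train)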
# Plot the loss curve
plt.plot(diag_model.loss_curve_, label='Training Loss', color='blue')
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Training Loss Curve')
plt.legend()
plt.show()
Why Visualize Loss?
- Confirms Learning: If loss decreases over time, the model is learning
- Checks Convergence: If the loss flattens out, further iterations add little. Spotting overfitting also requires a validation curve, because training loss alone cannot show it
- Diagnostic Tool: Helps identify training problems early
Note: Random Forest doesn't have a loss curve (it's not iterative), so we use MLPClassifier just for visualization purposes.
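If a validation signal is wanted alongside the training loss, MLPClassifier can hold out part of the training data itself. The following is a sketch on synthetic data, not the notebook's code; early_stopping=True and validation_fraction are standard scikit-learn options, while the dataset and sizes are assumptions.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

diag = MLPClassifier(
    hidden_layer_sizes=(32,),
    max_iter=50,
    early_stopping=True,      # hold out part of the data for validation
    validation_fraction=0.2,
    random_state=42,
)

with warnings.catch_warnings():
    # Hitting max_iter before convergence is expected in this short demo.
    warnings.simplefilter("ignore", ConvergenceWarning)
    diag.fit(X, y)

train_loss = diag.loss_curve_           # one training loss value per iteration
val_accuracy = diag.validation_scores_  # one held-out accuracy value per iteration
print(len(train_loss), len(val_accuracy))
```

Plotting both lists against the iteration number shows whether held-out accuracy stalls or drops while training loss keeps falling, which is the usual sign of overfitting.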
8️⃣ Model Evaluation
We test the trained model on unseen test data to measure its performance.
# 8. Evaluation | التقييم
y_pred = model.predict(X_test)
# Classification Report
print(classification_report(y_test, y_pred))
# Confusion Matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()
📊 Evaluation Metrics:
- Accuracy: Overall percentage of correct predictions
- Precision: Of all predicted risky loans, how many were actually risky?
- Recall: Of all actual risky loans, how many did we catch?
- F1-Score: Balance between precision and recall
Confusion Matrix:
A table showing:
- True Positives: Correctly predicted risky loans
- True Negatives: Correctly predicted safe loans
- False Positives: Predicted risky but actually safe (Type I error)
- False Negatives: Predicted safe but actually risky (Type II error)
Goal: Maximize True Positives and True Negatives, minimize False Positives and False Negatives.
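The four metrics follow directly from the confusion matrix counts. Here is a small worked sketch with invented counts (not results from the notebook), treating "risky" as the positive class.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented counts: 40 risky loans caught, 10 false alarms,
# 20 risky loans missed, 130 safe loans correctly passed.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this example accuracy is 0.85 even though a third of the risky loans are missed (recall is about 0.667), which is why recall and F1 deserve attention on an imbalanced dataset.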
9️⃣ Model Saving
Finally, we save the trained model so we can use it in our Flask web application.
import joblib
# Save the complete pipeline (preprocessor + model)
joblib.dump(model, 'credit_risk_model.pkl')
print("تم حفظ الموديل! قم بتنزيل ملف credit_risk_model.pkl من القائمة الجانبية")
print("Model saved! Download credit_risk_model.pkl from the sidebar")
💾 Why Save the Complete Pipeline?
We save the entire pipeline (preprocessor + classifier), not just the model, because:
- The preprocessor is needed for every prediction
- Ensures consistent preprocessing (same as training)
- Simplifies deployment - just load and use!
File Size: The saved model is typically 20-50 MB (before compression).
Next Steps:
- Download the credit_risk_model.pkl file from Google Colab
- Place it in the same folder as your app.py file
- Use joblib.load() in your Flask app to load the model
- Start making predictions!
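The save and load round trip can be sketched without Flask. The example below trains a tiny stand-in pipeline on invented rows, saves it to a temporary folder, reloads it and predicts for one applicant. The helper add_interest_burden is hypothetical; it reflects the assumption that the engineered column is computed outside the saved pipeline, so the application has to recreate it before calling predict.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def add_interest_burden(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: recreate the engineered column used in training."""
    out = frame.copy()
    out["interest_burden"] = (
        out["loan_amnt"] * (out["loan_int_rate"] / 100)
    ) / out["person_income"]
    return out


# Tiny stand-in for the real training data (values are invented).
train = add_interest_burden(pd.DataFrame({
    "loan_amnt": [5000, 20000, 8000, 25000],
    "loan_int_rate": [8.0, 18.0, 9.5, 20.0],
    "person_income": [60000, 30000, 80000, 28000],
}))
labels = [0, 1, 0, 1]

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=42)),
])
pipeline.fit(train, labels)

# Save and reload, as the Flask app would do with the real file.
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "credit_risk_model.pkl")
    joblib.dump(pipeline, model_path)
    loaded = joblib.load(model_path)

# A new applicant must go through the same feature engineering first.
applicant = add_interest_burden(pd.DataFrame([
    {"loan_amnt": 12000, "loan_int_rate": 15.0, "person_income": 40000},
]))
prediction = int(loaded.predict(applicant)[0])
risky_probability = float(loaded.predict_proba(applicant)[0, 1])
print(prediction, round(risky_probability, 2))
```

Files saved with joblib should be loaded with a scikit-learn version compatible with the one used for training, and only from trusted sources, since unpickling can execute arbitrary code.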
🔄 Complete Workflow Summary
- Import Libraries → Load all necessary tools
- Load Data → Download dataset from Kaggle
- Feature Engineering → Create interest_burden feature
- Data Cleaning → Remove outliers and check quality
- Data Splitting → Split into train/test with stratification
- Build Pipeline → Create preprocessing pipeline
- Train Model → Fit Random Forest on training data
- Visualize → Plot loss curve (diagnostic)
- Evaluate → Test on unseen data
- Save Model → Export for use in Flask app
🎯 Key Takeaways
- Feature Engineering: Creating interest_burden improves model accuracy
- Stratified Splitting: Essential for imbalanced datasets
- Pipeline: Ensures consistent preprocessing during training and prediction
- Random Forest: Powerful algorithm that works well on structured financial data
- Evaluation: Always test on unseen data to measure real performance
- Model Saving: Save the complete pipeline, not just the classifier
Ready to Explore?
Open the notebook on Google Colab to see the code in action and run it yourself!