Jupyter Notebook Explanation - Smart Credit Risk System

📓 View on Google Colab

Access the complete notebook with interactive code cells and run it yourself!

📚 Table of Contents

Overview
1. Import Libraries
2. Load Data & Feature Engineering
3. Data Cleaning & Sanity Checks
4. Data Splitting
5. Preprocessing Pipeline
6. Model Training
7. Loss Curve Visualization
8. Model Evaluation
9. Model Saving

Overview

The Credit_Risk_Dataset.ipynb notebook is where we train our Random Forest Classifier model. This notebook follows a complete Machine Learning workflow from data loading to model saving.

📊 Dataset Information

Dataset: Credit Risk Dataset from Kaggle
Total Records: 32,581 loan applications
Columns: 12 original columns (11 input features plus the target) + 1 engineered feature (interest_burden)
Target: loan_status (0 = Safe, 1 = Risky)
Class Distribution: 78% Safe, 22% Risky (imbalanced dataset)

            

1️⃣ Import Libraries

We start by importing all the necessary libraries for data processing, visualization, and machine learning.

# 1. Environment Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import kagglehub
import os

# Machine Learning Libraries
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
            

What Each Library Does:

pandas & numpy: Data manipulation and numerical operations
matplotlib & seaborn: Data visualization and plotting
kagglehub: Download datasets directly from Kaggle
sklearn: Machine learning tools (preprocessing, models, evaluation)

2️⃣ Load Data & Feature Engineering

We download the dataset from Kaggle and create a new feature called interest_burden.

Step 1: Download Dataset

# Download dataset
path = kagglehub.dataset_download("laotse/credit-risk-dataset")
print("Path:", path)
            

What This Does: Downloads the Credit Risk Dataset from Kaggle using the kagglehub library. The dataset is automatically cached for faster access in future runs.

Step 2: Load CSV File

# Load the CSV file from the folder that kagglehub downloaded
df = pd.read_csv(os.path.join(path, 'credit_risk_dataset.csv'))
print(f"Shape: {df.shape}")
df.head()
            

Result: Dataset loaded with 32,581 rows and 12 columns.

Step 3: Feature Engineering ⭐

# --- Feature Engineering Step ---
# Create 'Interest Burden' = (Loan Amount * Interest Rate) / Income
df['interest_burden'] = (df['loan_amnt'] * (df['loan_int_rate'] / 100)) / df['person_income']
# Missing interest rates produce NaN, and a zero income would produce inf (not NaN),
# so convert inf to NaN first and then fill every NaN with 0
df['interest_burden'] = df['interest_burden'].replace([np.inf, -np.inf], np.nan).fillna(0)

print(f"Data Shape: {df.shape}")
print("✅ Feature Engineering Complete")
            

🔑 Why Feature Engineering?

interest_burden is a calculated feature that estimates what fraction of a person's income would go toward one year of loan interest. A value of 0.02 means roughly 2% of income.

Formula: interest_burden = (loan_amnt × (loan_int_rate / 100)) / person_income

Why It Matters:

A lower interest burden means the person can more easily afford loan payments
This single number captures an important financial relationship
Can help the model predict default risk, because it relates the cost of the loan to the borrower's ability to pay

Result: Dataset now has 13 columns (12 original + 1 new feature).
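
To make the formula concrete, here is a minimal sketch on three made-up applicants. The values are illustrative and are not rows from the dataset. It also shows the two edge cases worth watching: a missing interest rate yields NaN, while a zero income yields inf, which fillna alone does not catch.

```python
import numpy as np
import pandas as pd

# Three made-up applicants (illustrative values, not real dataset rows)
toy = pd.DataFrame({
    "loan_amnt": [10_000, 5_000, 8_000],
    "loan_int_rate": [12.0, np.nan, 10.0],   # second rate is missing
    "person_income": [60_000, 40_000, 0],    # third income is zero
})

burden = (toy["loan_amnt"] * (toy["loan_int_rate"] / 100)) / toy["person_income"]
# Row 0: 10000 * 0.12 / 60000 = 0.02 (about 2% of income)
# Row 1: NaN, because the interest rate is missing
# Row 2: inf, because the income is zero

# Convert inf to NaN, then fill every NaN with 0
burden_clean = burden.replace([np.inf, -np.inf], np.nan).fillna(0)
print(burden_clean.tolist())
```

Filling with 0 treats unknown cases as "no burden", which is a simplification; whether that is appropriate depends on how many rows are affected.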

3️⃣ Data Cleaning & Sanity Checks

We remove outliers and check data quality to ensure our model trains on clean, realistic data.

# 3. Sanity Checks (Cleaning)
# Remove impossible ages (older than 100 years)
df_clean = df[df['person_age'] <= 100].copy()

# Check class imbalance
print("\nTarget Distribution (Loan Status):")
print(df_clean['loan_status'].value_counts(normalize=True))
            

What This Does:

Removes Outliers: Filters out records where age > 100 (impossible values)
Checks Class Distribution: Shows the proportion of Safe (0) vs Risky (1) loans

Class Distribution Result:

Safe (0): 78.18%
Risky (1): 21.82%

⚠️ This is an imbalanced dataset, so we'll use stratified splitting to keep the same class proportions in the training and test sets.
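
The same two checks can be tried on a tiny made-up table. This is only a sketch with invented rows; it reuses the column names person_age and loan_status from the dataset.

```python
import pandas as pd

# Five made-up rows; one has an impossible age
toy = pd.DataFrame({
    "person_age": [25, 31, 144, 47, 22],
    "loan_status": [0, 0, 1, 1, 0],
})

# Keep only plausible ages, as in the notebook
toy_clean = toy[toy["person_age"] <= 100].copy()

# Share of each class among the remaining rows
dist = toy_clean["loan_status"].value_counts(normalize=True)
print(len(toy), "->", len(toy_clean))
print(dist)
```

With normalize=True, value_counts returns proportions instead of raw counts, which is what makes the imbalance easy to read.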

Data Exploration

The notebook also includes:

df.info(): Shows data types and missing values
df.describe(): Shows statistical summary (mean, min, max, etc.)
df.head(): Displays first few rows

4️⃣ Data Splitting

We split the data into training and testing sets using stratified splitting, so both sets keep the original class proportions.

# 4. Stratified Splitting
X = df_clean.drop('loan_status', axis=1)
y = df_clean['loan_status']

# Split 80% Train, 20% Test with Stratify
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"\nTraining Set: {X_train.shape}")
print(f"Test Set: {X_test.shape}")
            

💡 Why Stratified Splitting?

stratify=y ensures that both training and test sets have the same proportion of Safe vs Risky loans as the original dataset.

Why This Matters:

Prevents the test set from having too many or too few risky loans
Ensures fair evaluation of the model
Critical for imbalanced datasets like ours (78% safe, 22% risky)

Result:

Training Set: 26,060 samples (80%)
Test Set: 6,516 samples (20%)
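
A quick way to see the effect of stratify is to split synthetic labels with roughly the same 78/22 balance and compare class shares. The data below is made up for illustration, and X_demo is just a placeholder feature.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 78/22 class balance (illustrative only)
y_demo = np.array([0] * 780 + [1] * 220)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The share of risky (1) labels should be nearly identical in all three
print("full :", y_demo.mean())
print("train:", y_tr.mean())
print("test :", y_te.mean())
```

Without stratify, the risky share in a small test set can drift by chance from one random seed to another.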

5️⃣ Preprocessing Pipeline

We build a Pipeline to automatically handle missing values, encode categorical data, and scale numerical features.

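# 5. Preprocessing Pipeline
# Assumption: the column lists are not shown in this walkthrough;
# selecting them by dtype is one common way to build them
numeric_features = X_train.select_dtypes(include=['number']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()

# Define transformers for numeric and categorical columns
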
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', sparse_output=False))
])

# Combine transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
            

Pipeline Steps Explained:

For Numerical Features:

SimpleImputer: Fills missing values with the mean (average) of that column
StandardScaler: Standardizes each column to mean=0 and std=1 (puts everything on the same scale)

For Categorical Features:

SimpleImputer: Fills missing values with the most frequent (mode) value
OneHotEncoder: Converts text categories (like "RENT", "OWN") into 0/1 columns; drop='first' omits one column per feature because it is redundant

🔧 Why Use a Pipeline?

Consistency: Same preprocessing applied during training AND prediction
Organization: All preprocessing steps in one place
Easier Deployment: Save the entire pipeline, not just the model
Prevents Bugs: Can't forget a preprocessing step!

            

6️⃣ Model Training

We combine the preprocessing pipeline with the Random Forest Classifier and train the model.

# 6. Model Training
# Define the full pipeline with the model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(
        n_estimators=100,
        random_state=42,
        n_jobs=-1
    ))
])

print("\n🚀 Training Model...")
model.fit(X_train, y_train)
print("✅ Training Complete!")
            

🎯 Random Forest Parameters:

n_estimators=100: Creates 100 decision trees (more trees give more stable predictions with diminishing returns, at the cost of slower training)
random_state=42: Ensures reproducible results (same random seed every time)
n_jobs=-1: Uses all CPU cores for faster training

            

What Happens During Training:

The preprocessor transforms the training data (impute, scale, encode)
Random Forest builds 100 decision trees
Each tree learns patterns from the data
The trees vote together to make predictions
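
The "voting" step can be inspected directly. In scikit-learn, RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes. The sketch below checks this on synthetic data generated with make_classification, which stands in for the credit data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the credit dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Collect each tree's class probabilities, then average across trees
per_tree = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# The forest's own probabilities should equal that average
matches = np.allclose(averaged, forest.predict_proba(X_demo))
print(len(forest.estimators_), matches)
```

The predicted class is then the one with the highest averaged probability.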

Training Time: Varies with your computer's speed, from a few seconds to a few minutes.

7️⃣ Loss Curve Visualization

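We also plot a training loss curve using a separate small neural network (MLPClassifier) fitted on the same preprocessed data. It is a diagnostic for the data and preprocessing, not a view into the Random Forest's own training.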

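# We use a separate MLP model just to visualize the loss curve
# The MLP needs numeric input, so reuse the preprocessor already fitted inside the pipeline
X_train_transformed = model.named_steps['preprocessor'].transform(X_train)

diag_model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=50, random_state=42)
diag_model.fit(X_train_transformed, y_train)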

# Plot the loss curve
plt.plot(diag_model.loss_curve_, label='Training Loss', color='blue')
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Training Loss Curve')
plt.legend()
plt.show()
            

Why Visualize Loss?

Confirms Learning: If loss decreases over time, the model is learning
Detects Convergence: If the loss stops decreasing, more iterations are unlikely to help (spotting overfitting would also require a validation curve)
Diagnostic Tool: Helps identify training problems early

Note: Random Forest doesn't have a loss curve (it's not iterative), so we use MLPClassifier just for visualization purposes.
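
A training loss curve by itself says little about overfitting. One option, shown here as a sketch on synthetic data and not something the notebook does, is to set early_stopping=True so that MLPClassifier holds out part of the training data and records a validation score per iteration in validation_scores_.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the credit dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

# early_stopping=True holds out 10% of the data for validation by default
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=50,
                    early_stopping=True, random_state=42)
mlp.fit(X_demo, y_demo)

train_loss = mlp.loss_curve_         # one training loss value per iteration
val_scores = mlp.validation_scores_  # one validation accuracy per iteration
print(len(train_loss), train_loss[0], train_loss[-1])
```

Plotting train_loss next to val_scores makes it easier to see whether the training loss keeps falling while the validation accuracy stalls. The fit may emit a ConvergenceWarning if it reaches max_iter first.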

8️⃣ Model Evaluation

We test the trained model on unseen test data to measure its performance.

# 8. Evaluation
y_pred = model.predict(X_test)

# Classification Report
print(classification_report(y_test, y_pred))

# Confusion Matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()
            

📊 Evaluation Metrics:

Accuracy: Overall percentage of correct predictions
Precision: Of all predicted risky loans, how many were actually risky?
Recall: Of all actual risky loans, how many did we catch?
F1-Score: Balance between precision and recall

            

Confusion Matrix:

A table showing:

True Positives: Correctly predicted risky loans
True Negatives: Correctly predicted safe loans
False Positives: Predicted risky but actually safe (Type I error)
False Negatives: Predicted safe but actually risky (Type II error)

Goal: Maximize True Positives and True Negatives, minimize False Positives and False Negatives.
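
To connect the four cells of the confusion matrix to the metrics, here is a small worked example with ten made-up labels (1 = Risky). The labels are invented for illustration; the scikit-learn functions are used only to cross-check the hand calculation.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Ten made-up loans: six safe (0) and four risky (1)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 7 / 10
precision = tp / (tp + fp)                   # 3 / 5
recall = tp / (tp + fn)                      # 3 / 4
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(accuracy, precision, recall, f1)

# Cross-check against scikit-learn's own implementations
checks = [
    abs(accuracy - accuracy_score(y_true, y_hat)) < 1e-12,
    abs(precision - precision_score(y_true, y_hat)) < 1e-12,
    abs(recall - recall_score(y_true, y_hat)) < 1e-12,
    abs(f1 - f1_score(y_true, y_hat)) < 1e-12,
]
print(all(checks))
```

In credit risk, a False Negative (a risky loan predicted as safe) is usually the costlier mistake, which is why recall on the Risky class deserves attention alongside accuracy.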

9️⃣ Model Saving

Finally, we save the trained model so we can use it in our Flask web application.

import joblib

# Save the complete pipeline (preprocessor + model)
joblib.dump(model, 'credit_risk_model.pkl')

print("Model saved! Download credit_risk_model.pkl from the sidebar")
            

💾 Why Save the Complete Pipeline?

We save the entire pipeline (preprocessor + classifier), not just the model, because:

The preprocessor is needed for every prediction
Ensures consistent preprocessing (same as training)
Simplifies deployment - just load and use!

File Size: The saved model is typically 20-50 MB (before compression).

Next Steps:

Download the credit_risk_model.pkl file from Google Colab
Place it in the same folder as your app.py file
Use joblib.load() in your Flask app to load the model
Start making predictions!
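
The save-and-load round trip can be sketched as follows. This uses a tiny stand-in pipeline trained on made-up rows and a temporary file named demo_model.pkl, both hypothetical; it is not the real credit_risk_model.pkl, and the Flask wiring is left out.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in for the real pipeline, trained on made-up rows
train = pd.DataFrame({
    "person_income": [20_000, 30_000, 80_000, 90_000],
    "loan_amnt": [15_000, 12_000, 5_000, 4_000],
})
labels = [1, 1, 0, 0]
demo_model = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=42)),
])
demo_model.fit(train, labels)

# Save to a temporary file, then load it back
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "demo_model.pkl")
    joblib.dump(demo_model, model_path)
    loaded = joblib.load(model_path)

# A new applicant must be a DataFrame with the same columns used in training
applicant = pd.DataFrame([{"person_income": 25_000, "loan_amnt": 14_000}])
same = (loaded.predict(train) == demo_model.predict(train)).all()
print(loaded.predict(applicant), same)
```

For the real model, the input row needs every column the pipeline was trained on, including interest_burden, so the app has to compute that feature before calling predict. It is also safest to load the file with the same scikit-learn version that saved it.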

🔄 Complete Workflow Summary

Import Libraries → Load all necessary tools
Load Data → Download dataset from Kaggle
Feature Engineering → Create interest_burden feature
Data Cleaning → Remove outliers and check quality
Data Splitting → Split into train/test with stratification
Build Pipeline → Create preprocessing pipeline
Train Model → Fit Random Forest on training data
Visualize → Plot loss curve (diagnostic)
Evaluate → Test on unseen data
Save Model → Export for use in Flask app

            

🎯 Key Takeaways

Feature Engineering: Creating interest_burden gives the model a direct measure of affordability, which can improve its predictions
Stratified Splitting: Essential for imbalanced datasets
Pipeline: Ensures consistent preprocessing during training and prediction
Random Forest: Powerful algorithm that works well on structured financial data
Evaluation: Always test on unseen data to measure real performance
Model Saving: Save the complete pipeline, not just the classifier
