📓 View on Google Colab

Access the complete notebook with interactive code cells and run it yourself!

Overview

The Credit_Risk_Dataset.ipynb notebook is where we train our Random Forest Classifier model. This notebook follows a complete Machine Learning workflow from data loading to model saving.

📊 Dataset Information

  • Dataset: Credit Risk Dataset from Kaggle
  • Total Records: 32,581 loan applications
  • Features: 11 original input features (12 columns including the loan_status target) + 1 engineered feature (interest_burden)
  • Target: loan_status (0 = Safe, 1 = Risky)
  • Class Distribution: 78% Safe, 22% Risky (imbalanced dataset)

1️⃣ Import Libraries

We start by importing all the necessary libraries for data processing, visualization, and machine learning.

# 1. Environment Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import kagglehub
import os

# Machine Learning Libraries
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

What Each Library Does:

  • pandas & numpy: Data manipulation and numerical operations
  • matplotlib & seaborn: Data visualization and plotting
  • kagglehub: Download datasets directly from Kaggle
  • sklearn: Machine learning tools (preprocessing, models, evaluation)

2️⃣ Load Data & Feature Engineering

We download the dataset from Kaggle and create a new feature called interest_burden.

Step 1: Download Dataset

# Download dataset
path = kagglehub.dataset_download("laotse/credit-risk-dataset")
print("Path:", path)

What This Does: Downloads the Credit Risk Dataset from Kaggle using the kagglehub library. The dataset is automatically cached for faster access in future runs.

Step 2: Load CSV File

# Load the CSV file from the folder returned by kagglehub
df = pd.read_csv(os.path.join(path, 'credit_risk_dataset.csv'))
print(f"Shape: {df.shape}")
df.head()

Result: Dataset loaded with 32,581 rows and 12 columns.

Step 3: Feature Engineering ⭐

# --- Feature Engineering Step ---
# Create 'Interest Burden' = (Loan Amount * Interest Rate) / Income
df['interest_burden'] = (df['loan_amnt'] * (df['loan_int_rate'] / 100)) / df['person_income']
# Missing interest rates produce NaN and a zero income produces inf; set both to 0
df['interest_burden'] = df['interest_burden'].replace([np.inf, -np.inf], np.nan).fillna(0)
print(f"Data Shape: {df.shape}")
print("✅ Feature Engineering Complete")

🔑 Why Feature Engineering?

interest_burden is a calculated feature that approximates what fraction of a person's yearly income one year of interest on the loan would consume (a value of 0.02 means 2% of income).

Formula: interest_burden = (loan_amnt × (loan_int_rate / 100)) / person_income

Why It Matters:

  • A lower interest burden means the person can more easily afford loan payments
  • This single number captures an important financial relationship
  • May help the model predict default risk, because it relates the cost of the loan directly to the borrower's income

Result: Dataset now has 13 columns (12 original + 1 new feature).
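
As a worked example of the formula above, here is a minimal pure-Python sketch. It assumes loan_int_rate is an annual percentage, and the helper name and the sample numbers are hypothetical. Returning 0 when the rate is missing mirrors the notebook's fillna(0); returning 0 for a zero income is an assumption of this sketch.

```python
import math
from typing import Optional


def interest_burden(loan_amnt: float,
                    loan_int_rate: Optional[float],
                    person_income: float) -> float:
    """Yearly interest as a fraction of yearly income (0.0 if it cannot be computed)."""
    rate_missing = loan_int_rate is None or math.isnan(loan_int_rate)
    if rate_missing or person_income == 0:
        return 0.0  # assumption: treat uncomputable cases as zero burden
    return loan_amnt * (loan_int_rate / 100) / person_income


# Hypothetical applicant: 10,000 loan at 12% with an income of 60,000
burden_typical = interest_burden(10_000, 12.0, 60_000)
# Same applicant, but the interest rate is missing
burden_missing_rate = interest_burden(10_000, None, 60_000)

print(f"Typical burden: {burden_typical:.3f}")
print(f"Missing rate:   {burden_missing_rate:.3f}")
```

In the first case the yearly interest is 1,200, which is 2% of the 60,000 income.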

3️⃣ Data Cleaning & Sanity Checks

We remove outliers and check data quality to ensure our model trains on clean, realistic data.

# 3. Sanity Checks (Cleaning)
# Remove impossible ages (outliers over 100 years)
df_clean = df[df['person_age'] <= 100].copy()

# Check class imbalance
print("\nTarget Distribution (Loan Status):")
print(df_clean['loan_status'].value_counts(normalize=True))

What This Does:

  • Removes Outliers: Filters out records where age > 100 (impossible values)
  • Checks Class Distribution: Shows the proportion of Safe (0) vs Risky (1) loans

Class Distribution Result:

  • Safe (0): 78.18%
  • Risky (1): 21.82%

⚠️ This is an imbalanced dataset - we'll use stratified splitting so that the training and test sets both keep this ratio!
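
To see why the imbalance matters, consider a baseline that labels every loan as Safe. The sketch below uses a hypothetical batch of 10,000 loans with the class shares reported above; it is an illustration, not a result from the notebook.

```python
# Hypothetical batch with the reported class shares (78.18% safe, 21.82% risky)
n_loans = 10_000
n_risky = 2_182
n_safe = n_loans - n_risky

# Baseline: predict "Safe" (0) for every loan
correct_predictions = n_safe      # every safe loan is right, every risky loan is wrong
baseline_accuracy = correct_predictions / n_loans
risky_loans_caught = 0            # the baseline never predicts "Risky"
baseline_recall = risky_loans_caught / n_risky

print(f"Baseline accuracy: {baseline_accuracy:.2%}")
print(f"Baseline recall on risky loans: {baseline_recall:.2%}")
```

The baseline scores about 78% accuracy while catching no risky loans, which is why accuracy alone is a weak measure here.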

Data Exploration

The notebook also includes:

4️⃣ Data Splitting

We split the data into training and testing sets using stratified splitting, which keeps the class ratio the same in both sets.

# 4. Stratified Splitting
X = df_clean.drop('loan_status', axis=1)
y = df_clean['loan_status']

# Split 80% Train, 20% Test with Stratify
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"\nTraining Set: {X_train.shape}")
print(f"Test Set: {X_test.shape}")

💡 Why Stratified Splitting?

stratify=y ensures that the training and test sets have the same proportion of Safe vs Risky loans as the original dataset (up to rounding).

Why This Matters:

  • Prevents the test set from having too many or too few risky loans
  • Ensures fair evaluation of the model
  • Critical for imbalanced datasets like ours (78% safe, 22% risky)

Result:

  • Training Set: 26,060 samples (80%)
  • Test Set: 6,516 samples (20%)
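
The effect of stratify=y can be checked on synthetic labels. This is a small sketch with made-up data (780 safe and 220 risky labels, a placeholder feature column), not the notebook's dataset.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic labels (hypothetical): 780 safe (0) and 220 risky (1), i.e. 22% risky
y = [0] * 780 + [1] * 220
X = [[i] for i in range(len(y))]  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of risky loans in each split
train_risky_share = Counter(y_train)[1] / len(y_train)
test_risky_share = Counter(y_test)[1] / len(y_test)

print(f"Risky share in train: {train_risky_share:.2%}")
print(f"Risky share in test:  {test_risky_share:.2%}")
```

Both splits should report a risky share of about 22%, matching the full label list.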

5️⃣ Preprocessing Pipeline

We build a Pipeline to automatically handle missing values, encode categorical data, and scale numerical features.

# 5. Preprocessing Pipeline
# Assumption: the column lists are not defined in this section; selecting them by dtype is one way to build them
numeric_features = X_train.select_dtypes(include='number').columns.tolist()
categorical_features = X_train.select_dtypes(exclude='number').columns.tolist()

# Define transformers for numeric and categorical columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', sparse_output=False))
])

# Combine transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

Pipeline Steps Explained:

For Numerical Features:

  1. SimpleImputer: Fills missing values with the mean (average) of that column
  2. StandardScaler: Normalizes values to have mean=0 and std=1 (puts everything on the same scale)

For Categorical Features:

  1. SimpleImputer: Fills missing values with the most frequent (mode) value
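  2. OneHotEncoder: Converts text categories (like "RENT", "OWN") into 0/1 indicator columns; drop='first' omits the first category of each feature, since it is implied when all other indicators are 0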
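
The steps above can be sketched in plain Python to show what each transformer computes. This illustrates the arithmetic only and is not scikit-learn's implementation; the helper names and sample values are hypothetical.

```python
import math


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (population std, as StandardScaler uses)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def one_hot_drop_first(values):
    """One 0/1 column per category, except the first category in sorted order."""
    kept = sorted(set(values))[1:]
    return kept, [[1 if v == c else 0 for c in kept] for v in values]


# Numeric column with one missing value (hypothetical incomes)
incomes = [30_000.0, None, 50_000.0]
scaled = standardize(impute_mean(incomes))

# Categorical column (hypothetical home-ownership values)
columns, encoded = one_hot_drop_first(["RENT", "OWN", "MORTGAGE", "RENT"])

print([round(v, 3) for v in scaled])
print(columns, encoded)
```

The missing income is filled with 40,000, so it maps to 0 after scaling. "MORTGAGE" is the dropped category and is encoded as all zeros.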

🔧 Why Use a Pipeline?

  • Consistency: Same preprocessing applied during training AND prediction
  • Organization: All preprocessing steps in one place
  • Easier Deployment: Save the entire pipeline, not just the model
  • Prevents Bugs: Can't forget a preprocessing step!

6️⃣ Model Training

We combine the preprocessing pipeline with the Random Forest Classifier and train the model.

# 6. Model Training
# Define the full pipeline with the model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(
        n_estimators=100,
        random_state=42,
        n_jobs=-1
    ))
])

print("\n🚀 Training Model...")
model.fit(X_train, y_train)
print("✅ Training Complete!")

🎯 Random Forest Parameters:

  • n_estimators=100: Builds 100 decision trees (more trees usually give more stable predictions with diminishing returns, at the cost of slower training and a larger saved model)
  • random_state=42: Ensures reproducible results (same random seed every time)
  • n_jobs=-1: Uses all CPU cores for faster training

What Happens During Training:

  1. The preprocessor transforms the training data (impute, scale, encode)
  2. Random Forest builds 100 decision trees
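  3. Each tree is fit on a bootstrap sample of the training rows and considers a random subset of features at each split
  4. The trees vote together: their predicted class probabilities are averaged, and the class with the highest average becomes the prediction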
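
The final voting step can be illustrated with a toy example. scikit-learn's RandomForestClassifier predicts the class with the highest mean probability across trees, which can differ from a simple majority of per-tree labels. The per-tree probabilities below are made up, and the helper name is hypothetical.

```python
def forest_predict(tree_probabilities):
    """Average per-tree class probabilities and return (predicted class, mean probabilities)."""
    n_trees = len(tree_probabilities)
    n_classes = len(tree_probabilities[0])
    mean_probs = [
        sum(probs[c] for probs in tree_probabilities) / n_trees
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: mean_probs[c])
    return predicted, mean_probs


# Hypothetical output of three trees for one loan: [P(safe), P(risky)]
tree_probabilities = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]

predicted_class, mean_probs = forest_predict(tree_probabilities)

# Simple majority of each tree's own top class, for comparison
tree_labels = [max(range(2), key=lambda c: probs[c]) for probs in tree_probabilities]
majority_class = max(set(tree_labels), key=tree_labels.count)

print(predicted_class, [round(p, 3) for p in mean_probs])
print(majority_class)
```

Here two of three trees lean toward Risky, yet the averaged probabilities favor Safe because the first tree is very confident.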

Training Time: Usually takes a few minutes depending on your computer's speed.

7️⃣ Loss Curve Visualization

We train a small neural network (MLPClassifier) on the preprocessed training data and plot its training loss. The curve describes this diagnostic network, not the Random Forest.

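# We use a separate MLP model just to visualize the loss curve
# Assumption: X_train_transformed is not defined in this section; here it comes from
# the preprocessor that was fitted inside the trained pipeline
X_train_transformed = model.named_steps['preprocessor'].transform(X_train)
diag_model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=50, random_state=42)
diag_model.fit(X_train_transformed, y_train)

# Plot the loss curve
plt.plot(diag_model.loss_curve_, label='Training Loss', color='blue')
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Training Loss Curve')
plt.legend()
plt.show()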

Why Visualize Loss?

  • Confirms Learning: If loss decreases over time, the model is learning
  • Shows Convergence: If the loss flattens out, extra iterations add little. Training loss alone cannot reveal overfitting; that requires comparing it with a validation loss
  • Diagnostic Tool: Helps identify training problems early

Note: Random Forest doesn't have a loss curve (it's not iterative), so we use MLPClassifier just for visualization purposes.
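
For intuition about the quantity on the y-axis: MLPClassifier minimizes log-loss (cross-entropy). The sketch below computes the plain binary log-loss for a few made-up predictions; the values in loss_curve_ may also include a regularization term, so treat this as an illustration of the core formula only.

```python
import math


def binary_log_loss(y_true, p_risky):
    """Mean negative log-likelihood of the true labels given predicted P(risky)."""
    total = 0.0
    for label, p in zip(y_true, p_risky):
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical labels and predictions for two loans (1 = risky, 0 = safe)
labels = [1, 0]
confident_loss = binary_log_loss(labels, [0.9, 0.2])  # mostly right and confident
unsure_loss = binary_log_loss(labels, [0.6, 0.5])     # barely better than guessing

print(f"Confident predictions: {confident_loss:.4f}")
print(f"Unsure predictions:    {unsure_loss:.4f}")
```

Better-calibrated, more confident correct predictions give a lower loss, which is what a falling curve reflects.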

8️⃣ Model Evaluation

We test the trained model on unseen test data to measure its performance.

# 8. Evaluation
y_pred = model.predict(X_test)

# Classification Report
print(classification_report(y_test, y_pred))

# Confusion Matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()

📊 Evaluation Metrics:

  • Accuracy: Overall percentage of correct predictions (can look high on an imbalanced dataset even when many risky loans are missed)
  • Precision: Of all predicted risky loans, how many were actually risky?
  • Recall: Of all actual risky loans, how many did we catch?
  • F1-Score: Harmonic mean of precision and recall

Confusion Matrix:

A table showing:

  • True Positives: Correctly predicted risky loans
  • True Negatives: Correctly predicted safe loans
  • False Positives: Predicted risky but actually safe (Type I error)
  • False Negatives: Predicted safe but actually risky (Type II error)

Goal: Maximize True Positives and True Negatives, minimize False Positives and False Negatives.
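
The metrics above follow directly from the four confusion-matrix counts. Here is a minimal sketch with hypothetical counts (these are not results from the notebook), treating Risky as the positive class.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 for the positive (risky) class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for 500 test loans
metrics = classification_metrics(tp=80, tn=390, fp=10, fn=20)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.94 while recall is only 0.80: one in five risky loans is still missed despite the high accuracy.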

9️⃣ Model Saving

Finally, we save the trained model so we can use it in our Flask web application.

import joblib

# Save the complete pipeline (preprocessor + model)
joblib.dump(model, 'credit_risk_model.pkl')
print("Model saved! Download credit_risk_model.pkl from the sidebar")

💾 Why Save the Complete Pipeline?

We save the entire pipeline (preprocessor + classifier), not just the model, because:

  • The preprocessor is needed for every prediction
  • Ensures consistent preprocessing (same as training)
  • Simplifies deployment - just load and use!

File Size: The saved model is typically 20-50 MB (before compression).

Next Steps:

  1. Download the credit_risk_model.pkl file from Google Colab
  2. Place it in the same folder as your app.py file
  3. Use joblib.load() in your Flask app to load the model
  4. Start making predictions!
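
The save-and-load round trip can be tried on a toy pipeline before wiring it into Flask. The sketch below uses a small stand-in model with made-up data and a temporary file named toy_model.pkl (all hypothetical). The real app would load credit_risk_model.pkl and pass it a DataFrame with the same columns used in training.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in data (hypothetical): two numeric features and a binary target
X = [[25, 0.05], [40, 0.30], [33, 0.10], [52, 0.45], [29, 0.08], [47, 0.38]]
y = [0, 1, 0, 1, 0, 1]

# Small pipeline standing in for the notebook's preprocessor + classifier
toy_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=42)),
])
toy_model.fit(X, y)

# Save the whole pipeline, then load it back as the Flask app would
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_model, model_path)
    loaded_model = joblib.load(model_path)

original_pred = toy_model.predict(X).tolist()
loaded_pred = loaded_model.predict(X).tolist()
print("Predictions match after reload:", original_pred == loaded_pred)
```

Because the scaler travels inside the saved pipeline, the loaded object accepts raw inputs and applies the same preprocessing as during training.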

🔄 Complete Workflow Summary

  1. Import Libraries → Load all necessary tools
  2. Load Data → Download dataset from Kaggle
  3. Feature Engineering → Create interest_burden feature
  4. Data Cleaning → Remove outliers and check quality
  5. Data Splitting → Split into train/test with stratification
  6. Build Pipeline → Create preprocessing pipeline
  7. Train Model → Fit Random Forest on training data
  8. Visualize → Plot loss curve (diagnostic)
  9. Evaluate → Test on unseen data
  10. Save Model → Export for use in Flask app

🎯 Key Takeaways

  • Feature Engineering: Creating interest_burden gives the model a direct measure of affordability, which may improve its predictions
  • Stratified Splitting: Essential for imbalanced datasets
  • Pipeline: Ensures consistent preprocessing during training and prediction
  • Random Forest: Powerful algorithm that works well on structured financial data
  • Evaluation: Always test on unseen data to measure real performance
  • Model Saving: Save the complete pipeline, not just the classifier
