train_model.py Explanation - Smart Credit Risk System

Overview

The train_model.py script is a complete, standalone training script that loads data, preprocesses it, trains a Random Forest model, and saves it. This script can be run independently to retrain the model on your local machine.

🎯 Purpose

Train a Random Forest Classifier model from scratch using the Credit Risk Dataset, with proper preprocessing, feature engineering, and handling of imbalanced data.

Complete Code

import pandas as pd
import joblib
import os

# Import the libraries
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Prints: "Loading data and training the model on your device..."
print("⏳ جاري تحميل البيانات وتدريب الموديل على جهازك...")

# 1. Load the data
df = pd.read_csv("credit_risk_dataset.csv")

# 2. Cleaning and feature engineering
df = df[df['person_age'] <= 100]  # Drop outliers
df['interest_burden'] = (df['loan_amnt'] * (df['loan_int_rate'] / 100)) / df['person_income']
df['interest_burden'] = df['interest_burden'].fillna(0)

# 3. Split into features and target
X = df.drop('loan_status', axis=1)
y = df['loan_status']

# 4. Preprocessing (Pipeline)
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# 5. Training
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42))
])

model.fit(X, y)

# 6. Saving
joblib.dump(model, 'credit_risk_model.pkl')

# Prints: "Retraining finished and the new 'credit_risk_model.pkl' file was saved successfully!"
print("✅ تم إعادة التدريب وحفظ ملف 'credit_risk_model.pkl' الجديد بنجاح!")
# Prints: "You can now run app.py without problems."
print("الآن يمكنك تشغيل app.py بدون مشاكل.")
            

1️⃣ Library Imports

import pandas as pd
import joblib
import os

# Import the libraries
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
            

What Each Library Does:

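pandas: Load and manipulate the dataset
joblib: Save the trained model to a file
os: Operating system interface (imported but not used in this script)
sklearn: All Machine Learning tools (preprocessing, models, etc.). Note that train_test_split is imported but never called, because the script trains on the full dataset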

2️⃣ Load Data

# Prints: "Loading data and training the model on your device..."
print("⏳ جاري تحميل البيانات وتدريب الموديل على جهازك...")

# 1. Load the data
df = pd.read_csv("credit_risk_dataset.csv")
            

What This Does:

Print Message: Shows a user-friendly message in Arabic: "Loading data and training the model on your device..."
Load CSV: Reads the dataset from the credit_risk_dataset.csv file
The path is relative, so the file must be in the directory you run the script from (normally the same folder as the script)

📊 Dataset Requirements:

File name: credit_risk_dataset.csv
Location: Same folder as train_model.py (and run the script from that folder)
Format: CSV (Comma-Separated Values)
Expected columns: person_age, person_income, loan_amnt, loan_int_rate, loan_status, etc.

            

3️⃣ Data Cleaning & Feature Engineering ⭐

# 2. Cleaning and feature engineering
df = df[df['person_age'] <= 100]  # Drop outliers

# The additional feature
df['interest_burden'] = (df['loan_amnt'] * (df['loan_int_rate'] / 100)) / df['person_income']
df['interest_burden'] = df['interest_burden'].fillna(0)
            

Step 1: Remove Outliers

What This Does:

Filters out records where person_age > 100
These are treated as outliers (implausible ages, likely data-entry errors)
Outliers can hurt model performance

Why This Matters: Removing unrealistic data helps the model learn better patterns from valid examples.
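
A minimal sketch of the same filter on a made-up frame (the ages below are invented for illustration). It also shows a side effect worth knowing: a missing person_age compares as False against 100, so that row is dropped as well.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; 144 and 123 stand in for data-entry errors.
df_demo = pd.DataFrame({"person_age": [23, 144, 58, 123, 100, np.nan]})

# Same boolean mask as the script: keep rows with age <= 100.
# NaN <= 100 evaluates to False, so the row with a missing age is removed too.
filtered = df_demo[df_demo["person_age"] <= 100]

print(filtered["person_age"].tolist())  # [23.0, 58.0, 100.0]
```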

Step 2: Feature Engineering - interest_burden ⭐

🔑 Creating interest_burden

Formula:

interest_burden = (loan_amnt × (loan_int_rate / 100)) / person_income

What It Represents:

Approximates the fraction of annual income that goes toward one year of loan interest (simple interest on the full loan amount)
Lower values = Lower financial burden, which generally means lower risk
This single feature combines loan size, interest rate, and income into one ratio

Example Calculation:

Loan amount: $10,000
Interest rate: 5%
Annual income: $50,000

Calculation:

Annual interest = $10,000 × (5 / 100) = $500
interest_burden = $500 / $50,000 = 0.01 (or 1%)
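
The same arithmetic can be reproduced with pandas. This is a sketch on a single made-up row that reuses the column names from the script; the values are the ones from the example above.

```python
import pandas as pd

# One hypothetical applicant, using the script's column names.
sample = pd.DataFrame({
    "loan_amnt": [10_000],
    "loan_int_rate": [5.0],     # expressed in percent, as in the dataset
    "person_income": [50_000],
})

# Annual interest = loan amount x (rate / 100)
annual_interest = sample["loan_amnt"] * (sample["loan_int_rate"] / 100)

# interest_burden = annual interest / annual income
sample["interest_burden"] = annual_interest / sample["person_income"]

print(round(annual_interest.iloc[0], 2))            # 500.0
print(round(sample["interest_burden"].iloc[0], 4))  # 0.01
```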

Step 3: Handle Missing Values

df['interest_burden'] = df['interest_burden'].fillna(0)

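What This Does: Fills any missing (NaN) values in interest_burden with 0.

Why This Happens: The result is NaN whenever loan_amnt, loan_int_rate, or person_income is missing, or when both the numerator and person_income are 0. Note that a positive numerator divided by a person_income of 0 yields inf rather than NaN, and fillna(0) leaves inf untouched. Filling with 0 keeps the column complete, but it also makes those rows look like the lowest-burden applicants, so treat it as a convenience default rather than a neutral one.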
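
A quick way to see which rows fillna(0) actually changes is to run the same formula on a few edge cases. The sketch below uses made-up rows; the replace(...) step at the end is a suggested addition for handling infinities, not something train_model.py does.

```python
import numpy as np
import pandas as pd

# Hypothetical edge cases for the interest_burden formula.
demo = pd.DataFrame({
    "loan_amnt":     [10_000, 10_000, 0,   10_000],
    "loan_int_rate": [5.0,    np.nan, 5.0, 5.0],
    "person_income": [50_000, 50_000, 0,   0],
})

burden = (demo["loan_amnt"] * (demo["loan_int_rate"] / 100)) / demo["person_income"]

# What the script does: only NaN values are replaced.
filled = burden.fillna(0)
# Row 0: ordinary case        -> 0.01
# Row 1: missing rate         -> NaN -> 0
# Row 2: 0 / 0                -> NaN -> 0
# Row 3: 500 / 0              -> inf (fillna does not touch it)

# Optional extra step (an assumption, not in the script): map inf to NaN first.
cleaned = burden.replace([np.inf, -np.inf], np.nan).fillna(0)

print(filled.tolist())
print(cleaned.tolist())
```

An inf left in the feature matrix matters in practice, because scikit-learn's input validation rejects infinite values when the pipeline is fitted.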

4️⃣ Data Splitting

# 3. Split into features and target
X = df.drop('loan_status', axis=1)
y = df['loan_status']
            

What This Does:

X (Features): All columns except loan_status - these are the input features
y (Target): The loan_status column - this is what we want to predict

💡 Note About Train/Test Split

This script trains on the entire dataset (no train/test split). The accompanying comment, translated from Arabic, reads: "We can use either the full dataset or a split; here, for speed, we train as if preparing the final product."

Why: For production deployment, it is common to train on all available data so the final model sees as many examples as possible. The trade-off is that this script reports no held-out score, so evaluation has to happen elsewhere (for example, in the notebook, which uses a train/test split).
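
If you want a quick quality check before training on everything, one option is to hold out part of the data first. The sketch below is not part of train_model.py: it uses synthetic data from make_classification as a stand-in for the credit dataset (with an assumed 78/22 class split) and the already-imported train_test_split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target (assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.78, 0.22], random_state=42
)

# stratify keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)

# Precision and recall per class on data the model has not seen.
report = classification_report(y_test, clf.predict(X_test))
print(report)
```

Once the held-out numbers look acceptable, refitting on the full dataset (as the script does) gives the model that gets saved.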

5️⃣ Preprocessing Pipeline

We create separate preprocessing pipelines for numerical and categorical features.

Step 1: Identify Feature Types

numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns
            

What This Does:

numeric_features: Automatically finds all numeric columns (int64, float64)
categorical_features: Automatically finds all text/category columns (object, category)

Why This Matters: Different feature types need different preprocessing steps.
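
To see what select_dtypes returns, here is a small sketch on a made-up frame. The column person_home_ownership is an illustrative name, and the explicit dtype arguments are only there to make the example behave the same across pandas versions.

```python
import pandas as pd

# Hypothetical two-row frame with one column of each kind.
X_demo = pd.DataFrame({
    "person_age": pd.Series([25, 40], dtype="int64"),
    "loan_int_rate": pd.Series([11.5, 7.9], dtype="float64"),
    "person_home_ownership": pd.Series(["RENT", "OWN"], dtype="object"),
})

# Same calls as in the script.
numeric_features = X_demo.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X_demo.select_dtypes(include=["object", "category"]).columns

print(list(numeric_features))      # ['person_age', 'loan_int_rate']
print(list(categorical_features))  # ['person_home_ownership']
```

Columns with any other dtype (for example bool or int32) match neither list and would be dropped by the ColumnTransformer's default remainder setting.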

Step 2: Numerical Transformer Pipeline

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
            

For Numerical Features:

SimpleImputer (median): Fills missing values with the median (middle value) of that column
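StandardScaler: Standardizes each column to mean=0 and std=1 (puts everything on the same scale). Tree-based models such as Random Forest do not depend on feature scale, so this step is harmless here but not strictly required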

Why median instead of mean? Median is more robust to outliers than mean.
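
The difference is easy to see on a column with one extreme value. The incomes below are invented for illustration; the pipeline at the end mirrors the numeric_transformer from the script.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical income column: one very large value and one missing value.
incomes = np.array([[30_000.0], [40_000.0], [50_000.0], [1_000_000.0], [np.nan]])

median_imputer = SimpleImputer(strategy="median").fit(incomes)
mean_imputer = SimpleImputer(strategy="mean").fit(incomes)

print(median_imputer.statistics_[0])  # 45000.0
print(mean_imputer.statistics_[0])    # 280000.0 (pulled up by the outlier)

# Same two steps as the script's numeric_transformer.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
scaled = numeric_transformer.fit_transform(incomes)

# After scaling, the column has mean ~0 and standard deviation ~1.
print(scaled.mean(), scaled.std())
```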

Step 3: Categorical Transformer Pipeline

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
            

For Categorical Features:

SimpleImputer (most_frequent): Fills missing values with the most common value (mode)
OneHotEncoder: Converts text categories into numbers (0s and 1s)
handle_unknown='ignore': If a category not seen during training appears at prediction time, it is encoded as all zeros instead of raising an error
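
A short sketch of that behavior, using made-up category values (RENT, OWN, MORTGAGE, OTHER are illustrative and not taken from the dataset description above):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Categories seen during training (hypothetical values).
train_cats = np.array([["RENT"], ["OWN"], ["MORTGAGE"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore").fit(train_cats)
print(encoder.categories_[0].tolist())  # ['MORTGAGE', 'OWN', 'RENT'] (sorted)

# A known category gets a single 1 in its column.
known = encoder.transform(np.array([["OWN"]], dtype=object)).toarray()
print(known.tolist())   # [[0.0, 1.0, 0.0]]

# An unseen category becomes a row of zeros instead of raising an error.
unseen = encoder.transform(np.array([["OTHER"]], dtype=object)).toarray()
print(unseen.tolist())  # [[0.0, 0.0, 0.0]]
```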

Step 4: Combine Transformers

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
            

🔧 ColumnTransformer Explained

What It Does: Applies different preprocessing to different column types:

Numeric columns → numeric_transformer (impute + scale)
Categorical columns → categorical_transformer (impute + encode)

Result: All features are processed and ready for the model in one step!
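
To make the combined step concrete, the sketch below runs the same two transformers on a four-row made-up frame. The column loan_intent and all values are illustrative assumptions; the feature lists are written out by hand here instead of using select_dtypes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical frame with one missing number and one missing category.
X_demo = pd.DataFrame({
    "person_age": [22, 35, 47, 29],
    "person_income": [30_000.0, np.nan, 82_000.0, 54_000.0],
    "loan_intent": pd.Series(["EDUCATION", "MEDICAL", np.nan, "EDUCATION"], dtype="object"),
})
numeric_features = ["person_age", "person_income"]
categorical_features = ["loan_intent"]

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

features = preprocessor.fit_transform(X_demo)
# The result may be sparse or dense depending on its density; normalize to dense.
if hasattr(features, "toarray"):
    features = features.toarray()
features = np.asarray(features, dtype=float)

# 2 scaled numeric columns + 2 one-hot columns (EDUCATION, MEDICAL)
print(features.shape)  # (4, 4)
```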

6️⃣ Model Training ⭐

# 5. Training
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(
        n_estimators=100, 
        class_weight='balanced', 
        random_state=42
    ))
])

model.fit(X, y)
            

🎯 Complete Pipeline

The model is a Pipeline with two steps:

preprocessor: Handles all data preprocessing
classifier: Random Forest model that makes predictions

Why Use Pipeline? Ensures preprocessing is applied consistently during training AND prediction.

Random Forest Parameters:

n_estimators=100: Builds 100 decision trees. More trees usually give more stable predictions, with diminishing returns and longer training.
class_weight='balanced': ⭐ This is important! Automatically adjusts weights to handle imbalanced data.
  - Our dataset has 78% Safe loans and 22% Risky loans
  - 'balanced' gives more weight to the minority class (Risky)
  - Helps the model learn to predict risky loans better
random_state=42: Ensures reproducible results (same random seed every time)

💡 class_weight='balanced' Explained

Problem: Imbalanced dataset (78% Safe, 22% Risky)

Without class_weight: Model might ignore the minority class and always predict "Safe"

With class_weight='balanced':

Automatically calculates weights as n_samples / (n_classes × class_count): Safe ≈ 0.64, Risky ≈ 2.27
Model pays more attention to risky loans during training
Better at detecting risky applicants (which is what we want!)
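
The weights can be reproduced with scikit-learn's compute_class_weight helper, which uses the formula n_samples / (n_classes × class_count). The sketch below assumes a toy target with exactly 78 Safe (0) and 22 Risky (1) labels to mirror the percentages above.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target: 78 safe loans (0) and 22 risky loans (1).
y_demo = np.array([0] * 78 + [1] * 22)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)

# 100 / (2 * 78) and 100 / (2 * 22)
print(np.round(weights, 2))  # [0.64 2.27]
```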

The Training Process

When model.fit(X, y) is called:

Preprocessor transforms the data (impute, scale, encode)
Random Forest builds 100 decision trees
Each tree learns patterns from the data
Trees vote together to make predictions
class_weight ensures balanced learning

Training Time: Usually takes 1-5 minutes depending on your computer and dataset size.
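
The "voting" step can be inspected directly on a fitted forest. In scikit-learn the forest averages the class probabilities predicted by its trees, which the sketch below checks on synthetic data (make_classification is a stand-in for the real dataset).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption, not the credit dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=42)

forest = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
forest.fit(X_demo, y_demo)

# One fitted decision tree per estimator.
print(len(forest.estimators_))  # 100

# Average the per-tree probabilities by hand for the first five rows...
per_tree = np.stack([tree.predict_proba(X_demo[:5]) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# ...and compare with what the forest itself reports.
matches = np.allclose(averaged, forest.predict_proba(X_demo[:5]))
print(matches)
```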

7️⃣ Model Saving

# 6. Save (this is where the problem gets solved)
# The model is saved with the scikit-learn version installed on your machine
joblib.dump(model, 'credit_risk_model.pkl')

print("✅ تم إعادة التدريب وحفظ ملف 'credit_risk_model.pkl' الجديد بنجاح!")
print("الآن يمكنك تشغيل app.py بدون مشاكل.")
            

What This Does:

joblib.dump(): Saves the complete pipeline (preprocessor + classifier) to a file
File Name: credit_risk_model.pkl
File Location: The current working directory (the script's folder, if you run it from there)
Success Message: Prints confirmation in Arabic

💾 Why Save the Complete Pipeline?

We save the entire pipeline (preprocessor + classifier), not just the model, because:

The preprocessor is needed for every prediction
Ensures consistent preprocessing (same as training)
Simplifies deployment - just load and use!
No need to remember preprocessing steps separately
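
A minimal round-trip sketch of this idea: fit a small pipeline, save it with joblib, load it back, and predict on raw input. The data, the file name demo_model.pkl, and the simplified two-step pipeline are illustrative assumptions; app.py may load and call the model differently.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up training set (values are illustrative only).
X_demo = pd.DataFrame({
    "person_income": [30_000, 82_000, 54_000, 41_000, 120_000, 25_000],
    "loan_amnt": [5_000, 12_000, 8_000, 20_000, 10_000, 15_000],
})
y_demo = [0, 0, 0, 1, 0, 1]

pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=42)),
])
pipeline.fit(X_demo, y_demo)

# Save and reload the whole pipeline, preprocessing included.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(pipeline, path)
    loaded = joblib.load(path)

# Raw, unscaled input goes straight into the loaded object.
new_applicant = pd.DataFrame({"person_income": [60_000], "loan_amnt": [9_000]})
original_pred = pipeline.predict(new_applicant)
loaded_pred = loaded.predict(new_applicant)
print(loaded_pred)
```

Saved pipelines are tied to the library versions that created them, which is why retraining locally (as the script's comment notes) avoids version-mismatch problems when app.py loads the file.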

📊 Expected Output:

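⏳ جاري تحميل البيانات وتدريب الموديل على جهازك...
✅ تم إعادة التدريب وحفظ ملف 'credit_risk_model.pkl' الجديد بنجاح!
الآن يمكنك تشغيل app.py بدون مشاكل.

In English:

⏳ Loading data and training the model on your device...
✅ Retraining finished and the new 'credit_risk_model.pkl' file was saved successfully!
You can now run app.py without problems.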
                

🔄 Key Differences from Jupyter Notebook

1. No Train/Test Split

This script trains on the entire dataset, while the notebook uses train/test split for evaluation.

2. class_weight='balanced'

This script uses class_weight='balanced' to handle imbalanced data, which is important for production models.

3. Simpler Structure

This is a standalone script - no cells, no visualization, just training and saving.

4. Direct CSV Loading

Loads CSV directly from local file, not from Kaggle (unlike the notebook).

🚀 How to Use This Script

Step-by-Step Instructions:

1. Prepare the dataset: Ensure credit_risk_dataset.csv is in the same folder as train_model.py

2. Install dependencies: Make sure you have all required libraries:

pip install pandas scikit-learn joblib

3. Run the script:

python train_model.py

4. Wait for completion: The script will:
   - Load and clean the data
   - Create the interest_burden feature
   - Build the preprocessing pipeline
   - Train the Random Forest model
   - Save the model to credit_risk_model.pkl

5. Use the model: The saved model can now be loaded in app.py

⚠️ Important Notes:

The script will overwrite any existing credit_risk_model.pkl file
Training can take a few minutes, depending on your computer
Make sure you have enough RAM (the dataset is ~32,000 rows)
After training, you can optionally run shrink_model.py to compress the file

🔄 Complete Training Workflow

Load Data → Read CSV file
Clean Data → Remove outliers (age > 100)
Feature Engineering → Create interest_burden feature
Split Features/Target → Separate X (features) and y (target)
Identify Feature Types → Separate numeric and categorical columns
Build Preprocessing Pipelines → Create transformers for each type
Combine Transformers → Use ColumnTransformer
Create Model Pipeline → Combine preprocessor + classifier
Train Model → Fit on entire dataset
Save Model → Export to credit_risk_model.pkl

            

🎯 Key Takeaways

Standalone Script: Can be run independently to retrain the model
Feature Engineering: Creates interest_burden to improve predictions
Pipeline Approach: Ensures consistent preprocessing
class_weight='balanced': Handles imbalanced dataset automatically
Complete Pipeline: Saves preprocessor + model together
Production Ready: Trains on the full dataset so the final model uses every available example (evaluate separately, since no test set is held out)

📁 Related Files

This script is part of the complete training workflow:

Credit_Risk_Dataset.ipynb: Interactive notebook version with visualization
train_model.py: Standalone training script (this file)
shrink_model.py: Compresses the trained model
app.py: Loads and uses the trained model

🔄 Complete Workflow:

Train model using train_model.py → Creates credit_risk_model.pkl
(Optional) Compress using shrink_model.py → Reduces file size
Deploy with app.py → Loads model and makes predictions

            
