Overview

The train_model.py script is a complete, standalone training script that loads data, preprocesses it, trains a Random Forest model, and saves it. This script can be run independently to retrain the model on your local machine.

🎯 Purpose

Train a Random Forest Classifier model from scratch using the Credit Risk Dataset, with proper preprocessing, feature engineering, and handling of imbalanced data.

Complete Code

    import pandas as pd
    import joblib
    import os

    # Import libraries
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.ensemble import RandomForestClassifier

    print("⏳ جاري تحميل البيانات وتدريب الموديل على جهازك...")

    # 1. Load the data
    df = pd.read_csv("credit_risk_dataset.csv")

    # 2. Cleaning and feature engineering
    df = df[df['person_age'] <= 100]  # Remove outliers
    df['interest_burden'] = (df['loan_amnt'] * (df['loan_int_rate'] / 100)) / df['person_income']
    df['interest_burden'] = df['interest_burden'].fillna(0)

    # 3. Split into features and target
    X = df.drop('loan_status', axis=1)
    y = df['loan_status']

    # 4. Preprocessing (Pipeline)
    numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
    categorical_features = X.select_dtypes(include=['object', 'category']).columns

    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ])

    # 5. Training
    model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42))
    ])

    model.fit(X, y)

    # 6. Saving
    joblib.dump(model, 'credit_risk_model.pkl')
    print("✅ تم إعادة التدريب وحفظ ملف 'credit_risk_model.pkl' الجديد بنجاح!")
    print("الآن يمكنك تشغيل app.py بدون مشاكل.")

📚 Table of Contents

1️⃣ Library Imports

    import pandas as pd
    import joblib
    import os

    # Import libraries
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.ensemble import RandomForestClassifier

What Each Library Does:

  • pandas: Load and manipulate the dataset
  • joblib: Save the trained model to a file
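  • os: Operating system interface (imported, but not actually used in this script)
  • sklearn: scikit-learn classes for imputation, scaling, encoding, pipelines, and the Random Forest model (train_test_split is imported but not used here)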

2️⃣ Load Data

    print("⏳ جاري تحميل البيانات وتدريب الموديل على جهازك...")

    # 1. Load the data
    df = pd.read_csv("credit_risk_dataset.csv")

What This Does:

  • Print Message: Shows a user-friendly message in Arabic: "Loading data and training the model on your device..."
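  • Load CSV: Reads the dataset from the credit_risk_dataset.csv file
  • The path is relative, so pandas looks for the file in the current working directory; keep the CSV next to the script and run the script from that folder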

📊 Dataset Requirements:

  • File name: credit_risk_dataset.csv
  • Location: Same folder as train_model.py, which should also be the working directory when you run the script
  • Format: CSV (Comma-Separated Values)
  • Expected columns: person_age, person_income, loan_status, etc.
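
Before training, it can help to confirm that the CSV actually has the columns the script reads by name. The sketch below is one possible check, not part of train_model.py: the helper missing_columns and the REQUIRED_COLUMNS list are hypothetical names, and the list covers only the columns referenced in the script (the real dataset has more). A toy DataFrame stands in for the loaded CSV.

```python
import pandas as pd

# Columns that train_model.py accesses by name (the full CSV contains more).
REQUIRED_COLUMNS = [
    "person_age", "person_income", "loan_amnt", "loan_int_rate", "loan_status",
]


def missing_columns(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


# Toy frame standing in for pd.read_csv("credit_risk_dataset.csv").
sample = pd.DataFrame({"person_age": [25], "person_income": [50000], "loan_status": [0]})
missing = missing_columns(sample)
print("Missing columns:", missing)
```

If the returned list is non-empty, the later feature-engineering step would fail with a KeyError, so stopping early gives a clearer message.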

3️⃣ Data Cleaning & Feature Engineering ⭐

    # 2. Cleaning and feature engineering
    df = df[df['person_age'] <= 100]  # Remove outliers

    # The additional feature
    df['interest_burden'] = (df['loan_amnt'] * (df['loan_int_rate'] / 100)) / df['person_income']
    df['interest_burden'] = df['interest_burden'].fillna(0)

Step 1: Remove Outliers

What This Does:

  • Filters out records where person_age > 100
  • These are considered outliers (impossible ages)
  • Outliers can hurt model performance

Why This Matters: Removing unrealistic data helps the model learn better patterns from valid examples.
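
The age filter can be tried on a few made-up rows. This is a minimal sketch with invented values, not data from the real CSV. One side effect worth knowing: a row whose age is missing is also dropped, because the comparison NaN <= 100 evaluates to False.

```python
import pandas as pd

# Invented ages: two impossible values and one missing value.
df = pd.DataFrame({"person_age": [22, 35, 144, 100, None]})

# Same boolean filter as in the script.
cleaned = df[df["person_age"] <= 100]
removed = len(df) - len(cleaned)

print(cleaned["person_age"].tolist())
print("Rows removed:", removed)
```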

Step 2: Feature Engineering - interest_burden ⭐

🔑 Creating interest_burden

Formula:

interest_burden = (loan_amnt × (loan_int_rate / 100)) / person_income

What It Represents:

  • Approximates the fraction of annual income that one year of simple interest on the full loan amount would consume (multiply by 100 for a percentage)
  • Lower values = Lower financial burden = Lower risk
  • This single feature captures an important financial relationship

Example Calculation:

  • Loan amount: $10,000
  • Interest rate: 5%
  • Annual income: $50,000

Calculation:

  • Annual interest = $10,000 × (5 / 100) = $500
  • interest_burden = $500 / $50,000 = 0.01 (or 1%)
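
The same arithmetic as a small function, using the numbers from the example. The function name interest_burden is chosen here for illustration; the script computes the value inline on DataFrame columns.

```python
def interest_burden(loan_amnt: float, loan_int_rate: float, person_income: float) -> float:
    """One year of simple interest on the loan, as a fraction of annual income."""
    annual_interest = loan_amnt * (loan_int_rate / 100)
    return annual_interest / person_income


# $10,000 loan at 5% with $50,000 income: about 0.01, i.e. roughly 1% of income.
burden = interest_burden(10_000, 5, 50_000)
print(round(burden, 4))
```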

Step 3: Handle Missing Values

df['interest_burden'] = df['interest_burden'].fillna(0)

What This Does: Fills any missing (NaN) values in interest_burden with 0.

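Why This Happens: If person_income, loan_amnt, or loan_int_rate is missing for a row, the result is NaN, and fillna(0) replaces it with 0. Note that a person_income of 0 with a nonzero loan produces inf rather than NaN in pandas, and fillna(0) leaves inf unchanged, so zero incomes need separate handling if they occur in the data.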
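
How pandas treats missing inputs versus a zero income can be checked on a few made-up rows. The sketch below is not taken from train_model.py; the replace step is one possible way to also neutralize infinite values, offered as an assumption about what a safe default would look like.

```python
import numpy as np
import pandas as pd

# Invented rows: a normal case, a missing interest rate, and a zero income.
df = pd.DataFrame({
    "loan_amnt": [10000, 5000, 8000],
    "loan_int_rate": [5.0, np.nan, 10.0],
    "person_income": [50000, 40000, 0],
})

burden = (df["loan_amnt"] * (df["loan_int_rate"] / 100)) / df["person_income"]

# fillna(0) alone fixes the NaN row but leaves the inf from division by zero.
filled_only = burden.fillna(0)

# Converting inf to NaN first lets fillna(0) cover both cases.
safe = burden.replace([np.inf, -np.inf], np.nan).fillna(0)

print(filled_only.tolist())
print(safe.tolist())
```

Whether 0 is the right replacement for a zero-income applicant is a modeling decision; the point here is only that inf does not disappear on its own.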

4️⃣ Data Splitting

    # 3. Split into features and target
    X = df.drop('loan_status', axis=1)
    y = df['loan_status']

What This Does:

  • X (Features): All columns except loan_status - these are the input features
  • y (Target): The loan_status column - this is what we want to predict

💡 Note About Train/Test Split

This script trains on the entire dataset (no train/test split). The comment says: "نستخدم كامل البيانات أو التقسيم، هنا للسرعة دربنا وكأننا نجهز المنتج النهائي"

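Translation: "We can use the full dataset or a split; here, for speed, we trained as if preparing the final product."

Why: For production deployment, the final model is sometimes refit on all available data so that no examples are held out. The trade-off is that this script produces no held-out performance estimate; evaluation is left to the notebook, which uses a train/test split.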
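
For comparison, a held-out evaluation might look like the sketch below. It is not part of train_model.py: it uses synthetic data from make_classification (with a class ratio chosen to resemble 78/22) instead of the credit dataset, and the 80/20 split and the use of classification_report are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the credit data.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.78, 0.22], random_state=42
)

# stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)

# Per-class precision and recall on rows the model never saw during training.
report = classification_report(y_test, clf.predict(X_test))
print(report)
```

Once the metrics look acceptable, refitting on all rows, as the script does, is a common final step.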

5️⃣ Preprocessing Pipeline

We create separate preprocessing pipelines for numerical and categorical features.

Step 1: Identify Feature Types

    numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
    categorical_features = X.select_dtypes(include=['object', 'category']).columns

What This Does:

  • numeric_features: Automatically finds all numeric columns (int64, float64)
  • categorical_features: Automatically finds all text/category columns (object, category)

Why This Matters: Different feature types need different preprocessing steps.
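
A quick way to see what select_dtypes returns is to run it on a tiny frame. The values below are invented, and person_home_ownership is used only as an example of a text column; its dtype is set to object explicitly so the result does not depend on the pandas version's default string dtype.

```python
import pandas as pd

# Tiny illustrative frame: one int column, one float column, one text column.
X = pd.DataFrame({
    "person_age": [22, 35],
    "loan_int_rate": [11.5, 7.9],
    "person_home_ownership": pd.Series(["RENT", "OWN"], dtype=object),
})

numeric_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(include=["object", "category"]).columns

print(list(numeric_features))
print(list(categorical_features))
```

Columns with other dtypes (for example bool or int32) would match neither list and would be dropped by the ColumnTransformer's default remainder setting, so it is worth checking that the two lists together cover every column.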

Step 2: Numerical Transformer Pipeline

    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])

For Numerical Features:

  1. SimpleImputer (median): Fills missing values with the median (middle value) of that column
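  2. StandardScaler: Standardizes each column to mean 0 and standard deviation 1 (puts everything on the same scale)

Why median instead of mean? The median is more robust to outliers: a few extreme values (for example, very high incomes) can pull the mean far from typical values, while the median barely moves.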
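
The difference between the two strategies shows up clearly on a column with one extreme value. The incomes below are invented for illustration; the pipeline at the end mirrors the numeric transformer described above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Invented incomes with one missing value and one extreme outlier.
incomes = np.array([[30_000.0], [40_000.0], [50_000.0], [np.nan], [1_000_000.0]])

# Value each strategy writes into the missing slot (row index 3).
median_fill = SimpleImputer(strategy="median").fit_transform(incomes)[3, 0]
mean_fill = SimpleImputer(strategy="mean").fit_transform(incomes)[3, 0]
print(median_fill, mean_fill)  # 45000.0 vs 280000.0

# Impute, then standardize: the output column has mean ~0 and std ~1.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
scaled = numeric_transformer.fit_transform(incomes)
```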

Step 3: Categorical Transformer Pipeline

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

For Categorical Features:

  1. SimpleImputer (most_frequent): Fills missing values with the most common value (mode)
  2. OneHotEncoder: Converts each text category into its own 0/1 indicator column
  3. handle_unknown='ignore' (an option of the encoder): If a category that was not seen during training appears at prediction time, it is encoded as all zeros instead of raising an error
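
The effect of handle_unknown='ignore' can be seen by encoding a category that was not present during fitting. The category names below are placeholders chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["RENT"], ["OWN"], ["MORTGAGE"]], dtype=object))

# Learned categories are stored in sorted order: MORTGAGE, OWN, RENT.
print(list(encoder.categories_[0]))

# A known category sets exactly one indicator to 1.
known = encoder.transform(np.array([["OWN"]], dtype=object)).toarray()

# An unseen category becomes a row of zeros instead of raising an error.
unknown = encoder.transform(np.array([["OTHER"]], dtype=object)).toarray()

print(known.tolist(), unknown.tolist())
```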

Step 4: Combine Transformers

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ])

🔧 ColumnTransformer Explained

What It Does: Applies different preprocessing to different column types:

  • Numeric columns → numeric_transformer (impute + scale)
  • Categorical columns → categorical_transformer (impute + encode)

Result: All features are processed and ready for the model in one step!
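
A reduced version of the combined preprocessor, run on four invented rows, shows the shape of what the classifier receives. The column name home is hypothetical, and the column lists are written out by hand here rather than detected with select_dtypes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one missing value in each column.
X = pd.DataFrame({
    "person_income": [30000.0, np.nan, 50000.0, 70000.0],
    "home": pd.Series(["RENT", "OWN", np.nan, "RENT"], dtype=object),
})

preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["person_income"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["home"]),
])

out = preprocessor.fit_transform(X)
# The result can be sparse or dense depending on its density; make it dense here.
out = out.toarray() if hasattr(out, "toarray") else np.asarray(out)

# One scaled numeric column plus two indicator columns (OWN, RENT).
print(out.shape)
```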

6️⃣ Model Training ⭐

    # 5. Training
    model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(
            n_estimators=100,
            class_weight='balanced',
            random_state=42
        ))
    ])

    model.fit(X, y)

🎯 Complete Pipeline

The model is a Pipeline with two steps:

  1. preprocessor: Handles all data preprocessing
  2. classifier: Random Forest model that makes predictions

Why Use Pipeline? Ensures preprocessing is applied consistently during training AND prediction.

Random Forest Parameters:

  • n_estimators=100: Builds 100 decision trees. More trees usually give more stable predictions, with diminishing returns and longer training time.
  • class_weight='balanced': This is important! Automatically adjusts class weights to handle imbalanced data.
    • Our dataset has 78% Safe loans and 22% Risky loans
    • 'balanced' gives more weight to the minority class (Risky)
    • Helps the model learn to predict risky loans better
  • random_state=42: Fixes the random seed so that repeated runs on the same data produce the same model

💡 class_weight='balanced' Explained

Problem: Imbalanced dataset (78% Safe, 22% Risky)

Without class_weight: Model might ignore the minority class and always predict "Safe"

With class_weight='balanced':

  • Automatically calculates weights as n_samples / (n_classes × class count): for a 78/22 split, Safe ≈ 0.64 and Risky ≈ 2.27
  • Model pays more attention to risky loans during training
  • Better at detecting risky applicants (which is what we want!)
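
The balanced weights can be reproduced with scikit-learn's compute_class_weight helper. The sketch below uses 100 made-up labels in a 78/22 ratio and assumes that 0 means Safe and 1 means Risky; with the real dataset's exact class counts the numbers will differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 78 "Safe" (0) and 22 "Risky" (1) labels, mimicking the stated class ratio.
y = np.array([0] * 78 + [1] * 22)

# 'balanced' uses n_samples / (n_classes * count_of_class).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

print(weights.round(2))  # roughly [0.64, 2.27]
```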

The Training Process

When model.fit(X, y) is called:

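  1. Preprocessor transforms the data (impute, scale, encode)
  2. Random Forest builds 100 decision trees
  3. Each tree learns patterns from a bootstrap sample of the data
  4. class_weight='balanced' makes errors on the minority class count more while the trees are built
  5. At prediction time, the forest combines the trees by averaging their predicted class probabilities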

Training Time: Usually takes 1-5 minutes depending on your computer and dataset size.

7️⃣ Model Saving

    # 6. Saving (this is where the problem gets solved)
    # The model is saved with the scikit-learn version installed on your machine
    joblib.dump(model, 'credit_risk_model.pkl')
    print("✅ تم إعادة التدريب وحفظ ملف 'credit_risk_model.pkl' الجديد بنجاح!")
    print("الآن يمكنك تشغيل app.py بدون مشاكل.")

What This Does:

  • joblib.dump(): Saves the complete pipeline (preprocessor + classifier) to a file
  • File Name: credit_risk_model.pkl
  • File Location: The current working directory (the script's own folder when you run it from there)
  • Success Message: Prints confirmation in Arabic

💾 Why Save the Complete Pipeline?

We save the entire pipeline (preprocessor + classifier), not just the model, because:

  • The preprocessor is needed for every prediction
  • Ensures consistent preprocessing (same as training)
  • Simplifies deployment - just load and use!
  • No need to remember preprocessing steps separately
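
A save-and-load round trip illustrates why the single file is enough. This sketch trains a tiny stand-in pipeline on four invented rows and writes it to a temporary folder; it does not reproduce the real model or how app.py loads it, which is not shown in this document.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Four invented rows; 1 = risky, 0 = safe (assumed label meaning).
X = pd.DataFrame({
    "person_income": [20000.0, 30000.0, 80000.0, 90000.0],
    "loan_amnt": [15000.0, 12000.0, 5000.0, 4000.0],
})
y = [1, 1, 0, 0]

model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=42)),
])
model.fit(X, y)

# Save to a temporary folder, then load it back as a consumer script would.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "credit_risk_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw rows and applies the scaling itself.
same = bool((restored.predict(X) == model.predict(X)).all())
print(same)
```

Loading generally works best with the same scikit-learn version that saved the file, which matches the note in the script's comment about saving with the locally installed version.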

📊 Expected Output:

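    ⏳ جاري تحميل البيانات وتدريب الموديل على جهازك...
    ✅ تم إعادة التدريب وحفظ ملف 'credit_risk_model.pkl' الجديد بنجاح!
    الآن يمكنك تشغيل app.py بدون مشاكل.

Translation: "Loading data and training the model on your device..." / "Retraining finished and the new 'credit_risk_model.pkl' file was saved successfully!" / "You can now run app.py without problems."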

🔄 Key Differences from Jupyter Notebook

1. No Train/Test Split

This script trains on the entire dataset, while the notebook uses train/test split for evaluation.

2. class_weight='balanced'

This script uses class_weight='balanced' to handle imbalanced data, which is important for production models.

3. Simpler Structure

This is a standalone script - no cells, no visualization, just training and saving.

4. Direct CSV Loading

Loads CSV directly from local file, not from Kaggle (unlike the notebook).

🚀 How to Use This Script

Step-by-Step Instructions:

  1. Prepare the dataset: Place credit_risk_dataset.csv in the same folder as train_model.py, and run the script from that folder (the CSV path is relative to the working directory)
  2. Install dependencies: Make sure you have all required libraries:
    pip install pandas scikit-learn joblib
  3. Run the script:
    python train_model.py
  4. Wait for completion: The script will:
    • Load and clean the data
    • Create the interest_burden feature
    • Build the preprocessing pipeline
    • Train the Random Forest model
    • Save the model to credit_risk_model.pkl
  5. Use the model: The saved model can now be loaded in app.py

⚠️ Important Notes:

  • The script will overwrite any existing credit_risk_model.pkl file
  • Training takes several minutes depending on your computer
  • Make sure you have enough RAM (the dataset is ~32,000 rows)
  • After training, you can optionally run shrink_model.py to compress the file

🔄 Complete Training Workflow

  1. Load Data → Read CSV file
  2. Clean Data → Remove outliers (age > 100)
  3. Feature Engineering → Create interest_burden feature
  4. Split Features/Target → Separate X (features) and y (target)
  5. Identify Feature Types → Separate numeric and categorical columns
  6. Build Preprocessing Pipelines → Create transformers for each type
  7. Combine Transformers → Use ColumnTransformer
  8. Create Model Pipeline → Combine preprocessor + classifier
  9. Train Model → Fit on entire dataset
  10. Save Model → Export to credit_risk_model.pkl

🎯 Key Takeaways

  • Standalone Script: Can be run independently to retrain the model
  • Feature Engineering: Creates interest_burden to improve predictions
  • Pipeline Approach: Ensures consistent preprocessing
  • class_weight='balanced': Handles imbalanced dataset automatically
  • Complete Pipeline: Saves preprocessor + model together
  • Production Ready: Trains on the full dataset so that no examples are held out; performance is evaluated separately (for example, in the notebook)

📁 Related Files

This script is part of the complete training workflow:

🔄 Complete Workflow:

  1. Train model using train_model.py → Creates credit_risk_model.pkl
  2. (Optional) Compress using shrink_model.py → Reduces file size
  3. Deploy with app.py → Loads model and makes predictions