train_model.py - Complete Training Script
A standalone Python script to train the Random Forest model from scratch
Overview
The train_model.py script is a complete, standalone training script that loads data,
preprocesses it, trains a Random Forest model, and saves it. This script can be run independently
to retrain the model on your local machine.
🎯 Purpose
Train a Random Forest Classifier model from scratch using the Credit Risk Dataset, with proper preprocessing, feature engineering, and handling of imbalanced data.
Complete Code
import pandas as pd
import joblib
import os
# Import the libraries
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
print("⏳ جاري تحميل البيانات وتدريب الموديل على جهازك...")
# 1. Load the data
df = pd.read_csv("credit_risk_dataset.csv")
# 2. Cleaning and feature engineering
df = df[df['person_age'] <= 100]  # drop outliers
df['interest_burden'] = (df['loan_amnt'] * (df['loan_int_rate'] / 100)) / df['person_income']
df['interest_burden'] = df['interest_burden'].fillna(0)
# 3. Split into features and target
X = df.drop('loan_status', axis=1)
y = df['loan_status']
# 4. Preprocessing (Pipeline)
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
# 5. Training
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42))
])
model.fit(X, y)
# 6. Saving
joblib.dump(model, 'credit_risk_model.pkl')
print("✅ تم إعادة التدريب وحفظ ملف 'credit_risk_model.pkl' الجديد بنجاح!")
print("الآن يمكنك تشغيل app.py بدون مشاكل.")
1️⃣ Library Imports
import pandas as pd
import joblib
import os
# Import the libraries
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
What Each Library Does:
- pandas: Load and manipulate the dataset
- joblib: Save the trained model to a file
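- os: Operating system interface (imported, but not actually used anywhere in this script)
- sklearn: Machine learning tools (imputation, scaling, encoding, pipelines, and the Random Forest model). Note that train_test_split is imported but never called here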
2️⃣ Load Data
print("⏳ جاري تحميل البيانات وتدريب الموديل على جهازك...")
# 1. Load the data
df = pd.read_csv("credit_risk_dataset.csv")
What This Does:
- Print Message: Shows a user-friendly message in Arabic: "Loading data and training the model on your device..."
- Load CSV: Reads the dataset from the credit_risk_dataset.csv file
- The path is relative, so it is resolved against the current working directory: run the script from the folder that contains the file
📊 Dataset Requirements:
- File name: credit_risk_dataset.csv
- Location: Same folder as train_model.py
- Format: CSV (Comma-Separated Values)
- Expected columns: person_age, person_income, loan_status, etc.
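A quick pre-flight check can catch a missing file or column before training starts. The sketch below is one possible way to do that; `check_dataset` and `REQUIRED_COLUMNS` are hypothetical names that do not appear in train_model.py, and the list only covers the columns the script references by name.

```python
import os
import tempfile

import pandas as pd

# Columns that train_model.py references by name; the real file has more.
REQUIRED_COLUMNS = ["person_age", "person_income", "loan_amnt",
                    "loan_int_rate", "loan_status"]


def check_dataset(path, required=REQUIRED_COLUMNS):
    """Return the required columns that are missing from the CSV header."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Dataset not found: {path}")
    header = pd.read_csv(path, nrows=0)  # reads only the header row
    return [col for col in required if col not in header.columns]


# Demo on a throwaway CSV that lacks two of the required columns.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.csv")
    pd.DataFrame(
        {"person_age": [30], "person_income": [40000], "loan_status": [0]}
    ).to_csv(toy_path, index=False)
    missing = check_dataset(toy_path)

print(missing)  # ['loan_amnt', 'loan_int_rate']
```

In practice the call would be `check_dataset("credit_risk_dataset.csv")`, and an empty list means every referenced column is present.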
3️⃣ Data Cleaning & Feature Engineering ⭐
# 2. Cleaning and feature engineering
df = df[df['person_age'] <= 100]  # drop outliers
# The additional feature
df['interest_burden'] = (df['loan_amnt'] * (df['loan_int_rate'] / 100)) / df['person_income']
df['interest_burden'] = df['interest_burden'].fillna(0)
Step 1: Remove Outliers
What This Does:
- Filters out records where person_age > 100
- These are considered outliers (implausible ages)
- Outliers can hurt model performance
Why This Matters: Removing unrealistic data helps the model learn better patterns from valid examples.
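As a small illustration of the filter, the sketch below applies the same condition to a toy column of ages. The values are made up for the example and are not taken from the real dataset.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"person_age": [22, 35, 144, 58, 123]})

# Same boolean filter as the script: keep rows with age <= 100.
cleaned = toy[toy["person_age"] <= 100]
removed = len(toy) - len(cleaned)

print(f"Removed {removed} of {len(toy)} rows")  # Removed 2 of 5 rows
```

Printing the number of removed rows is a cheap way to confirm the filter only drops a handful of records.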
Step 2: Feature Engineering - interest_burden ⭐
🔑 Creating interest_burden
Formula:
interest_burden = (loan_amnt × (loan_int_rate / 100)) / person_income
What It Represents:
- Expresses one year of simple interest on the loan as a fraction of annual income (0.01 means 1%)
- Lower values = Lower financial burden = Lower risk
- This single feature captures an important financial relationship
Example Calculation:
- Loan amount: $10,000
- Interest rate: 5%
- Annual income: $50,000
Calculation:
- Annual interest = $10,000 × (5 / 100) = $500
- interest_burden = $500 / $50,000 = 0.01 (or 1%)
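The same arithmetic can be checked in a few lines of plain Python. The helper name `interest_burden` is chosen here for illustration; the script computes the value directly on DataFrame columns.

```python
def interest_burden(loan_amnt, loan_int_rate, person_income):
    """One year of simple interest as a fraction of annual income."""
    annual_interest = loan_amnt * (loan_int_rate / 100)
    return annual_interest / person_income


burden = interest_burden(loan_amnt=10_000, loan_int_rate=5, person_income=50_000)
print(f"{burden:.2%}")  # 1.00%
```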
Step 3: Handle Missing Values
df['interest_burden'] = df['interest_burden'].fillna(0)
What This Does: Fills any missing (NaN) values in interest_burden with 0.
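Why This Happens: If loan_int_rate or person_income is missing, the result is NaN, and fillna(0) replaces it with 0. A person_income of exactly 0 is a different case: pandas returns inf for a nonzero amount divided by zero, and fillna(0) leaves inf untouched. Keep in mind that 0 means "no interest burden", so filled rows look like the lowest-burden applicants.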
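To see how pandas treats missing values and zero income differently, the sketch below runs the formula on a three-row toy frame with made-up values. The `replace` step is an assumption about how one might guard against zero income; it is not part of train_model.py.

```python
import numpy as np
import pandas as pd

# Toy rows: a normal case, a missing interest rate, and a zero income.
toy = pd.DataFrame({
    "loan_amnt": [10000, 5000, 8000],
    "loan_int_rate": [5.0, np.nan, 10.0],
    "person_income": [50000, 40000, 0],
})

burden = (toy["loan_amnt"] * (toy["loan_int_rate"] / 100)) / toy["person_income"]

# What the script does: only NaN is replaced, inf survives.
filled = burden.fillna(0)

# Possible guard (assumption): turn +/-inf into NaN first, then fill.
safe = burden.replace([np.inf, -np.inf], np.nan).fillna(0)

print(filled.tolist())
print(safe.tolist())
```

The second row becomes 0 in both versions, while the third row stays infinite unless the extra `replace` step is applied.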
4️⃣ Data Splitting
# 3. Split into features and target
X = df.drop('loan_status', axis=1)
y = df['loan_status']
What This Does:
- X (Features): All columns except loan_status - these are the input features
- y (Target): The loan_status column - this is what we want to predict
💡 Note About Train/Test Split
This script trains on the entire dataset (no train/test split). The comment says: "نستخدم كامل البيانات أو التقسيم، هنا للسرعة دربنا وكأننا نجهز المنتج النهائي"
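Translation: "We can use all the data or a split; here, for speed, we trained as if we were preparing the final product."
Why: For production deployment, the final model is sometimes trained on all available data so that no labeled examples go unused. The trade-off is that this script produces no held-out evaluation, so model quality has to be measured separately, for example with the train/test split used in the notebook.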
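For comparison, a held-out evaluation could look roughly like the sketch below. It uses synthetic data from `make_classification` as a stand-in for the credit dataset, so the numbers it prints say nothing about the real model; the 80/20 split and the stratification are assumptions, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in: roughly 78% class 0 and 22% class 1.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.78, 0.22], random_state=42
)

# stratify=y keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
clf.fit(X_train, y_train)

# Precision and recall per class on data the model has not seen.
report = classification_report(y_test, clf.predict(X_test))
print(report)
```

Once the evaluation looks acceptable, refitting on the full dataset, as this script does, is a common final step.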
5️⃣ Preprocessing Pipeline
We create separate preprocessing pipelines for numerical and categorical features.
Step 1: Identify Feature Types
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns
What This Does:
- numeric_features: Automatically finds all numeric columns (int64, float64)
- categorical_features: Automatically finds all text/category columns (object, category)
Why This Matters: Different feature types need different preprocessing steps.
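The sketch below shows the two selections on a toy frame. The column names and values are illustrative, and the dtypes are set explicitly so the example does not depend on pandas defaults. It also includes an int32 column to show one consequence of listing only int64 and float64: that column lands in neither group, and ColumnTransformer drops unlisted columns by default (remainder='drop').

```python
import pandas as pd

# Toy frame with explicit dtypes (illustrative columns and values).
toy = pd.DataFrame({
    "person_age": pd.Series([22, 35], dtype="int64"),
    "loan_int_rate": pd.Series([11.5, 7.9], dtype="float64"),
    "loan_grade": pd.Series(["B", "A"], dtype="object"),
    "home_ownership": pd.Series(["RENT", "OWN"], dtype="category"),
    "small_int": pd.Series([1, 2], dtype="int32"),
})

numeric_features = toy.select_dtypes(include=["int64", "float64"]).columns
categorical_features = toy.select_dtypes(include=["object", "category"]).columns

# Columns matched by neither selection would be silently left out.
unmatched = [
    col for col in toy.columns
    if col not in numeric_features and col not in categorical_features
]

print(list(numeric_features))      # ['person_age', 'loan_int_rate']
print(list(categorical_features))  # ['loan_grade', 'home_ownership']
print(unmatched)                   # ['small_int']
```

Using `include="number"` for the numeric selection is one way to catch every numeric dtype.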
Step 2: Numerical Transformer Pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
For Numerical Features:
- SimpleImputer (median): Fills missing values with the median (middle value) of that column
- StandardScaler: Standardizes each column to mean 0 and standard deviation 1 (puts everything on the same scale)
Why median instead of mean? Median is more robust to outliers than mean.
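A tiny numeric example makes the difference concrete. The sketch below imputes one missing value in a made-up column that contains a single extreme value, once with the median and once with the mean, and then runs the same two-step pipeline the script uses.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One column with a missing value and one extreme value (made-up numbers).
values = np.array([[1.0], [2.0], [3.0], [np.nan], [1000.0]])

median_filled = SimpleImputer(strategy="median").fit_transform(values)
mean_filled = SimpleImputer(strategy="mean").fit_transform(values)

print(median_filled[3, 0])  # 2.5   (median of 1, 2, 3, 1000)
print(mean_filled[3, 0])    # 251.5 (mean is pulled up by the 1000)

# Same steps as numeric_transformer: impute, then standardize.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
scaled = numeric_transformer.fit_transform(values)
```

After scaling, the column has mean 0 and standard deviation 1, but the extreme value is still far from the rest; scaling changes units, not the shape of the distribution.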
Step 3: Categorical Transformer Pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
For Categorical Features:
- SimpleImputer (most_frequent): Fills missing values with the most common value (mode)
- OneHotEncoder: Converts text categories into numbers (0s and 1s)
- handle_unknown='ignore': If a new category appears during prediction, ignore it instead of erroring
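The effect of handle_unknown='ignore' is easiest to see on a toy example. In the sketch below the encoder is fitted on three made-up categories and then asked to encode one known value and one it has never seen.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Categories seen during fitting (illustrative values).
train = np.array([["RENT"], ["OWN"], ["MORTGAGE"]], dtype=object)
encoder = OneHotEncoder(handle_unknown="ignore").fit(train)

# "OTHER" was not seen during fitting.
new_rows = np.array([["RENT"], ["OTHER"]], dtype=object)
encoded = encoder.transform(new_rows).toarray()

print(list(encoder.categories_[0]))  # ['MORTGAGE', 'OWN', 'RENT']
print(encoded.tolist())              # [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
```

The unknown category becomes a row of zeros, so the model treats it as "none of the known categories" instead of failing.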
Step 4: Combine Transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
🔧 ColumnTransformer Explained
What It Does: Applies different preprocessing to different column types:
- Numeric columns → numeric_transformer (impute + scale)
- Categorical columns → categorical_transformer (impute + encode)
Result: All features are processed and ready for the model in one step!
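To show what comes out of the combined preprocessor, the sketch below rebuilds it for a four-row toy frame. The column names and values are illustrative, and the column lists are written out by hand here instead of being detected with select_dtypes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame: two numeric columns (one with a gap) and one categorical.
toy = pd.DataFrame({
    "person_age": [22.0, 35.0, np.nan, 41.0],
    "person_income": [30000.0, 52000.0, 47000.0, 88000.0],
    "loan_grade": pd.Series(["A", "B", "A", "C"], dtype="object"),
})

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, ["person_age", "person_income"]),
    ("cat", categorical_transformer, ["loan_grade"]),
])

transformed = preprocessor.fit_transform(toy)
print(transformed.shape)  # (4, 5): 2 scaled numeric + 3 one-hot columns
```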
6️⃣ Model Training ⭐
# 5. Training
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(
        n_estimators=100,
        class_weight='balanced',
        random_state=42
    ))
])
model.fit(X, y)
🎯 Complete Pipeline
The model is a Pipeline with two steps:
- preprocessor: Handles all data preprocessing
- classifier: Random Forest model that makes predictions
Why Use Pipeline? Ensures preprocessing is applied consistently during training AND prediction.
Random Forest Parameters:
- n_estimators=100: Builds 100 decision trees. More trees usually give more stable predictions, with diminishing returns and longer training time.
- class_weight='balanced': ⭐ This is important! Automatically adjusts class weights to handle imbalanced data.
  - Our dataset has 78% Safe loans and 22% Risky loans
  - 'balanced' gives more weight to the minority class (Risky)
  - Helps the model learn to predict risky loans better
- random_state=42: Fixes the random seed, so repeated runs on the same data produce the same model
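The reproducibility claim for random_state can be checked directly. The sketch below fits two small forests with the same seed on synthetic data and compares their predicted probabilities; the data and the smaller n_estimators are chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to demonstrate seeding.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)


def fit_and_score(seed):
    """Fit a small forest with the given seed and return its probabilities."""
    clf = RandomForestClassifier(n_estimators=20, random_state=seed)
    return clf.fit(X, y).predict_proba(X)


# Two independent fits with the same seed give identical outputs.
same_seed_matches = np.array_equal(fit_and_score(42), fit_and_score(42))
print(same_seed_matches)  # True
```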
💡 class_weight='balanced' Explained
Problem: Imbalanced dataset (78% Safe, 22% Risky)
Without class_weight: Model might ignore the minority class and always predict "Safe"
With class_weight='balanced':
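- Automatically calculates each class weight as n_samples / (n_classes × class count). For a 78/22 split that is 1 / (2 × 0.78) ≈ 0.64 for Safe and 1 / (2 × 0.22) ≈ 2.27 for Risky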
- Model pays more attention to risky loans during training
- Better at detecting risky applicants (which is what we want!)
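The 'balanced' weights follow the formula n_samples / (n_classes × class count), which can be reproduced in plain Python. The sketch below uses 100 made-up labels in an exact 78/22 ratio; the real dataset's ratio is only approximately that, so its weights will differ slightly.

```python
from collections import Counter

# 78 "Safe" (0) and 22 "Risky" (1) labels, matching the quoted ratio.
labels = [0] * 78 + [1] * 22

counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)

# Same formula scikit-learn documents for class_weight='balanced'.
weights = {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

print({cls: round(w, 2) for cls, w in weights.items()})  # {0: 0.64, 1: 2.27}
```

Each Risky example therefore counts about 3.5 times as much as a Safe example while the trees are built.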
The Training Process
When model.fit(X, y) is called:
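- Preprocessor learns its imputation, scaling, and encoding parameters from the data, then transforms it
- Random Forest builds 100 decision trees, each on a bootstrap sample of the transformed data
- class_weight makes mistakes on the minority class count more while each tree is built
- At prediction time, the outputs of all trees are combined into a single prediction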
Training Time: Usually takes 1-5 minutes depending on your computer and dataset size.
7️⃣ Model Saving
# 6. Saving (this is where the problem gets solved)
# The model is saved with the Scikit-Learn version installed on your machine
joblib.dump(model, 'credit_risk_model.pkl')
print("✅ تم إعادة التدريب وحفظ ملف 'credit_risk_model.pkl' الجديد بنجاح!")
print("الآن يمكنك تشغيل app.py بدون مشاكل.")
What This Does:
- joblib.dump(): Saves the complete pipeline (preprocessor + classifier) to a file
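- File Name: credit_risk_model.pkl
- File Location: The directory the script is run from (the script's own folder if you run it there)
- Success Message: Prints confirmation in Arabic: "Retraining finished and the new 'credit_risk_model.pkl' file was saved successfully! You can now run app.py without problems."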
💾 Why Save the Complete Pipeline?
We save the entire pipeline (preprocessor + classifier), not just the model, because:
- The preprocessor is needed for every prediction
- Ensures consistent preprocessing (same as training)
- Simplifies deployment - just load and use!
- No need to remember preprocessing steps separately
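A save-and-load round trip can be sketched as follows. The example uses a small stand-in pipeline on synthetic data and a temporary file named toy_model.pkl. How app.py actually loads the model is not shown in this document, so the `joblib.load` call is an assumption about the typical pattern.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline: one preprocessing step plus a classifier.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=42)),
])
pipeline.fit(X, y)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw features and gives the same answers.
round_trip_ok = np.array_equal(pipeline.predict(X), restored.predict(X))
print(round_trip_ok)  # True
```

Because the file is a pickle, it is safest to load it with the same scikit-learn version that wrote it, which is why retraining locally resolves version mismatches.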
📊 Expected Output:
⏳ جاري تحميل البيانات وتدريب الموديل على جهازك...
✅ تم إعادة التدريب وحفظ ملف 'credit_risk_model.pkl' الجديد بنجاح!
الآن يمكنك تشغيل app.py بدون مشاكل.
🔄 Key Differences from Jupyter Notebook
1. No Train/Test Split
This script trains on the entire dataset, while the notebook uses train/test split for evaluation.
2. class_weight='balanced'
This script uses class_weight='balanced' to handle imbalanced data, which is important for production models.
3. Simpler Structure
This is a standalone script - no cells, no visualization, just training and saving.
4. Direct CSV Loading
Loads CSV directly from local file, not from Kaggle (unlike the notebook).
🚀 How to Use This Script
Step-by-Step Instructions:
- Prepare the dataset: Ensure credit_risk_dataset.csv is in the same folder as train_model.py
- Install dependencies: Make sure you have all required libraries:
  pip install pandas scikit-learn joblib
- Run the script:
  python train_model.py
- Wait for completion: The script will:
  - Load and clean the data
  - Create the interest_burden feature
  - Build the preprocessing pipeline
  - Train the Random Forest model
  - Save the model to credit_risk_model.pkl
- Use the model: The saved model can now be loaded in app.py
⚠️ Important Notes:
- The script will overwrite any existing credit_risk_model.pkl file
- Training takes several minutes depending on your computer
- Make sure you have enough RAM (the dataset is ~32,000 rows)
- After training, you can optionally run shrink_model.py to compress the file
🔄 Complete Training Workflow
- Load Data → Read CSV file
- Clean Data → Remove outliers (age > 100)
- Feature Engineering → Create interest_burden feature
- Split Features/Target → Separate X (features) and y (target)
- Identify Feature Types → Separate numeric and categorical columns
- Build Preprocessing Pipelines → Create transformers for each type
- Combine Transformers → Use ColumnTransformer
- Create Model Pipeline → Combine preprocessor + classifier
- Train Model → Fit on entire dataset
- Save Model → Export to credit_risk_model.pkl
🎯 Key Takeaways
- Standalone Script: Can be run independently to retrain the model
- Feature Engineering: Creates interest_burden to improve predictions
- Pipeline Approach: Ensures consistent preprocessing
- class_weight='balanced': Handles imbalanced dataset automatically
- Complete Pipeline: Saves preprocessor + model together
- Production Ready: Trains on full dataset for maximum performance
📁 Related Files
This script is part of the complete training workflow:
- Credit_Risk_Dataset.ipynb: Interactive notebook version with visualization
- train_model.py: Standalone training script (this file)
- shrink_model.py: Compresses the trained model
- app.py: Loads and uses the trained model
🔄 Complete Workflow:
- Train model using train_model.py → Creates credit_risk_model.pkl
- (Optional) Compress using shrink_model.py → Reduces file size
- Deploy with app.py → Loads model and makes predictions