Tools & Dataset - Smart Credit Risk System

Tools Used

Building a Machine Learning web application requires various tools and technologies. Here's what we used for this project:

Programming & Development

Python: The main programming language. Python suits Machine Learning work because of its mature library ecosystem and readable syntax.
VS Code: Visual Studio Code is our code editor. It helps us write, debug, and organize our code with features like syntax highlighting and extensions.

Web Framework & Server

Flask: A lightweight Python web framework for building web applications and APIs. It's simple to learn but capable enough for production use.
Gunicorn: A production-grade WSGI HTTP server for Python applications. Flask's built-in server is intended for development only, so we use Gunicorn to serve the app on production platforms like Render.
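
To make the split between Flask and Gunicorn concrete, here is a minimal sketch. It assumes the code lives in a file named app.py and exposes a hypothetical /health route; neither name comes from the project itself.

```python
from flask import Flask, jsonify

# Gunicorn looks up this object by name, so it lives at module level.
app = Flask(__name__)


@app.route("/health")
def health():
    """Hypothetical route that only confirms the server is responding."""
    return jsonify(status="ok")


# Development (Flask's built-in server):  flask --app app run
# Production (Gunicorn, e.g. on Render):  gunicorn app:app
# In "app:app" the first "app" is the module (app.py, an assumed file name)
# and the second is the Flask object defined above.
```

The application code is the same in both cases; only the command that starts the server changes.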

Machine Learning Libraries

Scikit-learn: One of the most popular Machine Learning libraries in Python. It provides our Random Forest Classifier, data preprocessing tools (imputers, scalers, encoders), and pipeline functionality.
Pandas: A library for data manipulation and analysis. We use it to organize user input into DataFrames (tables similar to spreadsheet sheets) that our model can read.
Joblib: A library for saving and loading Python objects efficiently. We use it to save our trained model and load it in the Flask app without retraining.
NumPy: A fundamental library for numerical computing. It's a dependency of scikit-learn and pandas, providing the array operations they build on.
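
The following sketch shows how these libraries might fit together: train a model, save it with Joblib, load it again, and predict from a one-row DataFrame built from user input. The column names (person_age, person_income), the four training rows, and the file name are made up for illustration and are not the project's actual data or settings.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up training rows; the real project trains on the Kaggle dataset.
X = pd.DataFrame(
    {
        "person_age": [22, 35, 47, 29],
        "person_income": [30000, 82000, 56000, 41000],
    }
)
y = [1, 0, 0, 1]  # 1 = defaulted, 0 = paid back

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X, y)

# Save the trained model once (hypothetical file name).
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, model_path)

# Later, e.g. when the web app starts: load the model instead of retraining.
loaded = joblib.load(model_path)

# One applicant's form input becomes a one-row DataFrame whose column
# names match the ones used during training.
user_input = {"person_age": 31, "person_income": 45000}
row = pd.DataFrame([user_input])
prediction = loaded.predict(row)
print(int(prediction[0]))
```

Keeping the column names identical between training and prediction matters, because scikit-learn checks feature names when a model was fitted on a DataFrame.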

Deployment & Version Control

GitHub: A platform for hosting Git repositories. We use it to store the project's code, track changes, and collaborate. It also integrates with deployment platforms.
Render: A cloud platform for deploying web applications. Render automatically builds and deploys our Flask app from GitHub, making it accessible to users via a public URL.

Technology Stack Summary

Python, Flask, Scikit-learn, Pandas, Joblib, Gunicorn, VS Code, GitHub, Render

Dataset

Our Smart Credit Risk System was trained on the Credit Risk Dataset from Kaggle. This dataset contains real-world loan application information together with the outcome of each loan, which makes it well suited to training a credit risk prediction model.

Dataset Source

Platform: Kaggle
Dataset Name: Credit Risk Dataset
Type: Structured tabular data (CSV format)

Main Features in the Dataset

The dataset includes various features (columns) that describe loan applicants and their loan details:

Age: The applicant's age in years
Person Income: The applicant's annual income in dollars
Loan Amount: The total amount of money requested for the loan
Loan Interest Rate: The annual interest rate percentage for the loan
Loan Intent: The purpose of the loan (e.g., "Home Improvement", "Debt Consolidation", "Medical", "Personal", "Venture", "Education")
Employment Length: How long the person has been employed (in years)
Home Ownership: Whether the person owns a home, rents, or has a mortgage
Previous Default: Whether the person has defaulted on a loan before (Yes/No)
Credit History Length: How long the person's credit history is (in years)

Target Variable

The dataset also includes the loan_status column, which tells us whether each loan applicant actually defaulted (didn't pay back) or paid back successfully. This is what we're trying to predict for new applicants!

0: Loan was paid back successfully (SAFE)
1: Loan defaulted (RISKY)
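
As a small illustration, the two target values can be translated into the labels shown to the user. The helper name describe_prediction is hypothetical; only the 0/1 meaning comes from the dataset description above.

```python
# Meaning of loan_status values, as described above.
RISK_LABELS = {0: "SAFE", 1: "RISKY"}


def describe_prediction(loan_status: int) -> str:
    """Return the display label for a predicted loan_status value."""
    if loan_status not in RISK_LABELS:
        raise ValueError(f"unexpected loan_status: {loan_status!r}")
    return RISK_LABELS[loan_status]


print(describe_prediction(0))  # SAFE
print(describe_prediction(1))  # RISKY
```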

Why This Dataset?

This dataset is ideal for our project because:

It contains real-world financial data, making our model practical and relevant
It has a good mix of numerical features (age, income) and categorical features (loan intent, home ownership)
It's large enough to train a robust model but manageable in size
It's publicly available and well-documented on Kaggle

Data Preprocessing

Before training, the data goes through preprocessing steps:

Handling Missing Values: Some applications might have missing information, which we fill in
Scaling: Numerical features are rescaled so they share a comparable range
Encoding: Categorical values (like "Home Improvement" for loan intent) are converted to numbers
Feature Engineering: We calculate derived features like interest_burden
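
The four steps above can be sketched with scikit-learn's preprocessing tools. This is an illustration under assumptions, not the project's actual pipeline: the column names and sample rows are invented, the imputation strategies are one common choice, and the interest_burden formula (yearly interest divided by income) is a guess, since its definition is not given here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented sample rows with one missing interest rate.
df = pd.DataFrame(
    {
        "person_income": [30000.0, 82000.0, 56000.0, 41000.0],
        "loan_amnt": [5000.0, 12000.0, 8000.0, 3000.0],
        "loan_int_rate": [11.5, np.nan, 7.9, 13.2],
        "loan_intent": ["MEDICAL", "EDUCATION", "MEDICAL", "VENTURE"],
    }
)

# Feature engineering. ASSUMPTION: this formula is a guess at what
# interest_burden could mean, not the project's definition.
df["interest_burden"] = (
    df["loan_amnt"] * df["loan_int_rate"] / 100 / df["person_income"]
)

numeric_cols = ["person_income", "loan_amnt", "loan_int_rate", "interest_burden"]
categorical_cols = ["loan_intent"]

# Numeric columns: fill missing values with the median, then standardize.
numeric_steps = Pipeline(
    [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
)
# Categorical columns: fill missing values with the most frequent category,
# then one-hot encode; categories unseen during fitting become all zeros.
categorical_steps = Pipeline(
    [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    [("num", numeric_steps, numeric_cols), ("cat", categorical_steps, categorical_cols)],
    sparse_threshold=0,  # always return a dense NumPy array
)

features = preprocessor.fit_transform(df)
print(features.shape)
```

Wrapping the steps in a ColumnTransformer means the same transformations fitted on the training data can be reapplied unchanged to a new applicant's row.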

Project Architecture

Here's how all these tools work together:

Data Collection: Download the Credit Risk Dataset from Kaggle
Model Training: Use Python, Scikit-learn, and Pandas to train a Random Forest model
Model Saving: Use Joblib to save the trained model
Web Development: Use Flask to create the web application
Frontend: Create an HTML/CSS interface for user interaction
Version Control: Store code on GitHub
Deployment: Deploy to Render using Gunicorn
Result: Live web application accessible to users!
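
To show how the training, saving, and web steps connect, here is a compact sketch in a single script. Everything specific in it is an assumption made for illustration: the three feature names, the toy training rows, the file name credit_model.joblib, and the /predict route with its JSON format. In a real project, training would normally live in a separate script from the Flask app.

```python
import os
import tempfile

import joblib
import pandas as pd
from flask import Flask, jsonify, request
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["person_age", "person_income", "loan_amnt"]  # hypothetical subset

# Model training and saving (toy rows stand in for the Kaggle data).
train = pd.DataFrame(
    [[22, 30000, 5000], [35, 82000, 12000], [47, 56000, 8000], [29, 41000, 15000]],
    columns=FEATURES,
)
labels = [1, 0, 0, 1]  # 1 = defaulted, 0 = paid back
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("forest", RandomForestClassifier(n_estimators=10, random_state=0)),
    ]
)
pipeline.fit(train, labels)
MODEL_PATH = os.path.join(tempfile.mkdtemp(), "credit_model.joblib")
joblib.dump(pipeline, MODEL_PATH)

# Web application: load the saved model once at startup.
app = Flask(__name__)
model = joblib.load(MODEL_PATH)


@app.route("/predict", methods=["POST"])
def predict():
    """Read applicant fields from JSON and return the predicted loan_status."""
    payload = request.get_json(silent=True) or {}
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        return jsonify(error=f"missing fields: {missing}"), 400
    row = pd.DataFrame([{name: payload[name] for name in FEATURES}])
    loan_status = int(model.predict(row)[0])
    label = "RISKY" if loan_status == 1 else "SAFE"
    return jsonify(loan_status=loan_status, label=label)


# Deployment: with this file saved as app.py (assumed name), Render would
# start the server with a command such as:  gunicorn app:app
```

A client such as the HTML form's backend handler could then POST {"person_age": 30, "person_income": 50000, "loan_amnt": 7000} to /predict and receive the label in the JSON response.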