Tools & Dataset
Technologies, libraries, and data used to build the Smart Credit Risk System
Tools Used
Building a Machine Learning web application requires various tools and technologies. Here's what we used for this project:
Programming & Development
- Python: The main programming language. Python is well suited to Machine Learning because of its mature library ecosystem and its readable syntax.
- VS Code: Visual Studio Code is our code editor. It helps us write, debug, and organize our code with features like syntax highlighting and extensions.
Web Framework & Server
- Flask: A lightweight Python web framework. Flask makes it easy to create web applications and APIs. It's simple to learn but powerful enough for production use.
- Gunicorn: A production-ready WSGI server for Python web applications. Flask's built-in server is meant for development only, so we use Gunicorn to serve the app on production platforms like Render.
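To make the Flask and Gunicorn roles concrete, here is a minimal sketch of a prediction endpoint. The route name `/predict`, the JSON field names, and the `predict_risk` rule are assumptions made for illustration; the real app would call the trained model instead of the placeholder rule.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_risk(applicant: dict) -> int:
    """Hypothetical stand-in for the trained model.

    Returns 1 (risky) or 0 (safe) from a simple loan-to-income rule so the
    sketch runs without a model file. This is not the project's real logic.
    """
    income = float(applicant.get("income", 0))
    loan_amount = float(applicant.get("loan_amount", 0))
    if income <= 0:
        return 1
    return 1 if loan_amount / income > 0.5 else 0


@app.route("/predict", methods=["POST"])
def predict():
    # Parse the JSON body sent by the frontend (field names are assumed).
    applicant = request.get_json(force=True)
    return jsonify({"loan_status": predict_risk(applicant)})
```

Assuming the file is saved as `app.py`, it could be run locally with `flask --app app run` and served in production with `gunicorn app:app`, where the first `app` is the module name and the second is the Flask object.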
Machine Learning Libraries
- Scikit-learn: One of the most popular Machine Learning libraries in Python. It provides our Random Forest Classifier, data preprocessing tools (imputers, scalers, encoders), and pipeline functionality.
- Pandas: A powerful library for data manipulation and analysis. We use it to organize user input into DataFrames (like Excel spreadsheets) that our model can understand.
- Joblib: A library for saving and loading Python objects efficiently. We use it to save our trained model and load it in the Flask app without retraining.
- NumPy: A fundamental library for numerical computing. Scikit-learn and Pandas both depend on it for the array operations they perform behind the scenes.
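As a rough illustration of how these libraries fit together, the sketch below trains a small Random Forest, saves it with Joblib, reloads it, and predicts on one applicant supplied as a single-row DataFrame. The column names, the synthetic rows, and the file name `model.joblib` are assumptions for the example and are not taken from the project's code.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic training set; names and values are illustrative only.
train = pd.DataFrame(
    {
        "age": [22, 35, 47, 29, 51, 33],
        "income": [25000, 60000, 85000, 32000, 120000, 40000],
        "loan_amount": [15000, 8000, 10000, 20000, 5000, 18000],
        "loan_status": [1, 0, 0, 1, 0, 1],
    }
)
features = ["age", "income", "loan_amount"]

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(train[features], train["loan_status"])

# Save the fitted model once, after training.
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, model_path)

# Later (for example at web app startup), load it without retraining.
loaded = joblib.load(model_path)

# Organize one applicant's input as a single-row DataFrame.
user_input = {"age": 30, "income": 45000, "loan_amount": 12000}
row = pd.DataFrame([user_input], columns=features)
prediction = int(loaded.predict(row)[0])  # 0 or 1
```

Passing the same column list at training and prediction time keeps the feature order consistent between the two steps.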
Deployment & Version Control
- GitHub: A platform for hosting Git repositories. We use it to store our code, track changes, and collaborate. It also integrates with deployment platforms.
- Render: A cloud platform for deploying web applications. Render automatically builds and deploys our Flask app from GitHub, making it accessible to users worldwide via a URL.
Dataset
Our Smart Credit Risk System was trained on the Credit Risk Dataset from Kaggle. This dataset contains real-world loan application information and outcomes, which makes it well suited to training a credit risk prediction model.
Dataset Source
Platform: Kaggle
Dataset Name: Credit Risk Dataset
Type: Structured tabular data (CSV format)
Main Features in the Dataset
The dataset includes various features (columns) that describe loan applicants and their loan details:
- Age: The applicant's age in years
- Person Income: The applicant's annual income in dollars
- Loan Amount: The total amount of money requested for the loan
- Loan Interest Rate: The annual interest rate percentage for the loan
- Loan Intent: The purpose of the loan (e.g., "Home Improvement", "Debt Consolidation", "Medical", "Personal", "Venture", "Education")
- Employment Length: How long the person has been employed (in years)
- Home Ownership: Whether the person owns a home, rents, or has a mortgage
- Previous Default: Whether the person has defaulted on a loan before (Yes/No)
- Credit History Length: How long the person's credit history is (in years)
Target Variable
The dataset also includes the loan_status column, which tells us whether each loan applicant actually defaulted (didn't pay back) or paid back successfully. This is what we're trying to predict for new applicants!
- 0: Loan was paid back successfully (SAFE)
- 1: Loan defaulted (RISKY)
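The target encoding above can be captured in a small lookup that turns a predicted class into the label shown to the user. The names `RISK_LABELS` and `describe_loan_status` are hypothetical helpers, not part of the project's code.

```python
# Mapping taken from the target variable description: 0 = SAFE, 1 = RISKY.
RISK_LABELS = {0: "SAFE", 1: "RISKY"}


def describe_loan_status(loan_status: int) -> str:
    """Translate a loan_status value (0 or 1) into its label."""
    try:
        return RISK_LABELS[int(loan_status)]
    except KeyError:
        raise ValueError(f"Unexpected loan_status: {loan_status!r}") from None
```

Rejecting any other value makes an unexpected model output fail loudly instead of being shown as a label.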
Why This Dataset?
This dataset is ideal for our project because:
- It contains real-world financial data, making our model practical and relevant
- It has a good mix of numerical features (age, income) and categorical features (loan intent, home ownership)
- It's large enough to train a robust model but manageable in size
- It's publicly available and well-documented on Kaggle
Data Preprocessing
Before training, the data goes through preprocessing steps:
- Handling Missing Values: Some applications have missing information, which we fill in with imputers
- Scaling: Numerical features are rescaled so they share a comparable range
- Encoding: Categorical features (like "Home Improvement") are converted to numbers
- Feature Engineering: We calculate derived features like interest_burden
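The four steps above could be sketched with scikit-learn as follows. This is an illustration under stated assumptions: the column names, the median and most-frequent imputation strategies, the choice of `StandardScaler` and `OneHotEncoder`, and especially the formula for `interest_burden` (yearly interest divided by annual income) are guesses, since the project's exact definitions are not given here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def add_interest_burden(df: pd.DataFrame) -> pd.DataFrame:
    """Feature engineering step. ASSUMED definition of interest_burden:
    yearly interest (amount * rate / 100) as a fraction of annual income."""
    out = df.copy()
    out["interest_burden"] = (
        out["loan_amount"] * out["loan_int_rate"] / 100 / out["income"]
    )
    return out


# Hypothetical column names chosen for this example.
numeric_cols = ["age", "income", "loan_amount", "loan_int_rate", "interest_burden"]
categorical_cols = ["loan_intent", "home_ownership"]

numeric_steps = Pipeline(
    [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
)
categorical_steps = Pipeline(
    [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# sparse_threshold=0 forces a dense array so the result is easy to inspect.
preprocessor = ColumnTransformer(
    [("num", numeric_steps, numeric_cols), ("cat", categorical_steps, categorical_cols)],
    sparse_threshold=0,
)

# Three made-up applications, two of them with a missing value.
raw = pd.DataFrame(
    {
        "age": [25, 40, 31],
        "income": [30000, 80000, 50000],
        "loan_amount": [10000, 20000, 5000],
        "loan_int_rate": [12.0, np.nan, 8.0],
        "loan_intent": ["Medical", "Education", "Medical"],
        "home_ownership": ["RENT", "MORTGAGE", np.nan],
    }
)

X = preprocessor.fit_transform(add_interest_burden(raw))
```

Wrapping the steps in a `ColumnTransformer` means the same fitted imputers, scaler, and encoder can be reused on new applicants at prediction time.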
Project Architecture
Here's how all these tools work together:
- Data Collection: Download the Credit Risk Dataset from Kaggle
- Model Training: Use Python, Scikit-learn, and Pandas to train a Random Forest model
- Model Saving: Use Joblib to save the trained model
- Web Development: Use Flask to create the web application
- Frontend: Create an HTML/CSS interface for user interaction
- Version Control: Store code on GitHub
- Deployment: Deploy to Render using Gunicorn
- Result: Live web application accessible to users!