Classification of Loan Defaulters using Machine Learning Techniques


By: Vikas Dayananda, Sharath Kumar, Varun Rao

Meet the team


Vikas Dayananda
Sharath Kumar
Varun Rao
Graduate Students, Department of Computer Science
University of North Carolina at Charlotte
Charlotte, NC USA.

Introduction

The idea behind this project is to build models on the loan data provided by Lending Club, summarizing a large amount of its customer data to give an overview of the trend of default status among customers who have taken a loan there. We build a Naive Bayes and a K-Nearest Neighbor classification model to classify customers into "Defaulted" and "Not Defaulted" status.


Motivation

Lending Club is the world's largest online marketplace connecting borrowers and investors. Lending Club helps to make credit more affordable and investing more rewarding. They operate at a lower cost than traditional bank lending programs and pass the savings on to borrowers in the form of lower rates and to investors in the form of solid returns.

Any person who wants to lend money for interest can advertise a loan at a particular interest rate. Similarly, anyone who intends to borrow can advertise for a loan at their desired interest rate. Lending Club tries to match these loan investors and loan borrowers.

The idea of being able to predict whether a customer will default on a loan, even before the loan is sanctioned, is an exciting one, and this capability is of utmost importance to any financial institution. Although the models we build might not be perfect, they will give the institution an idea of what to expect from a customer. The data we collected is from Lending Club, but the same solution can be applied to any financial service, which makes this project even more useful and important.

The direct applicability of this model to real-world problems in banking and financial firms makes this an interesting project to work on. In addition, so far we have only built models in R and Python by importing libraries, so implementing the algorithms from scratch will be a challenge.


Dataset

Dataset Collection

We used the loan data provided by Lending Club on the Kaggle website. The files we downloaded contain complete loan data for all loans issued to customers from 2007 through 2015, including the current loan status, which is our target variable. The file is in CSV format and has a size of about 1 GB; it is a matrix of about 600,000 observations and 75 variables. A data dictionary provided along with the data helped us understand the different variables involved. Since not every column is necessary for the prediction, we chose only a few of the most important columns to predict the target variable. As mentioned above, our target variable is "loan_status", which records open, currently running, and closed loans and indicates whether a person has defaulted on the loan taken. Our purpose is to build models that accurately predict this classification.
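To make this concrete, here is a minimal loading sketch in Python using pandas. The file name loan.csv is an assumption about the Kaggle download, and the column list simply mirrors the features described later under Methods (names follow the Lending Club data dictionary).

```python
import pandas as pd

# Load the Lending Club export (file name is an assumption; the Kaggle
# download is a ~1 GB CSV of ~600,000 loans and ~75 columns).
loans = pd.read_csv("loan.csv", low_memory=False)

# Keep only the columns we actually use: the target plus a handful of predictors.
columns = [
    "loan_status",      # target: records the current status of each loan
    "int_rate", "loan_amnt", "annual_inc", "open_acc",
    "revol_bal", "total_acc", "revol_util", "delinq_amnt", "dti",
    "emp_length", "grade", "home_ownership", "initial_list_status",
    "term", "verification_status",
]
loans = loans[columns]

print(loans.shape)
print(loans["loan_status"].value_counts())
```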

Dataset Preparation

Not all of the variables are used for our prediction; we used about seven variables for each model, which are explained later. All observations with missing values were deleted.

The data is divided into training data (80%) and test data (20%). The training data is used to build the model and the test data is used to validate it. A sample of the data is shown below.
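As a rough illustrative sketch of this preparation step (continuing from the loading snippet above), the missing-value removal and the 80/20 split could look like the following. Mapping the "Charged Off" and "Default" statuses to the Defaulted class is our assumption about how loan_status is binarized, not something stated in the report.

```python
import numpy as np

# `loans` is the DataFrame from the loading sketch above.
# Binarize the target: treat "Charged Off" / "Default" statuses as Defaulted
# (this particular mapping is an assumption).
defaulted_statuses = {"Charged Off", "Default"}
loans["defaulted"] = loans["loan_status"].isin(defaulted_statuses).astype(int)

# Delete all observations with missing values, as described above.
loans = loans.dropna()

# Shuffle the rows and split 80% / 20% into training and test data.
rng = np.random.default_rng(seed=42)
order = rng.permutation(len(loans))
cutoff = int(0.8 * len(loans))
train = loans.iloc[order[:cutoff]]
test = loans.iloc[order[cutoff:]]
print(len(train), len(test))
```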


Methods

The main tasks involved in this project are as follows:

  1. Data Cleaning
  2. Missing value imputations
  3. Split the data into training and test sets.
  4. Model building
  5. Train the model based on the training set.
  6. Predict loan status for the entire test set.
  7. Calculate the model accuracy.

Naive Bayes Classifier

Naive Bayes is a supervised machine learning classifier which works on the principle of Bayes' theorem. The algorithm is called "naive" because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features. We find the probability of each feature value given a class, multiply all of these probabilities together, and multiply the product by the probability of the class itself. Whichever class gives the highest probability is assigned to the corresponding data point.

Bayes' theorem relates the probability of a class C given the observed features X to quantities we can estimate from the data:

P(C | X) = P(X | C) * P(C) / P(X)

The terminology in the Bayesian method of probability is as follows: P(C | X) is the posterior probability of the class given the features, P(X | C) is the likelihood of the features given the class, P(C) is the prior probability of the class, and P(X) is the evidence, i.e. the probability of the features. With the naive independence assumption over the features x1, ..., xn, this sums up Bayes' theorem as:

P(C | x1, ..., xn) ∝ P(C) * P(x1 | C) * P(x2 | C) * ... * P(xn | C)
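As a small worked example with made-up numbers: if P(Defaulted) = 0.2, P(grade = D | Defaulted) = 0.3, and P(grade = D) = 0.15 overall, then P(Defaulted | grade = D) = (0.3 × 0.2) / 0.15 = 0.4, i.e. knowing the loan grade doubles our estimated probability of default.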

Naive Bayes Classifier - Algorithm

The Naive Bayes algorithm learns the probability that an object with certain features belongs to a particular group or class.

Training Phase:

Given a training set S with F features and L classes:
For each target value Ci (C1, C2, ..., CL):
    P(Ci) = estimate P(Ci) from the examples in S;
For every feature value xjk of each feature xj (j = 1, ..., F; k = 1, ..., N):
    P(xj = xjk | Ci) = estimate P(xjk | Ci) from the examples in S;
The output is F × L conditional probability tables, together with the class priors P(Ci).

Test Phase:

Given an unknown instance x' = (a'1, ..., a'n), look up the tables built during training and assign the label c* to x' such that:

c* = argmax over Ci of  P(Ci) * P(a'1 | Ci) * P(a'2 | Ci) * ... * P(a'n | Ci)
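The following is a minimal from-scratch sketch of both phases in Python, using a tiny made-up training set over a few of the categorical attributes listed below. The Laplace smoothing term is our addition (the pseudocode above does not mention it); it simply prevents a single unseen feature value from zeroing out the whole product.

```python
from collections import Counter, defaultdict

# Tiny illustrative training set (made-up rows) over a few categorical
# attributes; the real project uses the features described below.
train_rows = [
    ({"grade": "A", "home_ownership": "MORTGAGE", "term": "36"}, "Not Defaulted"),
    ({"grade": "B", "home_ownership": "RENT",     "term": "36"}, "Not Defaulted"),
    ({"grade": "C", "home_ownership": "OWN",      "term": "60"}, "Not Defaulted"),
    ({"grade": "D", "home_ownership": "RENT",     "term": "60"}, "Defaulted"),
    ({"grade": "D", "home_ownership": "RENT",     "term": "36"}, "Defaulted"),
]

# --- Training phase: estimate P(Ci) and P(xj = xjk | Ci) from the examples ---
class_counts = Counter(label for _, label in train_rows)
priors = {c: n / len(train_rows) for c, n in class_counts.items()}

cond_counts = defaultdict(lambda: defaultdict(Counter))  # class -> feature -> value -> count
feature_values = defaultdict(set)                        # feature -> set of observed values
for features, label in train_rows:
    for feature, value in features.items():
        cond_counts[label][feature][value] += 1
        feature_values[feature].add(value)

def conditional(feature, value, label, alpha=1.0):
    """Estimate P(feature = value | class), with Laplace smoothing (alpha)."""
    count = cond_counts[label][feature][value]
    return (count + alpha) / (class_counts[label] + alpha * len(feature_values[feature]))

# --- Test phase: assign the class c* maximising P(Ci) * prod_j P(a'j | Ci) ---
def predict(instance):
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature, value in instance.items():
            score *= conditional(feature, value, label)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict({"grade": "D", "home_ownership": "RENT", "term": "60"}))  # -> "Defaulted"
```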

Features Used from the Data Set:

We selected only the most important categorical variables as features for the Naive Bayes classifier. These features are described below:

  • emp_length: Employment length in years. Possible values are between 0 and 10, where 0 means less than one year and 10 means ten or more years.
  • grade: Loan grade (A, B, C, or D).
  • home_ownership: The home ownership status provided by the borrower during registration or obtained from the credit report. It can be RENT, OWN, MORTGAGE, or OTHER.
  • initial_list_status: The initial listing status of the loan. It can be W or F.
  • term: The number of payments on the loan. Values are in months and can be either 36 or 60.
  • verification_status: Indicates whether the borrower's income was verified by LC.

A sample of the data used for Naive Bayes classification is shown below.

k-Nearest Neighbour Classifier

The KNN algorithm is a robust and versatile classifier that is often used as a benchmark for other more complex classifiers. Despite its simplicity, KNN can outperform more powerful classifiers and is used in a variety of applications such as economic forecasting, data compression and genetics. For example, KNN was leveraged in a 2006 study of functional genomics for the assignment of genes based on their expression profiles.

KNN falls in the supervised learning family of algorithms. Informally, this means that we are given a labelled dataset consisting of training observations (x, y), where x denotes the features and y denotes the target variable we are trying to predict.

We would like to capture the relationship between x and y. More formally, our goal is to learn a function h:X→Y so that given an unseen observation x, h(x) can confidently predict the corresponding output.

The KNN classifier is also a non-parametric and instance-based learning algorithm.

  • Non-parametric means it makes no explicit assumptions about the functional form of h, avoiding the dangers of mismodeling the underlying distribution of the data. For example, suppose our data is highly non-Gaussian but the learning model we choose assumes a Gaussian form. In that case, our algorithm would make extremely poor predictions.
  • Instance-based learning means that our algorithm doesn’t explicitly learn a model. Instead, it chooses to memorize the training instances which are subsequently used as "knowledge" for the prediction phase. Concretely, this means that only when a query to our database is made (i.e. when we ask it to predict a label given an input), will the algorithm use the training instances to spit out an answer.

k-Nearest Neighbour Algorithm

  1. Handle the training data: Read in all the rows from the training set.
  2. Determine similarity: Calculate the Euclidean distance between a test instance and all the training instances. The distance is calculated to locate the 'k' most similar instances in the training set for a given member of the test set (we chose k values of 3, 4, and 5 in our project). Calculate the distances only over the independent variables, ignoring the target variable.
  3. Selecting Neighbors: After calculating the distances, select a subset (k=3,4,5) with the smallest distance values.
  4. Devise a predicted response based on the neighbors: Take the simple majority of the category of nearest neighbors as the prediction value of the query instance.
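A minimal from-scratch sketch of these four steps in Python might look like the following; the toy feature vectors (interest rate, dti) are made up for illustration, and the real numeric features we used are listed in the next subsection.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by a majority vote of its k nearest training instances."""
    # Step 2: distance from the query to every training instance
    # (independent variables only; the label train_y plays no part in the distance).
    distances = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 3: the k training instances with the smallest distances
    neighbours = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: simple majority of the neighbours' classes
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy example with made-up (int_rate, dti) pairs, purely for illustration.
train_X = [(7.5, 10.0), (11.0, 18.0), (22.0, 30.0), (19.5, 27.0), (8.0, 12.0)]
train_y = ["Not Defaulted", "Not Defaulted", "Defaulted", "Defaulted", "Not Defaulted"]
print(knn_predict(train_X, train_y, query=(21.0, 28.0), k=3))  # -> "Defaulted"
```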

Features Used from the Data Set:

We selected only numerical variables as features for the K-Nearest Neighbor classifier, since the distance calculation requires numeric inputs. We selected the features below as inputs to our K-Nearest Neighbor model:

  • int_rate: Interest rate on the loan (numeric).
  • loan_amnt: Amount of the loan taken by the customer (numeric).
  • annual_inc: Annual income of the customer (numeric).
  • open_acc: Number of open credit lines in the customer's account (numeric).
  • revol_bal: Total credit revolving balance (numeric).
  • total_acc: Total number of credit accounts under the customer's name (numeric).
  • revol_util: Amount of credit the customer is using relative to all available revolving credit (numeric).
  • delinq_amnt: Past-due amount owed on the accounts (numeric).
  • dti: Ratio of the customer's total monthly debt payments to his or her self-reported monthly income (numeric).

A sample of the data used for KNN classification is shown below.

How does it work?

Let's take a simple case to understand this algorithm. The following is a spread of red circles (RC) and green squares (GS):

You intend to find out the class of the blue star (BS). BS can either be RC or GS and nothing else. The "K" in the KNN algorithm is the number of nearest neighbours we wish to take a vote from. Let's say K = 3. Hence, we now draw a circle centred on BS just big enough to enclose only three data points on the plane. Refer to the following diagram for more details:

The three closest points to BS are all RC. Hence, with a good confidence level we can say that BS should belong to the class RC. Here the choice is very obvious, as all three votes from the closest neighbours went to RC. The choice of the parameter K is crucial in this algorithm. Next, we will look at the factors to consider when choosing the best K.

In short, a case is classified by a majority vote of its neighbors: the case is assigned to the class most common amongst its K nearest neighbors, as measured by a distance function.

We use a distance function to calculate the distance between two points; common choices are the Euclidean, Manhattan, and Minkowski distances. In this project we use the Euclidean distance, d(x, y) = sqrt(Σi (xi − yi)²).


How do you choose K?

The K in KNN is a hyper-parameter that you, as the designer, must pick in order to get the best possible fit for the data set. Intuitively, you can think of K as controlling the shape of the decision boundary.

When K is small, we are restraining the region of a given prediction and forcing our classifier to be "more blind" to the overall distribution. A small value for K provides the most flexible fit, which will have low bias but high variance. Graphically, our decision boundary will be more jagged.

On the other hand, a higher K averages more voters in each prediction and hence is more resilient to outliers. Larger values of K will have smoother decision boundaries which means lower variance but increased bias.

In general, a larger K value is more precise as it reduces the overall noise, but there is no guarantee. Cross-validation is another way to retrospectively determine a good K value, by using an independent dataset to validate each candidate K. Historically, the optimal K for most datasets has been between 3 and 10, which produces much better results than 1-NN.
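As a sketch of this selection step, one could sweep the candidate K values (we used 3, 4, and 5) and keep the one with the best accuracy on a held-out validation split; the illustrative helper below assumes the knn_predict function from the earlier k-NN sketch.

```python
def choose_k(train_X, train_y, valid_X, valid_y, candidates=(3, 4, 5)):
    """Return the candidate k with the highest accuracy on a validation split.

    Assumes knn_predict() from the k-NN sketch above; the candidate list
    mirrors the k values tried in this project.
    """
    best_k, best_acc = None, -1.0
    for k in candidates:
        predictions = [knn_predict(train_X, train_y, x, k=k) for x in valid_X]
        acc = sum(p == y for p, y in zip(predictions, valid_y)) / len(valid_y)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```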


Execution


Evaluation

Naive Bayes

k-Nearest Neighbour

Comparison

The accuracy of each model:

  • Naive Bayes: 78%
  • k-NN: 75%
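The figures above are plain classification accuracy, i.e. the fraction of test instances whose predicted label matches the true loan status. An illustrative helper for this calculation:

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)
```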


Conclusion

We successfully implemented all the tasks that we had listed at the beginning of the project under the "definitely achieve" section. We built a Naive Bayes and a k-Nearest Neighbour model to classify the loan_status in our data. Our models gave decent accuracies of around 75-80%. Naive Bayes models built using Spark MLlib also gave accuracy in a similar range, though a bit higher than our model. Although we are not entirely sure why there is a difference, we cannot generalize it and say that MLlib works better than our model for every dataset. As part of the "ideally achieve" section, we were able to build Logistic Regression using just the Python programming language; testing it on a smaller dataset, we achieved an accuracy of around 70%. This project helped us learn several new things about the Python language and the Spark architecture, and gave us a better understanding of the Naive Bayes and kNN algorithms.


References

http://dataaspirant.com/2017/02/06/naive-bayes-classifier-machine-learning/
https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
http://www.saedsayad.com/k_nearest_neighbors.htm
https://www.analyticsvidhya.com/blog/2014/10/introduction-k-neighbours-algorithm-clustering/
https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/