The idea behind this project is to build models for the loan data provided by
Lending Club that summarize a large amount of its customer data to give an
overview of the trend of default status among customers who have taken a loan there.
We build a Naive Bayes and a K-Nearest Neighbor classification model to classify the
customers into "Defaulted" and "Not Defaulted" status.
Lending Club is the world's largest online marketplace connecting borrowers and investors. Lending Club helps to make credit more affordable and investing more rewarding. They operate at a lower cost than traditional bank lending programs and pass the savings on to borrowers in the form of lower rates and to investors in the form of solid returns.
Any person who wants to lend money for interest can advertise to provide a loan at a particular interest rate. Similarly, anyone who intends to borrow can advertise to get a loan at their desired interest rate. Lending Club tries to match these loan investors and loan borrowers.
The idea of being able to predict whether a customer will default on a loan, even before the loan is sanctioned, is an exciting one, and it is of the utmost importance to any financial institution. Although the models we build may not be perfect, they will give the institution an idea of what to expect from a customer. The data we collected comes from Lending Club, but this solution can be applied to any financial service, which makes the project even more useful and important.
The direct applicability of this model to real-world problems in banking and financial firms makes this an interesting and cool project to work on. Also, so far we have only imported libraries to build models in R and Python, so implementing the algorithms from scratch will be a challenge.
We used the loan data provided by Lending Club on the Kaggle website. The files we downloaded contain complete loan data for all loans issued to customers from 2007 through 2015, including the current loan status, which is our target variable. The file is in CSV format, is about 1 GB in size, and forms a matrix of roughly 600,000 observations and 75 variables. A data dictionary was provided along with the data, which helped us understand the different variables involved. Since not every column is necessary for the prediction, we chose only a few of the most important columns to predict the target variable. As mentioned above, our target variable is "loan_status", which records the status of open, currently running, and closed loans and indicates whether a person has defaulted on the loan taken. Our purpose is to build models that accurately predict this classification.
Not all of the variables are used for our prediction; we have used about 7 variables, which are explained later. All observations with missing values have been deleted.
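As a rough illustration, the following is a minimal sketch of how this preprocessing could look in Python with pandas. The file name, the mapping of loan_status values onto a binary Defaulted label, and the exact column names are assumptions based on the Lending Club data dictionary, not our final code.

```python
import pandas as pd

# Hypothetical file name; the Kaggle download may be named or split differently.
CSV_PATH = "loan.csv"

# Columns discussed in this report (names assumed to match the LC data dictionary).
CATEGORICAL_COLS = ["emp_length", "grade", "home_ownership",
                    "initial_list_status", "term", "verification_status"]
NUMERIC_COLS = ["int_rate", "loan_amnt", "annual_inc", "open_acc", "revol_bal",
                "total_acc", "revol_util", "delinq_amnt", "dti"]
TARGET_COL = "loan_status"

# Load only the columns we need; the full file is about 1 GB.
df = pd.read_csv(CSV_PATH,
                 usecols=CATEGORICAL_COLS + NUMERIC_COLS + [TARGET_COL],
                 low_memory=False)

# Drop observations with missing values, as described above.
df = df.dropna()

# Collapse the raw loan_status values into a binary Defaulted / Not Defaulted label
# (the exact set of statuses treated as default is an assumption).
df["defaulted"] = df[TARGET_COL].isin(["Default", "Charged Off"])
```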
The data is divided into training data (80%) and test data (20%). The training data is used to build the model and the test data is used to validate it. Sample data is shown below.
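A minimal sketch of the 80/20 split, continuing from the preprocessing sketch above (the random seed and the use of a simple random mask are illustrative choices, not the project's exact procedure):

```python
import numpy as np

# Reproducible random 80/20 split of the cleaned DataFrame.
rng = np.random.default_rng(42)
mask = rng.random(len(df)) < 0.8

train_df = df[mask]   # 80%: used to build the model
test_df = df[~mask]   # 20%: held out to validate the model
```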
The main tasks involved in this project are as follows:
Naive Bayes is a supervised machine learning classifier that works on the principle of Bayes' theorem. The algorithm is called "naive" because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features. We find the probability of each feature value given a class, multiply all of these probabilities together, and then multiply the product by the prior probability of the class. The class that yields the highest probability is assigned to the corresponding data point.
Bayes' theorem summarizes this as P(c|x) = P(x|c) P(c) / P(x), where c is a class and x is the observed feature vector.
The Naive Bayes algorithm thus learns the probability that an object with certain features belongs to a particular group or class.
Training phase: given a training set S with F features and L classes,
For each target value ci (c1, c2, ..., cL):
P(ci) = estimate of P(ci) from the examples in S;
For every value xjk of each feature xj (j = 1, ..., F; k = 1, ..., N):
P(xj = xjk | ci) = estimate of P(xjk | ci) from the examples in S;
The output is a set of F x L conditional probability models.
Testing phase: given an unknown instance x' = (a'1, ..., a'F), look up the tables and assign the label c* to x', where c* = argmax over ci of P(ci) * Πj P(xj = a'j | ci).
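The sketch below is one possible from-scratch implementation of this training and look-up procedure for categorical features, with Laplace smoothing added so that unseen feature values do not zero out the product. It is illustrative only; the function and variable names are our own, and the smoothing is an assumption beyond the pseudocode above.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(X, y, smoothing=1.0):
    """Estimate P(class) and P(feature value | class) from categorical rows X
    (a list of tuples) and labels y, and return a predict(row) function."""
    n = len(y)
    class_counts = Counter(y)
    priors = {c: class_counts[c] / n for c in class_counts}

    cond_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    feature_values = defaultdict(set)    # feature index -> set of observed values
    for row, c in zip(X, y):
        for j, v in enumerate(row):
            cond_counts[(j, c)][v] += 1
            feature_values[j].add(v)

    def likelihood(j, v, c):
        # Laplace-smoothed estimate of P(x_j = v | class c)
        return ((cond_counts[(j, c)][v] + smoothing) /
                (class_counts[c] + smoothing * len(feature_values[j])))

    def predict(row):
        # Assign the class maximising log P(c) + sum_j log P(x_j = a'_j | c)
        best_class, best_score = None, float("-inf")
        for c in priors:
            score = math.log(priors[c])
            for j, v in enumerate(row):
                score += math.log(likelihood(j, v, c))
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    return predict
```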
We selected only important variables as features for Naive Bayes classifier. Those features are described below:
Attributes | Description |
---|---|
emp_length | Employment length in years. Possible values are between 0 and 10 where 0 means less than a year and 10 means 10 or more years |
grade | Loan grade: A,B,C,D |
home_ownership | The home ownership status provided by the borrower during registration or obtained from the credit report. It can be: RENT, OWN, MORTGAGE, OTHER |
initial_list_status | The initial listing status of the loan. It can be: W or F |
term | The number of payments on the loan. Values are in months and can be either 36 or 60. |
verification_status | Indicates whether the income was verified by LC or not. |
A sample of the data used for Naive Bayes classification is shown below.
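As a usage note, the six attributes in the table above could be fed to the Naive Bayes sketch roughly as follows (again assuming the column names and the train_df / defaulted variables from the earlier sketches):

```python
# Build (features, label) pairs from the six categorical attributes above.
nb_cols = ["emp_length", "grade", "home_ownership",
           "initial_list_status", "term", "verification_status"]

train_rows = [tuple(r) for r in train_df[nb_cols].astype(str).to_numpy()]
train_labels = list(train_df["defaulted"])

predict = train_naive_bayes(train_rows, train_labels)
```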
The KNN algorithm is a robust and versatile classifier that is often used as a benchmark for other more complex classifiers. Despite its simplicity, KNN can outperform more powerful classifiers and is used in a variety of applications such as economic forecasting, data compression and genetics. For example, KNN was leveraged in a 2006 study of functional genomics for the assignment of genes based on their expression profiles.
KNN falls in the supervised learning family of algorithms. Informally, this means that we are given a labelled dataset consisting of training observations (x, y), where x denotes the feature vector and y denotes the target variable we are trying to predict.
We would like to capture the relationship between x and y. More formally, our goal is to learn a function h:X→Y so that given an unseen observation x, h(x) can confidently predict the corresponding output.
The KNN classifier is also a non-parametric, instance-based learning algorithm.
We selected only numerical variables as features for the K-Nearest Neighbor classifier, since it handles numerical variables well. We used the following features as inputs to our K-Nearest Neighbor model:
Features | Description |
---|---|
Int_rate | Interest Rate on the loan (Numeric) |
Loan_amt | Amount of loan taken by the customer(Numeric) |
Annual_inc | Annual income of the customer(Numeric) |
Open_acc | Open credit lines in the customer's account(Numeric) |
Revol_bal | Total revolving credit balance |
Total_acc | Total number of credit accounts under the customer's name |
Revol_util | Amount of credit the customer is using relative to all available revolving credit |
Delinq_amnt | Past-due amount owed on the accounts |
Dti | Ratio of the customer's total monthly debt payments divided by his/her self-reported monthly income |
A sample of the data used for KNN classification is shown below.
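For illustration, the nine numerical features above could be pulled out of the train/test DataFrames from the earlier sketches like this (column names are assumed; any further preprocessing is omitted):

```python
# Extract the numeric feature rows and labels used by the KNN sketch below.
knn_cols = ["int_rate", "loan_amnt", "annual_inc", "open_acc", "revol_bal",
            "total_acc", "revol_util", "delinq_amnt", "dti"]

train_X = train_df[knn_cols].to_numpy(dtype=float).tolist()
train_y = list(train_df["defaulted"])
test_X = test_df[knn_cols].to_numpy(dtype=float).tolist()
test_y = list(test_df["defaulted"])
```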
Let's take a simple case to understand this algorithm. Following is a spread of red circles (RC) and green squares (GS) :
You intend to find out the class of the blue star (BS). BS can either be RC or GS and nothing else. The "K" in the KNN algorithm is the number of nearest neighbors we wish to take a vote from. Let's say K = 3. We now draw a circle centered on BS just big enough to enclose only three data points on the plane. Refer to the following diagram for more details:
The three closest points to BS are all RC. Hence, with a good confidence level, we can say that BS should belong to the class RC. Here the choice is obvious, as all three votes from the closest neighbors went to RC. The choice of the parameter K is crucial in this algorithm; next we will look at the factors to consider when choosing the best K.
In short, a case is classified by a majority vote of its neighbors, with the case being assigned to the class most common among its K nearest neighbors, as measured by a distance function.
We use a distance function, such as the Euclidean, Manhattan, or Minkowski distance, to calculate the distance between two points.
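A minimal from-scratch sketch of this distance-plus-majority-vote step, using Euclidean distance (the function names are our own and this is not the project's exact code):

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line (Euclidean) distance between two numeric feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by a majority vote of its k nearest training points."""
    distances = [(euclidean(row, query), label)
                 for row, label in zip(train_X, train_y)]
    neighbours = sorted(distances, key=lambda d: d[0])[:k]   # k closest points
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```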
The K in KNN is a hyper-parameter that you, as a designer, must pick in order to get the best possible fit for the data set. Intuitively, you can think of K as controlling the shape of the decision boundary we talked about earlier.
When K is small, we are restraining the region of a given prediction and forcing our classifier to be "more blind" to the overall distribution. A small value for K provides the most flexible fit, which will have low bias but high variance. Graphically, our decision boundary will be more jagged.
On the other hand, a higher K averages more voters in each prediction and hence is more resilient to outliers. Larger values of K will have smoother decision boundaries which means lower variance but increased bias.
In general, a larger K value is more precise as it reduces the overall noise, but there is no guarantee. Cross-validation is another way to retrospectively determine a good K value by using an independent dataset to validate it. Historically, the optimal K for most datasets has been between 3 and 10, which produces much better results than 1-NN.
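As an illustration of picking K, the sketch below holds out a slice of the training data and compares candidate values of K in the 3-10 range, reusing the knn_predict and train_X/train_y names from the previous sketches; a full k-fold cross-validation would be a straightforward extension.

```python
# Hold out a slice of the training data to compare candidate K values.
split = int(0.8 * len(train_X))
fit_X, fit_y = train_X[:split], train_y[:split]
val_X, val_y = train_X[split:], train_y[split:]

def holdout_accuracy(k):
    # Fraction of held-out points the k-NN vote classifies correctly.
    # (This pure-Python loop is slow on the full dataset; subsample first.)
    correct = sum(knn_predict(fit_X, fit_y, row, k) == label
                  for row, label in zip(val_X, val_y))
    return correct / len(val_y)

best_k = max(range(3, 11), key=holdout_accuracy)
print("best K:", best_k)
```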
We successfully implemented all the tasks that we had mentioned at the beginning of the project under the "definitely achieve" section. We built a Naive Bayes and a k-Nearest Neighbor model to classify the loan_status in our data. Our models gave decent accuracies of around 75-80%. The Naive Bayes model built using MLlib also gave accuracy in a similar range, but a bit higher than our model. Although we are not entirely sure why there is a difference, we cannot generalize and say that MLlib works better than our model for every dataset. As part of the "ideally achieve" section, we were also able to build Logistic Regression using just the Python programming language. When we tested it on a smaller dataset, we achieved an accuracy of around 70%. This project helped us learn several new things about the Python language and the Spark architecture, and helped us gain a better understanding of the Naive Bayes and kNN algorithms.
http://dataaspirant.com/2017/02/06/naive-bayes-classifier-machine-learning/
https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
http://www.saedsayad.com/k_nearest_neighbors.htm
https://www.analyticsvidhya.com/blog/2014/10/introduction-k-neighbours-algorithm-clustering/
https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/