DATA MINING ON LOAN APPROVED

DATASET FOR PREDICTING

DEFAULTERS

By Ashish Pandit

PROJECT REPORT RELEASE PERMISSION FORM

Rochester Institute of Technology
B. Thomas Golisano College of Computing and Information Sciences
TITLE: Data Mining on Loan Approved Dataset for Predicting Defaulters

I, Ashish Pandit, hereby grant permission to the Wallace Memorial Library to reproduce my project in whole or in part.



Date

The project "Data Mining on Loan Approved Dataset for Predicting Defaulters" by Ashish Pandit has been examined and approved by the following Examination Committee:


Dr. Carol Romanowski Associate Professor Project Committee Chair



ACKNOWLEDGEMENT

I would like to thank Dr. Carol Romanowski for giving me the opportunity to do my capstone project under her guidance. I am extremely thankful to her for giving me invaluable inputs and ideas, solving my doubts whenever I had any, giving me timely feedback after the completion of every milestone, and helping me throughout the semester to complete this project successfully.

I would also like to thank Dr. Joe Geigel, my colloquium guide, for explaining to me how the report should be written and for giving me valuable feedback after every milestone presentation and class poster presentation.

ABSTRACT

In today's world, taking loans from financial institutions has become a very common phenomenon. Every day, a large number of people apply for loans, for a variety of purposes. But not all of these applicants are reliable, and not everyone can be approved. Every year, we read about a number of cases where people do not repay the bulk of the loan amount to the banks, due to which the banks suffer huge losses. The risk associated with making a decision on loan approval is immense. So the idea of this project is to gather loan data from multiple data sources and use data mining algorithms on this data to extract important information and predict whether a customer would be able to repay the loan; in other words, to predict whether the customer would be a defaulter or not.

Table of Contents

1. INTRODUCTION

   1.1 BACKGROUND AND PROBLEM STATEMENT
   1.2 GOAL OF THE PROJECT
   1.3 WORKFLOW OF PROJECT
   1.4 RELATED WORK
   1.5 HYPOTHESIS

2. PREPARING THE DATASET

   2.1 DATA GATHERING
   2.2 DATA PREPARATION AND CLEANING

3. DATA MINING USING CLASSIFICATION ALGORITHMS ON MERGED DATASET

   3.1 HYBRID NAÏVE BAYES DECISION TREE ALGORITHM
   3.2 NAÏVE BAYES ALGORITHM
   3.3 DECISION TREE ALGORITHM
   3.4 BOOSTING ALGORITHM
   3.5 BAGGING ALGORITHM
   3.6 ARTIFICIAL NEURAL NETWORK ALGORITHM

4. ANALYZING SINGLE DATASET USING CLASSIFICATION ALGORITHMS

   4.1 NAÏVE BAYES ALGORITHM
   4.2 DECISION TREE ALGORITHM
   4.3 BOOSTING ALGORITHM
   4.4 BAGGING ALGORITHM

5. RESULTS AND ANALYSIS

   5.1 COMPARISON OF RESULTS
   5.2 COST SENSITIVE LEARNING

6. CONCLUSION AND FUTURE WORK

   6.1 CONCLUSION
   6.2 FUTURE WORK

7. REFERENCES

1. INTRODUCTION

1.1 BACKGROUND AND PROBLEM STATEMENT

Importance of loans in our day-to-day life has increased to a great extent. People are becoming more and more dependent on acquiring loans, be it an education loan, housing loan, car loan or business loan, from financial institutions like banks and credit unions. However, it is no longer surprising to see that some people are not able to properly gauge the amount of loan that they can afford. In some cases, people undergo a sudden financial crisis, while some try to scam money out of the banks. The consequences of such scenarios are late payments, missed payments, defaulting or, in the worst-case scenario, not being able to pay back those bulk amounts to the banks.

Assessing the risk involved in a loan application is one of the most important concerns of the banks, both for survival in the highly competitive market and for profitability. Banks receive a large number of loan applications from their customers and other people on a daily basis. Not everyone gets approved. Most banks use their own credit scoring and risk assessment techniques in order to analyze loan applications and make credit approval decisions. In spite of this, there are many cases every year where people do not repay the loan amount or default, due to which these financial institutions suffer huge losses. In this project, data mining algorithms will be used to study loan-approved data and extract patterns, which will help in predicting the likely defaulters, thereby helping the banks make better decisions in the future. Multiple datasets from different sources will be combined to form a generalized dataset, and then different machine learning algorithms will be applied to extract patterns and obtain results with maximum accuracy.

1.2 GOAL OF THE PROJECT

The primary goal of this project is to extract patterns from a common loan approved dataset, and then build a model based on these extracted patterns, in order to predict the likely loan defaulters by using classification data mining algorithms. The historical data of the customers like their age, income, loan amount, employment length etc. will be used in order to do the analysis. Later on, some analysis will also be done to find the most relevant attributes, i.e., the factors that affect the prediction result the most.

1.3 WORKFLOW OF PROJECT

The diagram below shows the workflow of this project.

Workflow Diagram: data from three sources (raw data) → data preprocessing and cleaning → split into training data and test data → classification algorithms → model → classifies instances as defaulter or not defaulter.

1.4 RELATED WORK

A lot of work has been done with regard to extracting important information that can be useful for financial institutions. The aim of this project was to gather loan information from multiple sources and apply different classification algorithms that could give the best prediction results. I have referred to the work listed below in order to do my analysis.

Jiang and Li [1] propose a method to improve the prediction results obtained by using the Naïve Bayes and Decision Tree algorithms separately. They have tried this hybrid method on 36 UCI datasets and compared the results with those of the individual algorithms. In this project, I have used this hybrid method on the Loan Approved dataset obtained by merging three data sources.

The paper by Tiwari and Prakash [2] implements ensemble methods (bagging, boosting and blending) on the SONAR dataset and compares the prediction accuracy with individual algorithms like Naïve Bayes, Decision Tree etc. In this project, I have used the Boosting and Bagging ensemble methods on the Loan Approved dataset and the single Lending Club dataset.

The paper by Atiya [4] explains the implementation of Artificial Neural Networks on the Bank dataset for predicting Bankruptcy. In this project, I have used Single Layered and Multi Layered Neural Network methods on the Loan Approved dataset.

1.5 HYPOTHESIS

My hypothesis was that the Hybrid Naïve Bayes Decision Tree, Boosting and Bagging classification algorithms would give better prediction accuracy than the individual algorithms, and I wanted to compare their results against those individual algorithms. I also wanted to apply various classification algorithms to both the merged dataset and an individual dataset in order to compare the results obtained on the two datasets.

2. PREPARING THE DATASET

2.1 GATHERING DATA

In the first step of accumulating information, data from previously approved loan datasets from three different sources are gathered together. These datasets are merged to form a common dataset, on which analysis will be done. Table 1 shows details of the datasets:

Table 1: Dataset details
Dataset Name No of attributes No of instances Data Format
Lending Club Loan Data 55 5000 .csv
UCI German Data 20 1000 .csv
ROC Data 11 100 .sav

2.2 DATA PREPARATION AND CLEANING

Tools used for data cleaning

1) Google Refine
2) Weka
3) R (For converting .sav data to .csv format)

One of the most important tasks in preparing a common dataset is to decide which of the attributes can be used from these three tables, since all of them have different numbers of attributes, and the attributes are in different forms.

Nine attributes were selected for preparing the new dataset: (a) age of the loan applicant, (b) job profile [less, moderately, highly skilled], (c) annual income, (d) employment length, (e) loan amount, (f) loan duration, (g) purpose of the loan, (h) housing [rent, own] and (i) loan history [Defaulter, Not Defaulter], which is the class attribute.

These attributes were common either to all three datasets or to at least two of them. All the other attributes were logically eliminated from each of the datasets.

Tables 2, 3 and 4 show how the selected attributes look in each of the tables before merging

Table 2: Lending Club Data

Attribute | Type | Example Values
age | Missing |
Job profile | Missing |
Income | Numeric | 5000, 80000
Emp Length | Nominal | <1, 5, 10+
Loan Amount | Numeric | 1200, 3500, 11000
Duration | Numeric | 36, 60 (in months)
Purpose | Nominal | Car loan, House Loan, Business Loan etc.
Housing | Nominal | Rent, Own
Loan History | Nominal | Defaulter, Not Defaulter
Table 3: UCI German Data

Attribute | Type | Example Values
age | Numeric | 33, 50, 46
Job profile | Nominal | Less, Moderately, Highly skilled
Income | Missing |
Emp Length | Nominal | <1, 1 to 4, 4 to 7, 7+
Loan Amount | Numeric | 1200, 3500, 11000
Duration | Numeric | 12, 24, 48 (in months)
Purpose | Nominal | Car loan, House Loan, Business Loan etc.
Housing | Nominal | Rent, Own, Free
Loan History | Nominal | Defaulter, Not Defaulter
Table 4: ROC Data

Attribute | Type | Example Values
age | Numeric | 33, 50, 46
Job profile | Nominal | Less, Moderately, Highly skilled
Income | Numeric | 5000, 80000
Emp Length | Nominal | 1, 8, 15
Loan Amount | Numeric | 1200, 3500, 11000
Duration | Missing |
Purpose | Missing |
Housing | Missing |
Loan History | Nominal | Defaulter, Not Defaulter
The dataset obtained by merging these three datasets was raw and it needed a lot of cleaning.
Tackling Data Cleaning tasks
1) Age attribute:
As we can see from the above tables, the Lending Club dataset does not have any information regarding the age of the loan applicant, so all of the 5000 values for that attribute are unknown. Ideally one would have removed the entire attribute, but age might be an important factor in determining whether the applicant could be a defaulter or not. So some logical assumptions were made to fill these missing age values, based on the age and employment length values of the other two tables. A person who has more than 7 years of experience is very likely to be in his mid-30s, whereas a person with 3-4 years of experience is likely to be in his mid or late twenties. A density-based clustering algorithm has therefore been used in order to find a relation or pattern between the two attributes. This algorithm divides the age and employment length values of the other two datasets into four clusters, as shown in Figure 1.1.

Figure 1.1: Cluster Division

The number of instances in each of the clusters is shown in Figure 1.2:

Figure 1.2: Clustered instances

Cluster 1 and Cluster 2 have a major difference in their average age values, although the employment length centroid is the same in both (1 <= X < 4). However, the number of instances in cluster 1 is very small (9 percent) compared to 43 percent in cluster 2, so cluster 1 will not be taken into consideration.

So based on these three cluster groupings, three categories for the values of age are made : (1) <=27 (2) 28 <= X <=37 (3) >=38.

The age values in the UCI and ROC datasets are numeric by default. Both datasets were therefore combined, and the text facet feature of Google Refine was used to group all identical age values together. The attribute has 53 distinct numeric age values, as shown in Figure 1.3. Each of these 53 age values has been put into its respective category.

The tricky part here is not knowing whether the job that the loan applicant currently holds is his first job or, say, his fifth job. So I will also perform the analysis without the age attribute and see how that affects the prediction accuracy compared to when age is taken into consideration. The age groups obtained after applying the text facet feature to the age attribute are shown in Figure 1.3.

Figure 1.3: text facet on age to group age values in dataset

(2) Employment Length

The employment length values in both the Lending Club and ROC datasets are numeric, but in the UCI German dataset the employment lengths are in four bins, i.e., less than 1 year, 1 to 4 years, 4 to 7 years and more than 7 years. These four bins will be used for the entire merged dataset, and the numeric values in the other two datasets will be put into their respective bins. The text facet function of Google Refine is used to get groups of all distinct employment length values and then put these groups into their respective bins, as sketched below.
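A minimal sketch of this binning rule (the exact boundary handling, e.g. exactly 4 or 7 years, is an assumption, since the report does not state it):

```java
// Hypothetical helper mirroring the binning rule described above: numeric
// employment-length values (Lending Club, ROC) are mapped onto the four
// nominal bins used by the UCI German dataset.
public class EmploymentLengthBinner {
    static String bin(double years) {
        if (years < 1) return "<1";
        if (years < 4) return "1 to 4";
        if (years < 7) return "4 to 7";
        return "7+";
    }

    public static void main(String[] args) {
        System.out.println(bin(5));   // prints "4 to 7"
        System.out.println(bin(10));  // prints "7+"
    }
}
```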

(3) Housing

This is a nominal attribute and has four possible values, i.e., rent, own, free and other, as shown in Table 5.

Table 5: Housing attribute categories
Housing Frequency (Number of Values)
rent 4475
own 1405
free 108
Other 30
Missing Values 102

Since the number of instances of free (108 out of 6100, or 1.7%) and other (30 out of 6100, or 0.5%) is very small, these two will be merged into one value, Other, which will then have 138 instances. Housing cells with the value rent account for 74% of the values in the dataset, so all 102 missing values of the housing attribute will be filled with the mode, i.e., the most frequent value, rent. The housing attribute would then have only three values: rent (4575 values), own (1405 values) and other (138 values).

(4) Loan Purpose

This is a nominal attribute. The loan purpose attribute in the Lending Club dataset had 12 different values, whereas in the UCI dataset it had 7 different values. All the values for loan purpose were unknown in the ROC dataset.

When the datasets are combined, only the following values have a considerable number of instances: car (524 values), credit card (610 values), debt consolidation (2450 values), house loan or home improvement (927 values) and other (620 values). The probability of occurrence of all the other values is very low. So a new category value named other/unknown is introduced, which includes all the remaining category instances as well as the instances of the existing other category. All the missing values of loan purpose are also filled with other/unknown.

So finally, loan purpose will have only five category values: 1) car, 2) credit card, 3) debt consolidation, 4) house loan/home improvement, 5) other/unknown.

Various categories of purpose attribute in UCI German Dataset are shown in Figure 1.4.

Figure 1.4: purpose attribute in UCI German Dataset

Various categories of Purpose attribute in Lending Club Dataset are shown in Figure 1.5.

Figure 1.5: Purpose attribute in Lending Club Dataset

(5) Job Profile and Income:

Job profile, income and employment length are three attributes that are correlated. Generally, the income of a highly skilled employee with less experience would be higher than that of a less skilled employee with relatively more experience. This relation will be used in order to fill in the missing values of job profile and income.

The Lending Club dataset has all job profile values unknown. In this case, the income and employment length values of the Lending Club and ROC tables are used to extract a pattern. The UCI German dataset has not been used here since all of its income values are unknown. The K-means clustering algorithm is used to find a pattern between the attributes; it divides the attribute values into three clusters, as shown in Figure 1.6.

Figure 1.6: Cluster Division

All the unknown cells of the Job profile column are filled with cluster 0, cluster 1 or cluster 2 according to the model created by the K-means clustering algorithm, as sketched below.
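A sketch of this K-means step using WEKA's SimpleKMeans (the file name and attribute handling are placeholders, not the exact preprocessing used in the report):

```java
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Cluster income and employment length into three groups; the cluster labels are
// later renamed to highly / moderately / less skilled based on the centroids.
public class JobProfileClustering {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("income_emplength.arff"); // hypothetical file
        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(3);               // three job-profile clusters
        kmeans.setPreserveInstancesOrder(true); // so assignments line up with rows
        kmeans.buildClusterer(data);

        int[] assignments = kmeans.getAssignments(); // cluster index per row
        System.out.println(kmeans);                  // prints centroids and cluster sizes
        System.out.println("First row assigned to cluster " + assignments[0]);
    }
}
```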

The values obtained for the Job_profile attribute by using K-means clustering are as shown in Figure 1.7.

Figure 1.7: K-Means Output (the Cluster column will be renamed job_profile)

Based on the cluster centroids, cluster 0, which has a mean annual income of $77,928 and employment length of 7+ years, has been renamed highly skilled; cluster 1, which has a mean annual income of $47,230 and employment length of 1 to 4 years, has been renamed less skilled; and cluster 2, which has a mean annual income of $59,420 and employment length of 4 to 7 years, has been renamed moderately skilled. The cluster 0, 1 and 2 instances in the column are replaced by highly, less and moderately skilled respectively.

Now, the relationship derived from the clustering model is used to fill in all the unknown income values. For example, consider filling an income value whose corresponding employment length is 4 to 7 years and job profile is moderately skilled. All the instances having 4 to 7 years and moderately skilled are gathered using the text facet function of Google Refine, and the mean of their income values is used to fill in those cells. There are 1503 such matching instances; the arithmetic mean of their income values is calculated and inserted into all the blank income cells having 4 to 7 years and moderately skilled.

Similarly, all the other missing values of the income attribute are filled by taking the mean of the income values of the rows with the matching job profile and employment length, as in the sketch below.
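An illustrative sketch of this group-mean imputation (the Row type and field names are hypothetical; the report performed this step with Google Refine's text facets rather than code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Fill a missing income with the mean income of all rows that share the same
// employment-length bin and job profile.
public class IncomeImputer {
    record Row(String empLength, String jobProfile, Double income) {}

    static Map<String, Double> groupMeans(List<Row> rows) {
        Map<String, double[]> sums = new HashMap<>(); // key -> {sum, count}
        for (Row r : rows) {
            if (r.income() == null) continue;
            String key = r.empLength() + "|" + r.jobProfile();
            double[] acc = sums.computeIfAbsent(key, k -> new double[2]);
            acc[0] += r.income();
            acc[1] += 1;
        }
        Map<String, Double> means = new HashMap<>();
        sums.forEach((k, v) -> means.put(k, v[0] / v[1]));
        return means;
    }

    static double imputedIncome(Row r, Map<String, Double> means) {
        if (r.income() != null) return r.income();
        return means.get(r.empLength() + "|" + r.jobProfile());
    }
}
```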

Removing Rows:
Loan history is the class attribute. It is a nominal attribute with two category values, i.e., defaulter and not defaulter. It also has 16 blank or missing values. Since this is the class attribute, all 16 rows that do not have a value for it are removed. There are also 50 loan history instances having the value current. These loan payments are currently in progress and there is no indication of whether they will default on their future payments or not, so these 50 rows are also deleted. In addition, there are 177 rows whose employment length value is blank but which have an income associated with them. Here we are not sure whether it is the applicant's previous job's income or the income of a co-signer or family member. We also do not have any values for the job profile and age cells of these rows. Since a lot of important data is missing in these 177 rows, they are deleted as well.

FINAL DATASET

The final dataset obtained by merging the three datasets and cleaning it has a total of 5857 rows. There are no missing values in this new dataset. The list of attributes, along with their types, is shown in Table 6.
Table 6: Final Dataset

Attribute | Type | Example Values
Age | Nominal | <=27, 28<=X<=37, >=38
Job profile | Nominal | less, moderately, highly skilled
Income | Numeric | 5000, 80000
Emp Length | Nominal | <1, 1 to 4, 4 to 7, 7+
Loan Amount | Numeric | 1200, 3500, 11000
Duration | Numeric | 36, 60 (in months)
Purpose | Nominal | Car loan, House Loan, Business Loan etc.
Housing | Nominal | rent, own, other
Loan History | Nominal | Defaulter, Not Defaulter
The new dataset obtained after merging the three datasets and cleaning it has been divided into a training set and a test set in the ratio 80:20. After splitting the data, models are built on the training set, based on the extracted data patterns, using classification algorithms. These classifier models are then evaluated on the test dataset. A sketch of this split is shown below.
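A minimal sketch of this 80:20 split using the WEKA Java API (the file name and random seed are assumptions; the report does not state how the split was randomized):

```java
import java.util.Random;

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Shuffle the cleaned merged dataset and split it 80:20 into training and test sets.
public class TrainTestSplit {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("merged_loan_data.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);              // Loan History is the class
        data.randomize(new Random(42));                            // assumed seed

        int trainSize = (int) Math.round(data.numInstances() * 0.80);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        System.out.println("Training instances: " + train.numInstances());
        System.out.println("Test instances: " + test.numInstances());
    }
}
```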

3. DATA MINING USING CLASSIFICATION ALGORITHMS

HYBRID NAÏVE BAYES DECISION TREE ALGORITHM

Naïve Bayes and Decision Trees are two of the most important classification algorithms for prediction purposes, due to their accuracy, simplicity and effectiveness. Their prediction accuracies can be increased further by combining the advantages of both algorithms in a Hybrid Naïve Bayes Decision Tree algorithm. This algorithm gives higher prediction accuracy than Naïve Bayes and Decision Tree used individually, while the time complexity does not increase by a great extent [1]. The implementation of the algorithm is divided into two parts. In the first part, Naïve Bayes and Decision Tree models are created and assessed individually on the training data. In the second part, the class probabilities obtained for every instance of the test set are combined as a weighted average, weighted by the classification accuracies obtained on the training data [1]. Finally, the result of the Hybrid Naïve Bayes Decision Tree algorithm is compared with the results of the Naïve Bayes and Decision Tree algorithms computed individually.
The algorithm [1] is as follows:
------------------------------------------ Phase 1: (Used WEKA) ------------------------------------------
INPUT: Training Data
STEPS:
1) Build a classifier model on the Training Data using a Decision Tree, denoted by (C4.5)
2) Evaluate the accuracy of this model on the Training Data, denoted by (ACC_C4.5)
3) Build a classifier model on the Training Data using Naïve Bayes, denoted by (NB)
4) Evaluate the accuracy of this model on the Training Data, denoted by (ACC_NB)
5) Return the models built along with their evaluated accuracies.
OUTPUTS: (C4.5, ACC_C4.5, NB, ACC_NB)

------------------------------------------ Phase 2: (Used JAVA) ------------------------------------------

INPUT:

  1. The models built in the first phase, i.e., C4.5 and NB
  2. Their respective accuracies ACC_C4.5 and ACC_NB
  3. A test data instance denoted by x

STEPS:

  1. For every class label c of the test instance x (in this case there are 2 class labels: c is either defaulter or not defaulter):
  2. Calculate P(c|x)_C4.5 by using the decision tree model (C4.5). The formula is as follows:
Formula reference [1]

where k is the number of training instances in the leaf node into which x falls, and ci is the class label of the ith training instance in that leaf [1].

Formula reference [1]

The value of the function is equal to 0 if both the parameters are not equal and 1 if they are equal [1].
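Since the formula images from [1] are not reproduced here, the following is a plausible reconstruction of the Laplace-corrected leaf estimate, based on the variable definitions above (with n_c denoting the number of classes, 2 in this case):

$$P(c \mid x)_{C4.5} = \frac{\sum_{i=1}^{k} \delta(c_i, c) + 1}{k + n_c}, \qquad \delta(u, v) = \begin{cases} 1 & \text{if } u = v \\ 0 & \text{otherwise} \end{cases}$$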

For this dataset, calculate P(not defaulter | x)_C4.5 and P(defaulter | x)_C4.5 by using the decision tree model (C4.5) and the above formula.

  3. Calculate P(c|x)_NB by using the Naïve Bayes classifier model (NB). The formula is as follows:

Formula reference [1]

Here, m denotes the total number of attributes, aj is the value of jth attribute of the test instance x and c as mentioned earlier, is the value of the class attribute [1].

In this case, the Prior probability i.e. P(c) [1] is calculated by using the formula:

Formula reference [1]

Here ci is the value of the class attribute of the ith training row, n is the total number of rows in the training set, and nc is the total number of classes (in this case, 2) [1], and

Formula reference [1]

The value of the function is equal to one if both the parameters are equal and zero if they are not equal [1].

The Conditional Probability P(aj|c ) [1] is calculated by using the formula:

Formula reference [1]

Here aj is the value of the jth attribute of the test instance x, aij is the value of the jth attribute of training row i, ci is the value of the class attribute of the ith training row, and nj is the total number of values that the jth attribute can take [1].
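Again, since the original formula images are missing, a plausible reconstruction of the Naïve Bayes estimates consistent with the definitions above is:

$$P(c) = \frac{\sum_{i=1}^{n} \delta(c_i, c) + 1}{n + n_c}, \qquad P(a_j \mid c) = \frac{\sum_{i=1}^{n} \delta(a_{ij}, a_j)\, \delta(c_i, c) + 1}{\sum_{i=1}^{n} \delta(c_i, c) + n_j}$$

$$P(c \mid x)_{NB} \propto P(c) \prod_{j=1}^{m} P(a_j \mid c)$$

(normalized over the two class labels so that the probabilities sum to one; this normalization is an assumption, as the exact formula from [1] is not reproduced here).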

For this dataset, calculate P(defaulter | x)_NB and P(not defaulter | x)_NB by using the Naïve Bayes model (NB).

  4. Calculate P(c|x)_C4.5-NB by using the formula given below:
Formula reference [1]
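A plausible form of this combination, consistent with the earlier description of a weighted average based on the Phase 1 training accuracies (the normalizing denominator is an assumption), is:

$$P(c \mid x)_{C4.5\text{-}NB} = \frac{ACC_{C4.5} \cdot P(c \mid x)_{C4.5} + ACC_{NB} \cdot P(c \mid x)_{NB}}{ACC_{C4.5} + ACC_{NB}}$$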

For this dataset, calculate P(defaulter | x)_C4.5-NB and P(not defaulter | x)_C4.5-NB.

  5. Find the larger of the two values of P(c|x)_C4.5-NB obtained in the previous step.

For this dataset: if P(defaulter | x)_C4.5-NB > P(not defaulter | x)_C4.5-NB, then the class label of the instance is defaulter according to the hybrid algorithm; otherwise, if P(not defaulter | x)_C4.5-NB > P(defaulter | x)_C4.5-NB, the class label of the instance is not defaulter.

OUTPUT: The class label (defaulter or not defaulter) for the test data instance x.

FINAL RESULT:

The Phase 2 steps are repeated for all the instances of the test dataset. The class label outputs obtained for every test instance are copied to an Excel file, and the actual class labels are pasted in the next column of the same file. This file is then read by a Java program, and the correctly and incorrectly classified instances are counted by comparing the two columns, along the lines of the sketch below.
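A minimal Java sketch of this Phase 2 combination and comparison, assuming the per-instance class probabilities from the two WEKA models have already been exported (class and method names are illustrative only):

```java
import java.util.List;

// Combines per-instance class probabilities from the C4.5 and Naive Bayes models
// and scores the hybrid output against the actual labels.
public class HybridNbDtEvaluator {

    /** Weighted average of the two class probabilities, weighted by training accuracy. */
    static double hybridProbability(double pC45, double pNb, double accC45, double accNb) {
        return (accC45 * pC45 + accNb * pNb) / (accC45 + accNb);
    }

    /** Returns the predicted label ("defaulter" / "not defaulter") for one test instance. */
    static String classify(double pDefC45, double pDefNb,
                           double pNotC45, double pNotNb,
                           double accC45, double accNb) {
        double pDefaulter = hybridProbability(pDefC45, pDefNb, accC45, accNb);
        double pNotDefaulter = hybridProbability(pNotC45, pNotNb, accC45, accNb);
        return pDefaulter > pNotDefaulter ? "defaulter" : "not defaulter";
    }

    /** Compares predicted labels with actual labels and prints the counts. */
    static void evaluate(List<String> predicted, List<String> actual) {
        int correct = 0;
        for (int i = 0; i < predicted.size(); i++) {
            if (predicted.get(i).equalsIgnoreCase(actual.get(i))) correct++;
        }
        System.out.printf("Correctly classified: %d / %d (%.2f%%)%n",
                correct, predicted.size(), 100.0 * correct / predicted.size());
    }
}
```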

Sample of the probabilities obtained by using the Hybrid algorithm is shown in Figure 2.1.

Figure 2.1: Probabilities obtained after completing step 4

The prediction accuracy and confusion matrix obtained by using the Hybrid Naïve Bayes Decision Tree algorithm are shown in Figure 2.2.

Figure 2.2: Hybrid Naïve Bayes Decision Tree Algorithm
Hybrid Naïve Bayes Decision Tree Accuracy: 73.80 %

Figure 2.3 shows the result of using the Naïve Bayes algorithm separately on the merged dataset.

Naïve Bayes Algorithm Result
Figure 2.3: Naïve Bayes Classification Algorithm using Weka
Naïve Bayes Accuracy: 72.40 %

Figure 2.4 shows the result of using Decision Tree (J48) Algorithm separately on the merged dataset.

Decision Tree Algorithm (J48) Result
Figure 2.4: J48 Classification Algorithm using Weka
Decision Tree J48 Accuracy: 73.26 %

The classification accuracy obtained by using the hybrid algorithm (73.80 %) shows an improvement over the accuracy of the individual classification algorithms, in this case Naïve Bayes and Decision Tree, although by a relatively small extent, as shown in Table 7. It does, however, serve the purpose of the algorithm, i.e., improving the accuracy of the individual classification algorithms without increasing the time complexity by a great extent. Table 7 compares the accuracies obtained by using the Hybrid, Naïve Bayes and Decision Tree algorithms respectively.

Table 7: Comparison of Prediction Accuracies

Classification Algorithm | Accuracy | Correctly Classified | Incorrectly Classified
Hybrid Naïve Bayes Decision Tree | 73.80 % | 852/1156 | 304/1156
Naïve Bayes | 72.40 % | 837/1156 | 319/1156
Decision Tree (J48) | 73.26 % | 847/1156 | 309/1156

ENSEMBLE METHODS

Ensemble methods either use more than one data mining algorithm or use one data mining algorithm multiple times, in order to improve the prediction accuracy compared to using a single algorithm on the dataset.

1) BOOSTING ALGORITHM

In the first iteration of the Boosting algorithm, a classification model is created on the training data using a data mining algorithm. The second iteration creates a classification model that concentrates on the instances or rows that were incorrectly classified in the first iteration [2]. This process goes on until some constraint is reached with regard to the accuracy or the number of models [2]. The aim of using the Boosting ensemble method is to get better results than the individual classification algorithms.

For this dataset, the AdaBoostM1 classification algorithm is used. AdaBoostM1 is tried with different base classifiers such as the J48 decision tree, Naïve Bayes, SVM and Neural Network in order to find the base classification algorithm that gives the best results for this dataset. In this case, the J48 decision tree used as the base classifier gives the best results, so the final analysis uses AdaBoostM1 with J48. Various numbers of iterations, i.e., the number of successive models to be created (N), were also tried. Values of N above 30 gave either the same or a lower classification accuracy, so N = 30 is selected since it gives the best prediction results. A sketch of this configuration using the WEKA API is shown below.
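A sketch of this AdaBoostM1 configuration using the WEKA Java API (file names are placeholders; the report used the WEKA GUI for the same configuration):

```java
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// AdaBoostM1 with J48 as the base classifier and N = 30 successive models.
public class BoostingExample {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("merged_train.arff");   // hypothetical path
        Instances test = DataSource.read("merged_test.arff");     // hypothetical path
        train.setClassIndex(train.numAttributes() - 1);           // Loan History is the class
        test.setClassIndex(test.numAttributes() - 1);

        AdaBoostM1 booster = new AdaBoostM1();
        booster.setClassifier(new J48());     // J48 decision tree as the base classifier
        booster.setNumIterations(30);         // N = 30 successive models
        booster.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(booster, test);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}
```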

Figure 3.1 shows the result of using AdaBoostM1 with J48 as the base classifier and Number of successive models to be created (N) = 30.

AdaBoostM1 Algorithm using J48 as base classifier - Result
Figure 3.1: AdaBoostM1 Algorithm using Weka
AdaBoostM1 using J48 Accuracy: 74.31 %

Table 8 shows the prediction accuracies of the AdaBoostM1 classification algorithm using different base classifier algorithms and different number of successive models to be created i.e. N.

Table 8: AdaBoostM1 Classification Algorithm Accuracy Results (N = number of successive models)

Base Classification Algorithm | N=3 | N=10 | N=20 | N=30
J48 Decision Tree | 71.19 % | 74.22 % | 74.22 % | 74.31 %
Naïve Bayes | 72.40 % | 72.40 % | 72.40 % | 72.40 %
Support Vector Machines (SVM) | 72.31 % | 70.24 % | 70.24 % | 70.24 %
K Nearest Neighbors (KNN) | 71.71 % | 71.82 % | 72.16 % | 72.16 %

The base classification algorithm used in this case is therefore the J48 decision tree, since it gives the best results for this dataset compared to the other individual algorithms, and the number of successive models is 30. The aim of using the AdaBoostM1 ensemble method is to get better prediction results than the individual classification algorithm, in this case J48.

2 ) BAGGING ALGORITHM

Bagging is a type of ensemble that divides the entire training data into various small samples and then creates a separate classifier model for every sample [2]. The results obtained from all these classifier models are finally merged using techniques like majority voting or averaging of the results [2]. The main advantage here is that each sample obtained from the training set is unique, so every classifier model that is created will be trained on a slightly different, unexplored part of the problem.

Like AdaBoostM1, the Bagging algorithm is also tried with different base classifiers such as the J48 decision tree, Naïve Bayes, SVM and Neural Network in order to find the base classification algorithm that gives the best results for this dataset. In this case, the J48 decision tree used as the base classifier gives the best results, so the final analysis uses Bagging with J48. Various numbers of iterations, i.e., the number of samples to be created (N), were also tried. Values of N above 30 gave either the same or a lower classification accuracy, so N = 30 is selected since it gives the best prediction results. A sketch of the corresponding WEKA setup follows.
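Analogously, a sketch of the Bagging configuration with the WEKA Java API (again with placeholder file names):

```java
import weka.classifiers.Evaluation;
import weka.classifiers.meta.Bagging;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Same pattern as the boosting sketch, but with WEKA's Bagging meta-classifier.
public class BaggingExample {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("merged_train.arff");   // hypothetical path
        Instances test = DataSource.read("merged_test.arff");     // hypothetical path
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        Bagging bagger = new Bagging();
        bagger.setClassifier(new J48());   // J48 as the base classifier
        bagger.setNumIterations(30);       // N = 30 bootstrap samples / models
        bagger.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(bagger, test);
        System.out.println(eval.toSummaryString());
    }
}
```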

Figure 3.2 shows the result of using Bagging with J48 as the base classifier and Number of samples to be created (N) = 30.

Bagging Algorithm using J48 as base classifier - Result
Figure 3.2: Bagging Algorithm using Weka
Bagging using J48 Accuracy: 75.08 %

Here, the inbuilt Bagging algorithm of WEKA is used.

Table 9 shows the prediction accuracies of the Bagging classification algorithm using different base classifier algorithms and different number of samples to be created i.e. N.

Table 9: Bagging Classification Algorithm Results (N = number of samples)

Base Classification Algorithm | N=3 | N=10 | N=20 | N=30
J48 Decision Tree | 73.78 % | 74.39 % | 74.91 % | 75.08 %
Naïve Bayes | 72.49 % | 71.51 % | 72.63 % | 72.63 %
Support Vector Machines (SVM) | 72.31 % | 72.31 % | 72.31 % | 72.31 %
K Nearest Neighbors (KNN) | 71.10 % | 71.79 % | 71.80 % | 71.80 %

The base classification algorithm used in this case is therefore the J48 decision tree, since it gives the best results for this dataset compared to the other individual algorithms, and the number of samples is 30. The aim of using the Bagging ensemble method is to get better prediction results than the individual classification algorithm, in this case J48. The prediction accuracies obtained by using J48 individually and J48 with bagging and boosting are compared in Table 10.

Table 10: Comparison of Prediction Accuracies

Classification Algorithm | Accuracy | Correctly Classified | Incorrectly Classified
J48 | 73.26 % | 847/1156 | 309/1156
AdaBoostM1 | 74.31 % | 859/1156 | 297/1156
Bagging | 75.08 % | 868/1156 | 288/1156

ARTIFICIAL NEURAL NETWORK

Nowadays, Artificial Neural Networks are considered a well-established method for evaluating the loan applications received by banks and for making approval or rejection decisions. Here, a classifier model is built on the merged dataset using both a single-layer and a multilayer feed-forward neural network.

A single-layer feed-forward neural network [5] consists of an input layer, which contains all the attributes used except the class attribute, one hidden layer with some number of neurons (specified in the code), and an output layer, which consists of the class attribute.

A multilayer feed-forward neural network [5] consists of an input layer, which contains all the attributes used except the class attribute, multiple hidden layers with some number of neurons (the number of hidden layers and neurons is specified in the code), and an output layer, which consists of the class attribute. In this project, the multilayer network uses two hidden layers.

Both configurations have been implemented in the R environment, using the neuralnet package.

The basic command for running this algorithm is:

NeuralNetworkResult <- neuralnet (formula, dataset, hidden, algorithm, stepmax)

Here the argument meanings are as follows:

1) Formula : It specifies the classifier attribute and then lists all the attributes to be considered for building the model on the class attribute.

2) Dataset: This argument will have the name of the dataset, on which model will be built on.

3) Hidden: This argument specifies, the number of hidden layers i.e. whether it is Single layered or Multi Layered neural network. It also specifies the number of neurons a layer would have.

4) Algorithm: This argument specifies the algorithm used to create the neural network. By default, the algorithm is resilient backpropagation, indicated by rprop+.

5) Stepmax: This argument specifies the maximum number of steps that can be used to build the neural network.

The Basic code used for building a Single Layered Neural Network in this case is as follows:

nn <- neuralnet(Not_defaulter + defaulter ~ age + Job_profile + income + emp_length + Loan.amount + Loan.Duration + Purpose + home_ownership, data = nnet_train, algorithm = 'rprop+', hidden = 3, stepmax = 1e6)

(Here the Single hidden Layer would have 3 neurons)

The basic code used for building a Multilayered Neural Network in this case is as follows:

n <- neuralnet(Not_defaulter + defaulter ~ age + Job_profile + income + emp_length + Loan.amount + Loan.Duration + Purpose + home_ownership, data = nnet_train, algorithm = 'rprop+', hidden = c(3,2), stepmax = 1e6)

(Here the first hidden Layer would have 3 neurons and the second would have 2 neurons)

Figure 4.1 shows the network obtained by using Single Layer Neural network.

Figure 4.1: Single Layer Feed-Forward Neural Network using R

Sample of Code Snippet and Result obtained by using Single Layer Neural Network is shown in Figure 4.2.

Figure 4.2: Single Layer (3 Hidden Neurons) Neural Network Accuracy: 71.62 %

Figure 4.3 shows the network obtained by using Multilayer Neural network.

Figure 4.3: Multi-Layer Feed-Forward Neural Network using R

Result obtained by using Multilayer Neural Network is shown in Figure 4.4.

Figure 4.4: Multi-Layer (5 Hidden Neurons) Neural Network Accuracy: 72.57 %

Table 11: Comparing the Results of Single Layer and Multilayer Neural Networks

Classification Algorithm | Accuracy | Correctly Classified | Incorrectly Classified
Single Layer Neural Network | 71.62% | 828 / 1156 | 328 / 1156
Multi Layer Neural Network | 72.57% | 839 / 1156 | 317 / 1156

4. Analyzing Single Dataset (Lending Club) using Classification Algorithms

Here, some of the algorithms used in the earlier milestones are applied again to build models on the Lending Club dataset, and the resulting models are evaluated for their accuracy. This dataset has 4586 instances and 24 attributes. It has been divided into a training set and a test set in the ratio 80:20. After splitting the data, models are built on the training set, based on the extracted data patterns, using classification algorithms. These classifier models are then evaluated on the test dataset.

The attributes of this dataset are as follows:

Loan_amount - Numeric
Loan Term - Nominal
Installment_rate - Nominal
Installment - Numeric
Grade - Nominal
Sub_grade - Nominal
Employment_length - Nominal
Home_ownership - Nominal
Annual_income - Numeric
Loan_purpose - Nominal
Zip_code - Nominal
Verification_status - Nominal
Address_state - Nominal
Issue_date - Nominal
Earliest_credit_line - Nominal
Inquiry_last_6months - Numeric
Open_accounts - Numeric
Public_record - Numeric
Revolving_balance - Numeric
Revolving_until - Nominal
Total_accounts - Numeric
Total_payment - Numeric
loan_status - Nominal

Figure 5.1 shows the result of using Decision Tree (J48) Algorithm on the Lending Club Dataset.

Figure 5.1: J48 Classification Algorithm using Weka
Decision Tree J48 Accuracy: 77.68 %

Figure 5.2 shows the result of using the Naïve Bayes algorithm on the Lending Club dataset.

Figure 5.2: Naïve Bayes Classification Algorithm using Weka
Naïve Bayes Accuracy: 74.17 %

Figure 5.3 shows the result of AdaBoostM1 Algorithm using Decision Stump as base classifier on Lending Club Dataset.

Figure 5.3: AdaBoostM1 Algorithm using Weka
AdaBoostM1 using Decision Stump Accuracy: 90.49 %

Figure 5.4 shows the result of Bagging Algorithm using – J48 as base classifier on Lending Club Dataset.

Figure 5.4: Bagging Algorithm using Weka
Bagging using J48 Accuracy: 85.43 %

5. RESULTS AND ANALYSIS:

In this section, we study and analyze the results obtained by building models on the Loan Approved merged dataset and the Lending Club dataset using various classification algorithms. We also look for insight into the most relevant attributes, i.e., those that help most in predicting the results correctly. The prediction accuracies obtained on the merged dataset and the Lending Club dataset using various classification algorithms are shown in Table 12 and Table 13 respectively.

Table 12: Comparison of the results obtained on the Merged dataset

Algorithm Used | Classification Accuracy | Correctly Classified | Incorrectly Classified
Naïve Bayes | 72.40% | 837 / 1156 | 319 / 1156
J48 | 73.26% | 847 / 1156 | 309 / 1156
Naïve Bayes J48 Hybrid | 73.80% | 852 / 1156 | 305 / 1156
Boosting (AdaBoostM1) | 74.31% | 859 / 1156 | 297 / 1156
Bagging | 75.08% | 868 / 1156 | 288 / 1156
Single Layer Neural Network | 71.62% | 828 / 1156 | 328 / 1156
Multi Layer Neural Network | 72.57% | 839 / 1156 | 317 / 1156

(The highest accuracy was obtained by using Bagging.)

Table 13: Comparison of the results obtained on Lending Club dataset

Algorithm Used | Classification Accuracy | Correctly Classified | Incorrectly Classified
Naïve Bayes | 74.17% | 718 / 968 | 250 / 968
J48 | 77.68% | 752 / 968 | 216 / 968
Boosting (AdaBoostM1) | 90.49% | 876 / 968 | 92 / 968
Bagging | 85.43% | 827 / 968 | 141 / 968

(The highest accuracy was obtained by using Boosting.)

Observations and Analysis

  1. As we can see, the classification accuracies obtained on the single Lending Club dataset are relatively higher, and in some cases much higher, than the classification accuracies obtained on the merged dataset using the same algorithms. Let us analyze the results obtained for both datasets using the J48 and Bagging algorithms.

Figures 6.1 and 6.2 show the confusion matrices for the merged dataset and the Lending Club dataset obtained using the J48 algorithm.

Figure 6.1: Confusion Matrix for Merged Dataset using J48
  a    b   <-- classified as
 736   49 | a = not defaulter
 260  111 | b = defaulter

Figure 6.2: Confusion Matrix for Lending Club Dataset using J48
  a    b   <-- classified as
 606   78 | a = not defaulter
 138  146 | b = defaulter

Figures 6.3 and 6.4 show the confusion matrices for the merged dataset and the Lending Club dataset obtained using the Bagging algorithm.

Figure 6.3: Confusion Matrix for Merged Dataset using Bagging
  a    b   <-- classified as
 750   35 | a = not defaulter
 253  118 | b = defaulter

Figure 6.4: Confusion Matrix for Lending Club Dataset using Bagging
  a    b   <-- classified as
 682    2 | a = not defaulter
 139  145 | b = defaulter

According to the figure 6.1, the J48 algorithm correctly predicts 847 instances out of 1156 instances. Here the classification accuracy is 73.26 %. Figure 6.2 states that the J48 algorithm correctly predicts 752 instances out of 968 instances. The classification accuracy in this case is 77.68 %. The confusion matrix for merged dataset using Bagging algorithm shown in figure 6.3, states that 868 instances are correctly classified out of 1156 instances. The classification accuracy is 75.08 %. The confusion matrix for lending club dataset using Bagging shown in figure 6.4 states that, 827 instances are correctly classified out of 968 instances. The classification accuracy is 85.43 %.

So the classification accuracy obtained for the Lending Club dataset is relatively higher than the accuracy obtained on the merged dataset. While merging the datasets, all the uncommon attributes were removed. The lower prediction accuracy for the merged dataset compared to the Lending Club dataset can therefore be due to the fact that some attributes carrying useful information for identifying instances as defaulters or not defaulters are missing. These attributes would have helped the algorithms gain a much better understanding of the patterns in the dataset while creating a model, thereby improving the prediction accuracy.

  2. Another thing to notice in both datasets, especially the merged dataset, is that although the overall prediction accuracy is good, the algorithms used are not very good at correctly predicting the defaulters, while being excellent at predicting the non-defaulters.

Figures 6.5 and 6.6 show the confusion matrices for the merged dataset and the Lending Club dataset obtained using the J48 algorithm.

Figure 6.5: Confusion Matrix for Merged Dataset using J48
  a    b   <-- classified as
 736   49 | a = not defaulter
 260  111 | b = defaulter

Figure 6.6: Confusion Matrix for Lending Club Dataset using J48
  a    b   <-- classified as
 606   78 | a = not defaulter
 138  146 | b = defaulter

As per figure 6.5, the J48 algorithm correctly classifies 736 instances as not defaulter out of a total of 785 instances whose class was actually not defaulter, a classification accuracy of 93.7% for the not defaulter class. On the other hand, it correctly classifies only 111 instances as defaulter out of 371 instances whose class was actually defaulter, a classification accuracy of just 29.9% for the defaulter class.

According to figure 6.6, the J48 algorithm correctly classifies 606 instances as not defaulter out of 684 instances whose class was actually not defaulter, a classification accuracy of 88.59% for the not defaulter class. On the other hand, it correctly classifies 146 instances as defaulter out of 284 instances whose class was actually defaulter, a classification accuracy of 51.4% for the defaulter class.

Let us consider one more example of another classification algorithm on both the datasets.

Figures 6.7 and 6.8 show the confusion matrices for the merged dataset and the Lending Club dataset obtained using the Naïve Bayes algorithm.

Figure 6.7: Confusion Matrix for Merged Dataset using Naïve Bayes
  a    b   <-- classified as
 728   57 | a = not defaulter
 262  109 | b = defaulter

Figure 6.8: Confusion Matrix for Lending Club Dataset using Naïve Bayes
  a    b   <-- classified as
 537  147 | a = not defaulter
 103  181 | b = defaulter

In the case of the merged dataset, Naïve Bayes has a classification accuracy of 92.73% for the instances whose class is not defaulter, and an accuracy of only 29.38% for the instances whose class is defaulter.

On the other hand, in the case of the Lending Club dataset, Naïve Bayes has a classification accuracy of 79% for the not defaulter instances and 63.73% for the defaulter instances.

As we can see, the prediction accuracy for the not defaulter instances remains very good for both datasets, but the accuracy for the defaulter instances is not that good for either dataset, especially the merged dataset.

The train-test split was also set to 70:30 for both datasets to see if there was any change in the behavior of the models. The prediction accuracy remained almost the same; the overall accuracy was reduced only by a small margin, and there was hardly any difference in the prediction accuracy of the defaulter class instances. The reason for the poor prediction accuracy of defaulter instances might be that the attributes in both datasets provide adequate information about the characteristics of a non-defaulter, but do not reveal any vital information that can be used to correctly classify an applicant as a defaulter. In the case of the merged dataset, the reason can also be the removal of uncommon attributes like applicant address, interest rate, applicant grade and credit enquiries in the last 6 months while merging the 3 datasets, or the presence of many missing values which were estimated during data cleaning.

One major reason for the low accuracy of defaulter instances could also be that, in both the datasets, the number of rows having class as not defaulter is much greater than the number of rows having class as defaulter. So the datasets have class imbalance.

All these factors might have led to a lack or loss of the information needed to clearly differentiate the two classes. The result is that all the algorithms are biased towards predicting an applicant as not defaulter.

To tackle this problem, the Cost Sensitive Learning method has been used. This method punishes the algorithm for falsely classifying defaulter instances as not defaulters. In this approach, although the overall prediction accuracy goes down by some margin, the prediction accuracy for defaulters goes up considerably.

COST SENSITIVE LEARNING

The Cost Sensitive Learning method can be used with any algorithm, such as Naïve Bayes, Decision Tree, Bagging etc.

For our datasets, it has the following default Cost Matrix. The values in it are the cost sensitive weights.

Default Cost Matrix

0.0 1.0

1.0 0.0

New Cost Matrix after changing some weight values.

0.0 1.0

2.1 0.0

Various weight values were tried, and the above cost matrix was finally used since it balanced the classifier result considerably and gave decent prediction results. The sketch below illustrates how such a cost matrix changes the decision rule.
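As an illustration of how such a cost matrix changes the decision rule (this sketch is not the report's WEKA setup), the class with the minimum expected cost is chosen instead of the class with the maximum probability; with the weight of 2.1 on missed defaulters, an instance with P(defaulter) = 0.35 is already classified as a defaulter:

```java
// Illustrative sketch of cost-sensitive classification: choose the class that
// minimises expected cost rather than the class with the highest probability.
// Convention used here (an assumption): cost[i][j] = cost of predicting class j
// when the true class is i, with class 0 = not defaulter and class 1 = defaulter.
public class CostSensitiveDecision {
    static final double[][] COST = {
            {0.0, 1.0},   // true: not defaulter
            {2.1, 0.0}    // true: defaulter (missed defaulters are penalised more)
    };

    /** Returns 0 (not defaulter) or 1 (defaulter) given the class probabilities. */
    static int decide(double pNotDefaulter, double pDefaulter) {
        double costPredictNot = pNotDefaulter * COST[0][0] + pDefaulter * COST[1][0];
        double costPredictDef = pNotDefaulter * COST[0][1] + pDefaulter * COST[1][1];
        return costPredictDef < costPredictNot ? 1 : 0;
    }

    public static void main(String[] args) {
        // With default 0/1 costs this instance would be classified as not defaulter;
        // with the 2.1 weight it tips over to defaulter.
        System.out.println(decide(0.65, 0.35)); // prints 1
    }
}
```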

Naive Bayes Result without using Cost Sensitive Learning on Merged Dataset is shown in Figure 7.1.

Figure 7.1: Defaulter instances prediction accuracy: 29.38 % (Overall accuracy: 72.40 %)

Naive Bayes result with Cost Sensitive Learning on Merged Dataset is shown in Figure 7.2.

Figure 7.2: Defaulter instances prediction accuracy: 61.01 % (Overall accuracy: 64.96 %)

Although the overall accuracy of the classifier drops to around 65%, the prediction results for defaulter class instances improve considerably, and the result no longer appears biased towards predicting an applicant as not defaulter. Similarly, the Cost Sensitive Learning method is used with some other algorithms on both datasets, as shown in Table 14 and Table 15.

Table 14: Results obtained on Merged dataset with and without Cost Sensitive Learning (CSL)

Algorithm Used | Overall Accuracy without CSL | Defaulter Accuracy without CSL | Overall Accuracy with CSL | Defaulter Accuracy with CSL
Naïve Bayes | 72.40 % | 29.38 % | 64.96 % | 61.01 %
J48 | 73.26 % | 29.91 % | 67.21 % | 57.60 %
Boosting (AdaBoostM1) | 74.30 % | 36.92 % | 65.83 % | 55.97 %
Bagging | 75.08 % | 31.80 % | 65.74 % | 57.41 %

Table 15: Results obtained on Lending Club dataset with and without Cost Sensitive Learning (CSL)

Algorithm Used | Overall Accuracy without CSL | Defaulter Accuracy without CSL | Overall Accuracy with CSL | Defaulter Accuracy with CSL
Naïve Bayes | 74.17 % | 63.78 % | 69.52 % | 73.94 %
J48 | 77.68 % | 51.14 % | 71.07 % | 60.21 %
Boosting (AdaBoostM1) | 90.49 % | 69.01 % | 85.43 % | 75.70 %
Bagging | 85.43 % | 51.0 % | 85.12 % | 52.81 %

Looking at both tables, it can be seen that the use of Cost Sensitive Learning reduces the overall prediction accuracy by some margin, but the prediction accuracy for defaulter instances increases considerably, and the classification results no longer look biased towards predicting instances as not defaulter, thereby reducing the classification imbalance.

  3. The classification accuracy obtained by using the hybrid algorithm (73.80 %) is higher than the accuracy of the individual classification algorithms, in this case Naïve Bayes (72.40 %) and Decision Tree (73.26 %), although by a relatively small extent. In this approach, a Naïve Bayes classifier is built on every leaf node of the decision tree [1]. The hybrid algorithm combines the advantages of both Naïve Bayes and Decision Tree, and thus serves the purpose of improving the accuracy of the individual classification algorithms without increasing the overall time complexity by a great extent.
  4. Another important observation is that, when we use the AdaBoostM1 ensemble method on both datasets, the overall accuracy of predicting the applicant correctly, as either not defaulter or defaulter, increases compared to the accuracy obtained by using any individual algorithm. For example, the J48 algorithm gives a prediction accuracy of 73.26 % on the Loan Approved merged dataset, whereas the AdaBoostM1 method with J48 as the base classifier gives a highest prediction accuracy of 74.31 % on the same dataset. This is because AdaBoostM1 creates successive models on the training data, and each model focuses primarily on the instances that were incorrectly classified by the model in its previous iteration [2]. So, after every iteration, the prediction accuracy either remains the same or increases compared to the previous model. This successive model creation goes on until some limit is reached in terms of accuracy or the number of models.

  5. Similarly, when we use the Bagging ensemble method on both datasets, the overall accuracy of predicting the applicant correctly, as either not defaulter or defaulter, increases compared to the accuracy obtained by using any individual algorithm. For example, the J48 algorithm gives a prediction accuracy of 73.26 % on the Loan Approved merged dataset, whereas Bagging with J48 as the base classifier gives a highest prediction accuracy of 75.08 % on the same dataset. This is because the Bagging algorithm creates multiple samples by dividing the training data, and a separate classifier model is created for every sample. All these classifier model results are then combined into one model. The advantage of this is that every data sample is unique, so the model created on each sample carries some unique additional information, which adds up and helps in achieving a more accurate prediction performance [2].
  6. In order to gain insight into the most relevant attributes, we use Information Gain as the attribute evaluator and Ranker search as the search method on the Loan Approved merged dataset (a sketch of this attribute evaluation appears after this list). According to this, Loan_Duration, emp_length and age are the most important factors for predicting the class of the loan applicant (whether the applicant will default or not) in the case of the merged dataset. Job profile and home_ownership are found to be the least significant attributes for prediction.

The same method is used on the Lending Club dataset as well. According to this method, total_payment, zipcode and interest rate are the most important factors for predicting the class of the loan applicant (whether the applicant will default or not) in the case of the Lending Club dataset. The installment amount and the total number of accounts of the applicant are found to be the least significant attributes for prediction.

  7. Out of all the classification algorithms used on the merged dataset, the Bagging algorithm with J48 as its base classifier gives the best overall prediction accuracy, i.e., 75.08 %.
  8. Out of all the classification algorithms used on the Lending Club dataset, the AdaBoostM1 algorithm with Decision Stump as its base classifier gives the best overall prediction accuracy, i.e., 90.49 %.
  9. Although the overall prediction accuracy is slightly reduced, Cost Sensitive Learning used with Naïve Bayes gives the best prediction accuracy for defaulter instances of the merged dataset, i.e., 61.01 %.
  10. Although the overall prediction accuracy is slightly reduced, Cost Sensitive Learning used with Boosting gives the best prediction accuracy for defaulter instances of the Lending Club dataset, i.e., 75.70 %.
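A sketch of the attribute evaluation referred to in observation 6, using the WEKA Java API (the file name is a placeholder; the report used the WEKA GUI for this step):

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Rank the attributes of the merged dataset by Information Gain with respect to the class.
public class AttributeRelevance {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("merged_loan_data.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);              // Loan History is the class

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        selector.setSearch(new Ranker());                          // ranks all attributes
        selector.SelectAttributes(data);

        System.out.println(selector.toResultsString());            // ranked list of attributes
    }
}
```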

6. CONCLUSION AND FUTURE WORK

CONCLUSION:

  • Although the overall prediction accuracy is good for both datasets, the prediction accuracy for defaulter instances is not that good with any of the algorithms. The major reason for this could be class imbalance, i.e., the high number of instances having class not defaulter, which results in biased output.
  • The prediction accuracy for defaulter instances obtained by using Cost Sensitive Learning is considerably better than the results obtained without it, and the overall classification results also appear relatively balanced.
  • Out of all the classification algorithms used on the Lending Club dataset, the AdaBoostM1 algorithm with Decision Stump as its base classifier gives the best overall prediction accuracy.
  • Out of all the classification algorithms used on the merged dataset, the Bagging algorithm with J48 as its base classifier gives the best overall prediction accuracy.
  • Loan_Duration, emp_length and age are the most important factors for predicting the class of the loan applicant (whether the applicant will default or not) in the case of the merged dataset.
  • Total_payment, zipcode and interest rate are the most important factors for predicting the class of the loan applicant (whether the applicant will default or not) in the case of the Lending Club dataset.

FUTURE WORK:

  • Time series analysis can be done using loan data from several years, to predict the approximate time when a client may default.
  • Future analysis can be done on predicting the approximate interest rate that a loan applicant can expect for his profile if his loan is approved. This can be useful for loan applicants, since some banks approve loans but give very high interest rates to the customer. It would give customers a rough insight into the interest rates they should be getting for their profile, and it will help make sure they don't end up paying much more in interest to the bank.
  • An application can be built which takes various inputs from the user, such as employment length, salary, age, marital status, SSN, address, loan amount and loan duration, and predicts whether their loan application is likely to be approved by the banks, along with an approximate interest rate.

7. REFERENCES:

  1. Jiang, Liangxiao, and Chaoqun Li. "Scaling Up the Accuracy of Decision-Tree Classifiers: A Naive-Bayes Combination." Journal of Computers 6.7 (2011): 1325-1331.
  2. Tiwari, Aakash, and Aditya Prakash. "Improving classification of J48 algorithm using bagging, boosting and blending ensemble methods on SONAR dataset using WEKA."
  3. Breiman, Leo (1996). "Bagging predictors." Machine Learning 24(2): 123-140.
  4. Atiya, Amir F. "Bankruptcy prediction for credit risk using neural networks: A survey and new results." IEEE Transactions on Neural Networks 12.4 (2001): 929-935.
  5. Khan, Azeem Ush Shan, Nadeem Akhtar, and Mohammad Naved Qureshi. "Real-time credit-card fraud detection using artificial neural network tuned by simulated annealing algorithm." Proceedings of International Conference on Recent Trends in Information, Telecommunication and Computing, ITC. 2014.