1. Introduction

In this project I use a modified version of a popular credit dataset, South German Credit, sourced from the UCI Machine Learning Repository. The dataset contains 1,000 observations with 20 predictor variables. I use the scikit-learn and xgboost libraries to build and evaluate two methods (Logistic Regression and Gradient Boosted Decision Trees) for predicting loan defaults. The attribute coding in this dataset makes some of the predictor variables difficult to work with or irrelevant; therefore, I do not use all 20 predictor variables.

Content

The predictor variables used in my models are:

amount : This is a continuous variable representing the loan amount in Deutsche Marks

age : This is the age of the customer applying for the loan

Status : This is a categorical variable which represents the status of the customer's checking account.

                    - When Status = 1: The customer does not have a checking account
                         - Status = 2: The customer has a checking account balance less than 0DM
                         - Status = 3: The customer has a checking account balance between 0 and 200DM
                         - Status = 4: The customer has a checking account balance greater than 200DM  

Housing : This is a categorical variable representing the customer's property type.

                    - When Housing = 1: The customer _rents_ their primary residence
                         - Housing = 2: The customer _owns_ their primary residence
                         - Housing = 3: The customer does not pay rent on & does not own their primary residence  

Credit_History : This is a categorical variable representing the customer's credit history.

                    - When Credit_History = 0: The customer has a history of delayed payments  
                         - Credit_History = 1: The customer has a critical account/other credits elsewhere  
                         - Credit_History = 2: The customer has not taken credit in past or credit has been paid back
                         - Credit_History = 3: The customer's existing credits paid back duly until now
                         - Credit_History = 4: The customer has paid back all previous credits on time with this bank

job : This is a categorical variable representing the type of job held by the customer.

                    - When job = 1: The customer is unemployed or an unskilled non-resident
                         - job = 2: The customer is an unskilled resident
                         - job = 3: The customer is a skilled employee/official
                         - job = 4: The customer is a manager/self-employed/highly qualified employee


employment_duration : This is a categorical variable representing the customer's employment duration.

                    - When employment_duration = 1: The customer is unemployed
                         - employment_duration = 2: The customer has been employed for less than 1 year
                         - employment_duration = 3: The customer has been employed between 1 and 4 years
                         - employment_duration = 4: The customer has been employed between 4 and 7 years
                         - employment_duration = 5: The customer has been employed for more than 7 years

savings : This is a categorical variable representing the customer's savings account balance.

                    - When savings = 1: The customer does not have a savings account (or it is unknown)
                         - savings = 2: The customer has less than 100DM in their savings
                         - savings = 3: The customer has between 100DM and 500DM in their savings
                         - savings = 4: The customer has between 500DM and 1000DM in their savings
                         - savings = 5: The customer has more than 1000DM in their savings

credit_risk : This is a binary variable representing whether or not the customer is considered a good risk.

                    - When credit_risk = 0: The customer is a bad credit risk (default)
                         - credit_risk = 1: The customer is a good credit risk (non-default)
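
The code tables above can be kept next to the data as plain dictionaries, which makes plots and summaries easier to read. The following is a minimal sketch: the names `CODEBOOK` and `decode` are illustrative and not part of the notebook, and only two of the variables are shown.

```python
# Hypothetical lookup tables built from the code lists above (two variables shown).
CODEBOOK = {
    "status": {
        1: "no checking account",
        2: "balance < 0 DM",
        3: "balance 0 to 200 DM",
        4: "balance > 200 DM",
    },
    "credit_risk": {0: "bad (default)", 1: "good (non-default)"},
}


def decode(variable, code):
    """Return the label for a coded value, or the raw code if it is unknown."""
    return CODEBOOK.get(variable, {}).get(code, code)


print(decode("status", 2))
print(decode("credit_risk", 0))
```

Unknown variables or codes fall through unchanged, so the helper can be applied to any column without raising an error.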

2. Importing libraries & data

In [15]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import warnings
warnings.filterwarnings("ignore")
# Importing the data
df = pd.read_csv("C:/Users/camer/OneDrive/Desktop/Python_Fin_Projects/SouthGermanCredit.asc",sep=' ')

3. Exploring the Dataset

  • Check the datatypes present
  • Inspect the top few rows
  • Check for nulls
  • Make any adjustments
In [2]:
# Begin inspecting the data
print(df.dtypes)
df.head()
laufkont    int64
laufzeit    int64
moral       int64
verw        int64
hoehe       int64
sparkont    int64
beszeit     int64
rate        int64
famges      int64
buerge      int64
wohnzeit    int64
verm        int64
alter       int64
weitkred    int64
wohn        int64
bishkred    int64
beruf       int64
pers        int64
telef       int64
gastarb     int64
kredit      int64
dtype: object
Out[2]:
laufkont laufzeit moral verw hoehe sparkont beszeit rate famges buerge ... verm alter weitkred wohn bishkred beruf pers telef gastarb kredit
0 1 18 4 2 1049 1 2 4 2 1 ... 2 21 3 1 1 3 2 1 2 1
1 1 9 4 0 2799 1 3 2 3 1 ... 1 36 3 1 2 3 1 1 2 1
2 2 12 2 9 841 2 4 2 2 1 ... 1 23 3 1 1 2 2 1 2 1
3 1 12 4 0 2122 1 3 3 3 1 ... 1 39 3 1 2 2 1 1 1 1
4 1 12 4 0 2171 1 3 4 3 1 ... 2 38 1 2 2 2 2 1 1 1

5 rows × 21 columns

In [16]:
# All the data is represented as int datatype but headers are in German - I will convert to English using guide on UCI website
list1 = ['status','duration','credit_history','purpose','amount','savings','employment_duration','installment_rate',
         'personal_status_sex','other_debtors','present_residence','property','age','other_installment_plans'
        ,'housing','number_credits','job','people_liable','telephone','foreign_worker','credit_risk']
df.columns = list1
df.head()
Out[16]:
status duration credit_history purpose amount savings employment_duration installment_rate personal_status_sex other_debtors ... property age other_installment_plans housing number_credits job people_liable telephone foreign_worker credit_risk
0 1 18 4 2 1049 1 2 4 2 1 ... 2 21 3 1 1 3 2 1 2 1
1 1 9 4 0 2799 1 3 2 3 1 ... 1 36 3 1 2 3 1 1 2 1
2 2 12 2 9 841 2 4 2 2 1 ... 1 23 3 1 1 2 2 1 2 1
3 1 12 4 0 2122 1 3 3 3 1 ... 1 39 3 1 2 2 1 1 1 1
4 1 12 4 0 2171 1 3 4 3 1 ... 2 38 1 2 2 2 2 1 1 1

5 rows × 21 columns

In [4]:
# Checking for null values - looks like there are none
sns.heatmap(df.isnull(),cbar=False)
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x20ab730ffd0>

3.1 Visualizations

  • Try to better understand the distributions of the continuous variables and relationship with credit risk
In [52]:
# Plot the distribution of loan amounts
sns.distplot(df.amount,bins='auto',color='b')
Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x19750e71668>
In [53]:
# Visualize distribution of loan amount vs age
sns.jointplot(x='amount',y='age',data=df)
Out[53]:
<seaborn.axisgrid.JointGrid at 0x19750e090b8>
In [54]:
sns.distplot(df.age,bins='auto')
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x19750ea4128>
In [55]:
# Create a cross table of housing situation, credit risk, and current account balance (status)
print(pd.crosstab(df['housing'], df['credit_risk'], df['status'], aggfunc='mean'))
credit_risk         0         1
housing                        
1            1.785714  2.678899
2            1.978495  2.926136
3            1.772727  2.682540
In [56]:
print(df['status'].value_counts())
sns.countplot(x='status',hue='credit_risk',data=df)
4    394
1    274
2    269
3     63
Name: status, dtype: int64
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x19750dc99e8>
In [17]:
# Modify the features so they are all binary
df.drop(df[df.status == 1].index, inplace=True) # Analysing only people with checking accounts
df.drop(df[df.housing == 3].index, inplace=True) # Analysing only people who rent or own a property.
dummy_col = ['status','housing','credit_history','job','employment_duration','savings']
for col in dummy_col:
    df = df.merge(pd.get_dummies(df[col], prefix=col, drop_first=True), left_index=True, right_index=True)
df = df.merge(pd.get_dummies(df.credit_risk, prefix='credit_risk'), left_index=True, right_index=True)
df.rename(columns={'credit_risk_0':'bad_credit_risk'}, inplace=True)
df.describe().transpose()
Out[17]:
count mean std min 25% 50% 75% max
status 658.0 3.183891 0.940181 2.0 2.0 4.0 4.0 4.0
duration 658.0 20.022796 11.594667 4.0 12.0 18.0 24.0 72.0
credit_history 658.0 2.621581 1.071829 0.0 2.0 2.0 4.0 4.0
purpose 658.0 3.004559 2.790773 0.0 1.0 3.0 3.0 10.0
amount 658.0 3123.655015 2668.220540 250.0 1361.0 2249.0 3832.0 18424.0
savings 658.0 2.273556 1.617428 1.0 1.0 1.0 4.0 5.0
employment_duration 658.0 3.366261 1.173484 1.0 3.0 3.0 4.0 5.0
installment_rate 658.0 2.927052 1.121601 1.0 2.0 3.0 4.0 4.0
personal_status_sex 658.0 2.696049 0.720656 1.0 2.0 3.0 3.0 4.0
other_debtors 658.0 1.132219 0.464078 1.0 1.0 1.0 1.0 3.0
present_residence 658.0 2.703647 1.093228 1.0 2.0 2.0 4.0 4.0
property 658.0 2.180851 0.952030 1.0 1.0 2.0 3.0 4.0
age 658.0 34.670213 10.550699 19.0 27.0 33.0 40.0 74.0
other_installment_plans 658.0 2.706687 0.671660 1.0 3.0 3.0 3.0 3.0
housing 658.0 1.826748 0.378753 1.0 2.0 2.0 2.0 2.0
number_credits 658.0 1.424012 0.574311 1.0 1.0 1.0 2.0 4.0
job 658.0 2.899696 0.647416 1.0 3.0 3.0 3.0 4.0
people_liable 658.0 1.863222 0.343874 1.0 2.0 2.0 2.0 2.0
telephone 658.0 1.407295 0.491704 1.0 1.0 1.0 2.0 2.0
foreign_worker 658.0 1.966565 0.179905 1.0 2.0 2.0 2.0 2.0
credit_risk 658.0 0.784195 0.411693 0.0 1.0 1.0 1.0 1.0
status_3 658.0 0.083587 0.276977 0.0 0.0 0.0 0.0 1.0
status_4 658.0 0.550152 0.497857 0.0 0.0 1.0 1.0 1.0
housing_2 658.0 0.826748 0.378753 0.0 1.0 1.0 1.0 1.0
credit_history_1 658.0 0.033435 0.179905 0.0 0.0 0.0 0.0 1.0
credit_history_2 658.0 0.515198 0.500149 0.0 0.0 1.0 1.0 1.0
credit_history_3 658.0 0.101824 0.302646 0.0 0.0 0.0 0.0 1.0
credit_history_4 658.0 0.313070 0.464095 0.0 0.0 0.0 1.0 1.0
job_2 658.0 0.205167 0.404131 0.0 0.0 0.0 0.0 1.0
job_3 658.0 0.630699 0.482983 0.0 0.0 1.0 1.0 1.0
job_4 658.0 0.144377 0.351739 0.0 0.0 0.0 0.0 1.0
employment_duration_2 658.0 0.183891 0.387690 0.0 0.0 0.0 0.0 1.0
employment_duration_3 658.0 0.343465 0.475226 0.0 0.0 0.0 1.0 1.0
employment_duration_4 658.0 0.188450 0.391368 0.0 0.0 0.0 0.0 1.0
employment_duration_5 658.0 0.232523 0.422762 0.0 0.0 0.0 0.0 1.0
savings_2 658.0 0.124620 0.330539 0.0 0.0 0.0 0.0 1.0
savings_3 658.0 0.075988 0.265180 0.0 0.0 0.0 0.0 1.0
savings_4 658.0 0.060790 0.239127 0.0 0.0 0.0 0.0 1.0
savings_5 658.0 0.203647 0.403016 0.0 0.0 0.0 0.0 1.0
bad_credit_risk 658.0 0.215805 0.411693 0.0 0.0 0.0 0.0 1.0
credit_risk_1 658.0 0.784195 0.411693 0.0 1.0 1.0 1.0 1.0
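
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy column (the values are made up). It uses `pd.concat` instead of the index merge in the cell above, which should give the same columns; treat it as an illustration and not as the notebook's exact method.

```python
import pandas as pd

# Toy column using the same 2/3/4 coding that `status` has after the filter above.
toy = pd.DataFrame({"status": [2, 3, 4, 4]})

# drop_first=True drops the lowest level (2), which becomes the reference category.
dummies = pd.get_dummies(toy["status"], prefix="status", drop_first=True).astype(int)
encoded = pd.concat([toy, dummies], axis=1)
print(encoded)
```

A row with zeros in both `status_3` and `status_4` therefore means `status == 2`, which is why the table above has no `status_2` column.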
In [18]:
# Keeping only the dummy variables & continuous (age/amount)
del_col = dummy_col + ['duration','purpose','installment_rate','personal_status_sex','other_debtors','present_residence'
                      ,'property','other_installment_plans','number_credits','people_liable','telephone','foreign_worker'
                      ,'credit_risk','credit_risk_1']
for col in del_col:
    del df[col]
df.describe().transpose()
Out[18]:
count mean std min 25% 50% 75% max
amount 658.0 3123.655015 2668.220540 250.0 1361.0 2249.0 3832.0 18424.0
age 658.0 34.670213 10.550699 19.0 27.0 33.0 40.0 74.0
status_3 658.0 0.083587 0.276977 0.0 0.0 0.0 0.0 1.0
status_4 658.0 0.550152 0.497857 0.0 0.0 1.0 1.0 1.0
housing_2 658.0 0.826748 0.378753 0.0 1.0 1.0 1.0 1.0
credit_history_1 658.0 0.033435 0.179905 0.0 0.0 0.0 0.0 1.0
credit_history_2 658.0 0.515198 0.500149 0.0 0.0 1.0 1.0 1.0
credit_history_3 658.0 0.101824 0.302646 0.0 0.0 0.0 0.0 1.0
credit_history_4 658.0 0.313070 0.464095 0.0 0.0 0.0 1.0 1.0
job_2 658.0 0.205167 0.404131 0.0 0.0 0.0 0.0 1.0
job_3 658.0 0.630699 0.482983 0.0 0.0 1.0 1.0 1.0
job_4 658.0 0.144377 0.351739 0.0 0.0 0.0 0.0 1.0
employment_duration_2 658.0 0.183891 0.387690 0.0 0.0 0.0 0.0 1.0
employment_duration_3 658.0 0.343465 0.475226 0.0 0.0 0.0 1.0 1.0
employment_duration_4 658.0 0.188450 0.391368 0.0 0.0 0.0 0.0 1.0
employment_duration_5 658.0 0.232523 0.422762 0.0 0.0 0.0 0.0 1.0
savings_2 658.0 0.124620 0.330539 0.0 0.0 0.0 0.0 1.0
savings_3 658.0 0.075988 0.265180 0.0 0.0 0.0 0.0 1.0
savings_4 658.0 0.060790 0.239127 0.0 0.0 0.0 0.0 1.0
savings_5 658.0 0.203647 0.403016 0.0 0.0 0.0 0.0 1.0
bad_credit_risk 658.0 0.215805 0.411693 0.0 0.0 0.0 0.0 1.0
In [60]:
# Visualize the relationships between the variables
plt.figure(figsize=(14,12))
sns.heatmap(df.astype(float).corr(),linewidths=0.1,vmax=1.0, 
            square=True,  linecolor='white', annot=True)
plt.show()
In [61]:
# Check how many defaults/non-defaults
df['bad_credit_risk'].value_counts()
Out[61]:
0    516
1    142
Name: bad_credit_risk, dtype: int64

4. Logistic Regression Model

  • Import the necessary sklearn functions
  • Split the data into train/test sets
  • Fit the model on the training data
  • Predict credit risk using Logistic Regression
  • Improve model accuracy through undersampling
In [62]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X = np.array(df.drop('bad_credit_risk',axis=1)) # Features
y = np.array(df['bad_credit_risk']) # Target Variable
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=43) # 60/40 split train/test data
In [63]:
logreg = LogisticRegression(solver='lbfgs',max_iter=5000).fit(X_train, np.ravel(y_train))
print(logreg.coef_)
preds = logreg.predict_proba(X_test)
preds_df = pd.DataFrame(preds[:,1][0:5], columns = ['prob_bad_risk'])
true_df = pd.DataFrame(y_test).head()
# Compare true default status vs predicted probability of default
print(pd.concat([true_df.reset_index(drop=True),preds_df],axis=1)) 
[[ 1.25468310e-04 -2.34110396e-02 -1.64791720e-01 -1.49191627e+00
  -7.88005370e-02  1.25668557e-01 -1.61451257e-01 -1.08001838e-02
  -3.71044957e-01 -1.12112147e-01 -7.71286852e-02  2.15281663e-01
   4.23456333e-01  5.30103525e-01 -1.00767186e+00 -9.66303906e-02
   1.85112179e-01  1.18659389e-02 -9.66430311e-02 -5.46194601e-01]]
   0  prob_bad_risk
0  0       0.067163
1  0       0.175581
2  0       0.030768
3  0       0.064309
4  0       0.210747
In [64]:
# Import functions to evaluate model accuracy
from sklearn.metrics import precision_recall_fscore_support, precision_recall_curve, confusion_matrix, classification_report, roc_curve, roc_auc_score
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_bad_risk'])
preds_df['loan_status'] = preds_df['prob_bad_risk'].apply(lambda x: 1 if x > 0.5 else 0)
print("LogReg Model Predicted loan status given threshold of 0.5")
print(preds_df['loan_status'].value_counts())
target_names = ['Non-Default','Default']
print("Classification Report")
print(classification_report(y_test, preds_df['loan_status'], target_names = target_names))
LogReg Model Predicted loan status given threshold of 0.5
0    238
1     26
Name: loan_status, dtype: int64
Classification Report
              precision    recall  f1-score   support

 Non-Default       0.83      0.92      0.87       215
     Default       0.35      0.18      0.24        49

   micro avg       0.78      0.78      0.78       264
   macro avg       0.59      0.55      0.56       264
weighted avg       0.74      0.78      0.76       264
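
The figures in this report follow from four confusion-matrix counts. The counts below are not printed by the notebook; I reconstruct them from the supports (215 and 49), the 26 predicted defaults, and the rounded default recall of 0.18, so they should be read as an inference.

```python
# Counts inferred from the report above (an assumption, not notebook output):
# 49 actual defaults, 215 actual non-defaults, 26 loans predicted as default.
tp, fn = 9, 40     # defaults caught / defaults missed
fp, tn = 17, 198   # good loans flagged / good loans passed

precision_default = tp / (tp + fp)   # share of flagged loans that really default
recall_default = tp / (tp + fn)      # share of real defaults that were flagged
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_default, 2), round(recall_default, 2), round(accuracy, 2))
```

Under these counts the model misses 40 of the 49 defaults, which is the weakness the 0.78 overall accuracy hides.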

In [65]:
# Plot the roc curve to illustrate model's diagnostic ability given varying threshold
prob_default = preds[:,1]
fallout, sensitivity, thresholds = roc_curve(y_test, prob_default)
plt.plot(fallout, sensitivity, c='r')
plt.plot([0,1],[0,1],ls='--')
plt.xlabel("Fall-out (1 - Specificity)")
plt.ylabel("Sensitivity")
plt.show()
# The AUC score summarizes the ROC curve (the model's lift over random guessing) in one number, which can be used to compare competing models
auc = roc_auc_score(y_test, prob_default)
print("AUC Score: ")
print(auc)
AUC Score: 
0.6865685809207405
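
One way to read the AUC is as the probability that a randomly chosen default receives a higher score than a randomly chosen non-default. The sketch below computes it that way on toy data; `pairwise_auc` is an illustrative helper, not a scikit-learn function, and it is far slower than `roc_auc_score` on real data.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (default, non-default) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores (made up for illustration).
y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_toy, s_toy))  # 0.75: three of the four pairs are ordered correctly
```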
In [66]:
# Confusion Matrix
conf_mat = sns.heatmap(pd.DataFrame(confusion_matrix(y_test,preds_df['loan_status'])), annot=True, cmap='YlGnBu',fmt='g')
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()
In [67]:
preds_df['loan_status'] = preds_df['prob_bad_risk'].apply(lambda x: 1 if x > 0.4 else 0) # lower the threshold value
In [68]:
# After lowering default threshold
conf_mat = sns.heatmap(pd.DataFrame(confusion_matrix(y_test,preds_df['loan_status'])), annot=True, cmap='YlGnBu',fmt='g')
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()
In [69]:
# Plot precision and recall against the threshold to help choose a cut-off value
precision, recall, thresholds = precision_recall_curve(y_test, prob_default) 
plt.title("Precision-Recall vs. Threshold")
plt.plot(thresholds, precision[:-1],"b--",label='Precision')
plt.plot(thresholds, recall[:-1],"r--",label="Recall")
plt.ylabel("Precision, Recall")
plt.xlabel("Threshold")
plt.legend(loc='lower left')
plt.ylim([0,1])
plt.show()

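This logistic regression model is not effective at identifying defaults: at a threshold of 0.5 it recalls only 18% of them, and its AUC is 0.69. This may be a result of the data processing, which leaves an imbalanced target (142 defaults versus 516 non-defaults).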
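
The threshold can also be chosen numerically instead of by reading it off the plot. The sketch below picks the cut-off that maximizes the F1 score of the default class on toy data. F1 is only one possible criterion and is my assumption here; the notebook itself does not state how its thresholds were chosen, and `f1_at` is an illustrative helper.

```python
def f1_at(threshold, y_true, probs):
    """F1 score for the positive (default) class at a given cut-off."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels and predicted default probabilities (made up for illustration).
y_toy = [0, 0, 0, 1, 0, 1, 1, 1]
p_toy = [0.05, 0.20, 0.35, 0.42, 0.48, 0.55, 0.70, 0.90]

candidates = [0.3, 0.4, 0.5, 0.6]
best = max(candidates, key=lambda t: f1_at(t, y_toy, p_toy))
print(best, round(f1_at(best, y_toy, p_toy), 3))
```

In a lending setting a cost-weighted criterion may be more appropriate than F1, since a missed default usually costs more than a rejected good loan.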

In [70]:
# Undersampling to see if correcting the imbalance in target variable data will improve model
minority_class_len = len(df[df['bad_credit_risk']==1])
majority_class_indices = df[df['bad_credit_risk']==0].index
print(minority_class_len)
print(len(majority_class_indices))
142
516
In [71]:
random_majority_indices = np.random.choice(majority_class_indices, minority_class_len, replace=False)
print(len(random_majority_indices))
minority_class_indices = df[df['bad_credit_risk']==1].index
print(len(minority_class_indices))
142
142
In [72]:
undersample_indices = np.concatenate([minority_class_indices,random_majority_indices])
undersample = df.loc[undersample_indices]
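
The three cells above can be wrapped into one reusable function. This is a sketch under two assumptions of mine: the target is binary, and a fixed seed is wanted so that the sample (and therefore the later results) can be reproduced. The name `undersample_majority` is illustrative.

```python
import numpy as np
import pandas as pd


def undersample_majority(frame, target, seed=0):
    """Keep every minority row and an equally sized random sample of majority rows."""
    counts = frame[target].value_counts()
    minority_label = counts.idxmin()
    minority = frame[frame[target] == minority_label]
    majority = frame[frame[target] != minority_label]
    rng = np.random.default_rng(seed)          # seeded so the sample is repeatable
    keep = rng.choice(majority.index.to_numpy(), size=len(minority), replace=False)
    return pd.concat([minority, majority.loc[keep]])


# Toy frame (made-up values): 2 defaults, 6 non-defaults.
toy = pd.DataFrame({"amount": range(8), "bad_credit_risk": [1, 1, 0, 0, 0, 0, 0, 0]})
balanced = undersample_majority(toy, "bad_credit_risk", seed=42)
print(balanced["bad_credit_risk"].value_counts())
```

Calling it twice with the same seed returns the same rows, which the unseeded `np.random.choice` call above does not guarantee.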
In [73]:
# Re-split the new balanced dataset and fit Log Reg model to new training set
X = undersample.loc[:, df.columns!='bad_credit_risk']
y = undersample.loc[:, df.columns=='bad_credit_risk']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=99)
logreg = LogisticRegression(solver='lbfgs',max_iter=5000).fit(X_train, np.ravel(y_train))
In [74]:
preds = logreg.predict_proba(X_test)
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_bad_risk'])
preds_df['loan_status'] = preds_df['prob_bad_risk'].apply(lambda x: 1 if x > 0.44 else 0)
print(preds_df['loan_status'].value_counts())
print(classification_report(y_test, preds_df['loan_status'], target_names = target_names))
1    64
0    50
Name: loan_status, dtype: int64
              precision    recall  f1-score   support

 Non-Default       0.70      0.65      0.67        54
     Default       0.70      0.75      0.73        60

   micro avg       0.70      0.70      0.70       114
   macro avg       0.70      0.70      0.70       114
weighted avg       0.70      0.70      0.70       114

Undersampling improved the model's ability to identify defaults: recall for the default class rose from 0.18 (threshold 0.5) to 0.75 (threshold 0.44).

In [75]:
# Plot new roc curve and compare auc values
prob_default = preds_df['prob_bad_risk']
fallout, sensitivity, thresholds = roc_curve(y_test, prob_default)
plt.plot(fallout, sensitivity, c='r')
plt.plot([0,1],[0,1],ls='--')
plt.show()
auc = roc_auc_score(y_test, prob_default)
print(auc)
0.7777777777777778

The AUC rose from 0.69 to 0.78, which indicates a better model. This is supported below by the higher precision at the same threshold.

In [76]:
precision, recall, thresholds = precision_recall_curve(y_test, prob_default) 
plt.title("Precision-Recall vs. Threshold")
plt.plot(thresholds, precision[:-1],"b--",label='Precision')
plt.plot(thresholds, recall[:-1],"r--",label="Recall")
plt.ylabel("Precision, Recall")
plt.xlabel("Threshold")
plt.legend(loc='lower left')
plt.ylim([0,1])
plt.show()
In [77]:
conf_mat = sns.heatmap(pd.DataFrame(confusion_matrix(y_test,preds_df['loan_status'])), annot=True, cmap='YlGnBu',fmt='g')
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()

5. Gradient Boosted Trees Model

  • Utilize xgboost to fit a gradient boosted tree model to the undersampled dataset
  • Compare model performance between Logistic Regression (LR) and Gradient Boosted Trees (GBT)
In [78]:
import xgboost as xgb
gbt = xgb.XGBClassifier().fit(X_train,np.ravel(y_train))
gbt_preds = gbt.predict_proba(X_test)
gbt_preds_df = pd.DataFrame(gbt_preds[:,1], columns = ['prob_bad_risk'])
gbt_preds_df['loan_status'] = gbt_preds_df['prob_bad_risk'].apply(lambda x: 1 if x > 0.43 else 0)
print(gbt_preds_df['prob_bad_risk'].describe())
print(gbt_preds_df['loan_status'].value_counts())
count    114.000000
mean       0.485580
std        0.372824
min        0.001628
25%        0.085065
50%        0.465355
75%        0.868035
max        0.999326
Name: prob_bad_risk, dtype: float64
1    61
0    53
Name: loan_status, dtype: int64
In [79]:
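# Note: df['amount'] is taken in original row order, so these amounts are not the test-set loans;
# X_test['amount'].reset_index(drop=True) would align the amounts with the predictions
portfolio = pd.concat([pd.DataFrame(gbt_preds[:,1],columns=['gbt_prob_default']), pd.DataFrame(preds[:,1],columns=['lr_prob_default']) , df['amount'].reset_index(drop=True)], axis=1)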
portfolio.head()
Out[79]:
gbt_prob_default lr_prob_default amount
0 0.008281 0.112300 841
1 0.006876 0.369631 1098
2 0.940440 0.415696 3758
3 0.921755 0.405789 7582
4 0.938684 0.762825 3213
In [80]:
# Assuming Loss Given Default is 20% -> Calculate the Expected Loss for each method
portfolio['gbt_exp_loss'] = portfolio['gbt_prob_default'] * portfolio['amount'] * 0.2
portfolio['lr_exp_loss'] = portfolio['lr_prob_default'] * portfolio['amount'] * 0.2
print("GBT Expected Loss: ", np.sum(portfolio['gbt_exp_loss']))
print("LR Expected Loss: ", np.sum(portfolio['lr_exp_loss']))
GBT Expected Loss:  31375.72472964069
LR Expected Loss:  30216.3322933652
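
The expected-loss calculation is the product of probability of default, loss given default, and exposure. As a worked check, the sketch below applies it to the first portfolio row shown above; `expected_loss` is an illustrative helper, and the 20% loss given default is the same assumption the notebook makes.

```python
def expected_loss(prob_default, exposure, lgd=0.2):
    """Expected loss = probability of default * loss given default * exposure."""
    return prob_default * lgd * exposure


# First portfolio row shown above: GBT probability 0.008281 on a loan of 841 DM.
gbt_el = expected_loss(0.008281, 841)
# Same loan under the logistic regression probability of 0.112300.
lr_el = expected_loss(0.112300, 841)
print(round(gbt_el, 2), round(lr_el, 2))  # 1.39 18.89
```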
In [81]:
gbt_preds = gbt.predict(X_test)
print(classification_report(y_test, gbt_preds, target_names=target_names)) 
              precision    recall  f1-score   support

 Non-Default       0.61      0.67      0.64        54
     Default       0.67      0.62      0.64        60

   micro avg       0.64      0.64      0.64       114
   macro avg       0.64      0.64      0.64       114
weighted avg       0.64      0.64      0.64       114

In [82]:
print(gbt.get_booster().get_score(importance_type = 'weight'))
xgb.plot_importance(gbt, importance_type = 'weight')
{'Status_4': 38, 'age': 225, 'employment_duration_4': 31, 'employment_duration_5': 7, 'job_3': 30, 'amount': 362, 'Credit_History_3': 5, 'employment_duration_3': 30, 'savings_5': 14, 'employment_duration_2': 16, 'job_4': 1, 'Status_3': 2, 'savings_3': 1, 'Credit_History_4': 26, 'savings_2': 14, 'Credit_History_2': 39, 'Housing_2': 11, 'job_2': 3}
Out[82]:
<matplotlib.axes._subplots.AxesSubplot at 0x19751344d68>
In [83]:
X2 = np.array(X.drop(['savings_5','savings_3','credit_history_3','employment_duration_5'
                     ,'status_3'],axis=1)) # Remove a subset of features with low importance weights
y = undersample.loc[:, df.columns=='bad_credit_risk']
X2_train, X2_test, y_train, y_test = train_test_split(X2,y,test_size=0.4,random_state=99)
In [84]:
gbt2 = xgb.XGBClassifier().fit(X2_train,np.ravel(y_train))
gbt2_preds = gbt2.predict(X2_test)
In [85]:
print(classification_report(y_test, gbt_preds, target_names=target_names))
print(classification_report(y_test, gbt2_preds, target_names=target_names))
              precision    recall  f1-score   support

 Non-Default       0.61      0.67      0.64        54
     Default       0.67      0.62      0.64        60

   micro avg       0.64      0.64      0.64       114
   macro avg       0.64      0.64      0.64       114
weighted avg       0.64      0.64      0.64       114

              precision    recall  f1-score   support

 Non-Default       0.59      0.67      0.63        54
     Default       0.66      0.58      0.62        60

   micro avg       0.62      0.62      0.62       114
   macro avg       0.63      0.62      0.62       114
weighted avg       0.63      0.62      0.62       114

Removing the low-weighted features had a negative effect on the model (accuracy fell from 0.64 to 0.62), so I stick to the original feature set.

In [86]:
# Cross Validation
n_folds = 4
early_stop = 10 # defined but not passed to xgb.cv below, so all 500 rounds are run
params = {'objective':'binary:logistic', 'seed':43,'eval_metric':'auc'}
DTrain = xgb.DMatrix(X_train, label= y_train)
gbt2_cv_score = xgb.cv(params, DTrain, num_boost_round=500, nfold=n_folds)
print(gbt2_cv_score)
     train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
0          0.896869       0.010609       0.639899      0.019303
1          0.930804       0.005823       0.673884      0.022706
2          0.948848       0.005084       0.653413      0.017155
3          0.963190       0.005856       0.661657      0.027401
4          0.973595       0.005830       0.675792      0.023235
5          0.979467       0.005853       0.686123      0.020608
6          0.982069       0.004706       0.691751      0.021203
7          0.985044       0.006775       0.689885      0.015039
8          0.987894       0.003281       0.689873      0.009735
9          0.990538       0.005114       0.691654      0.020933
10         0.992909       0.004785       0.694259      0.017500
11         0.994212       0.003182       0.690382      0.016323
12         0.995940       0.002283       0.684699      0.013867
13         0.996493       0.002290       0.684712      0.015843
14         0.998338       0.001227       0.691903      0.016769
15         0.999076       0.000707       0.700229      0.024277
16         0.999446       0.000440       0.691413      0.027404
17         0.999691       0.000322       0.689859      0.017531
18         0.999630       0.000370       0.687661      0.019146
19         0.999815       0.000206       0.683256      0.020719
20         0.999876       0.000215       0.681559      0.018916
21         0.999815       0.000206       0.681606      0.020612
22         0.999876       0.000215       0.683229      0.020079
23         0.999938       0.000107       0.685408      0.019170
24         0.999938       0.000107       0.687681      0.022150
25         0.999938       0.000107       0.686524      0.020742
26         1.000000       0.000000       0.686572      0.024044
27         1.000000       0.000000       0.686572      0.021311
28         1.000000       0.000000       0.689393      0.022940
29         1.000000       0.000000       0.686018      0.020789
..              ...            ...            ...           ...
470        1.000000       0.000000       0.696531      0.016703
471        1.000000       0.000000       0.696511      0.017051
472        1.000000       0.000000       0.695395      0.016891
473        1.000000       0.000000       0.695943      0.016184
474        1.000000       0.000000       0.695943      0.016184
475        1.000000       0.000000       0.697648      0.016789
476        1.000000       0.000000       0.698216      0.017645
477        1.000000       0.000000       0.697648      0.016789
478        1.000000       0.000000       0.695375      0.015330
479        1.000000       0.000000       0.696511      0.017051
480        1.000000       0.000000       0.696511      0.016036
481        1.000000       0.000000       0.698216      0.017645
482        1.000000       0.000000       0.697080      0.016892
483        1.000000       0.000000       0.697079      0.017930
484        1.000000       0.000000       0.697079      0.017930
485        1.000000       0.000000       0.697628      0.018439
486        1.000000       0.000000       0.697059      0.016387
487        1.000000       0.000000       0.698176      0.016662
488        1.000000       0.000000       0.697628      0.018439
489        1.000000       0.000000       0.696511      0.019323
490        1.000000       0.000000       0.697060      0.018740
491        1.000000       0.000000       0.697080      0.019048
492        1.000000       0.000000       0.697628      0.018439
493        1.000000       0.000000       0.695943      0.018422
494        1.000000       0.000000       0.697060      0.018740
495        1.000000       0.000000       0.695943      0.018422
496        1.000000       0.000000       0.695943      0.019644
497        1.000000       0.000000       0.696511      0.019323
498        1.000000       0.000000       0.696511      0.019323
499        1.000000       0.000000       0.697080      0.019048

[500 rows x 4 columns]
In [87]:
plt.plot(gbt2_cv_score['test-auc-mean'])
plt.title('Test AUC Score Over 500 Iterations')
plt.xlabel('Iteration Number')
plt.ylabel('Test AUC Score')
plt.show()

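Overfitting occurs very quickly: the training AUC reaches 1.0 by round 26, while the test AUC hovers between roughly 0.68 and 0.70.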
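
A simple way to use the cross-validation table is to look for the boosting round with the highest mean test AUC. The sketch below does this for rounds 10 to 19 only, with values copied from the printed table. Rounds 30 to 469 are not shown in the output, so this finds the best round among the printed rows and not necessarily the overall best.

```python
# Test AUC means for rounds 10 to 19, copied from the printed table above.
test_auc = {
    10: 0.694259, 11: 0.690382, 12: 0.684699, 13: 0.684712, 14: 0.691903,
    15: 0.700229, 16: 0.691413, 17: 0.689859, 18: 0.687661, 19: 0.683256,
}

best_round = max(test_auc, key=test_auc.get)
print(best_round, test_auc[best_round])  # 15 0.700229
```

`xgb.cv` also accepts an `early_stopping_rounds` argument that stops training once the test metric has not improved for that many rounds. I have not rerun the notebook with it, so how many rounds it would keep here is untested.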

In [88]:
from sklearn.model_selection import cross_val_score
learning_rates = [1,0.5,0.1,0.05,0.01,0.005]
cv_score_avg =[]
for num in learning_rates:
    # cross_val_score clones and fits the estimator on each fold, so no prior fit is needed
    gbt2 = xgb.XGBClassifier(learning_rate=num, max_depth=7)
    cv_scores = cross_val_score(gbt2, X_train, np.ravel(y_train), cv=4)
    cv_score_avg.append(cv_scores.mean())
plt.plot(cv_score_avg)
plt.xticks(np.arange(6), learning_rates)
plt.xlabel('Learning Rate')
plt.ylabel('Average Accuracy')
plt.show()
In [89]:
# Refit model with new learning rate
gbt = xgb.XGBClassifier(learning_rate=0.05,max_depth=7).fit(X_train,np.ravel(y_train))
gbt_preds = gbt.predict_proba(X_test)
gbt_preds_df = pd.DataFrame(gbt_preds[:,1], columns = ['prob_bad_risk'])
gbt_preds_df['loan_status'] = gbt_preds_df['prob_bad_risk'].apply(lambda x: 1 if x > 0.43 else 0)
print(gbt_preds_df['prob_bad_risk'].describe())
print(gbt_preds_df['loan_status'].value_counts())
count    114.000000
mean       0.512861
std        0.293670
min        0.046179
25%        0.254719
50%        0.496205
75%        0.768054
max        0.971870
Name: prob_bad_risk, dtype: float64
1    69
0    45
Name: loan_status, dtype: int64

6. Model Evaluation and Method Comparison

  • Run diagnostics on the different models
  • Compare the performance metrics between the two ML methods
  • Calculate the maximum value from all model configurations in Deutsche Marks
In [90]:
# Model Evaluation and Method Comparison
# Compare LogReg and Gradient Boosted Tree Classification reports
print(classification_report(y_test, preds_df['loan_status'], target_names=target_names))
print(classification_report(y_test, gbt_preds_df['loan_status'], target_names=target_names))
              precision    recall  f1-score   support

 Non-Default       0.70      0.65      0.67        54
     Default       0.70      0.75      0.73        60

   micro avg       0.70      0.70      0.70       114
   macro avg       0.70      0.70      0.70       114
weighted avg       0.70      0.70      0.70       114

              precision    recall  f1-score   support

 Non-Default       0.67      0.56      0.61        54
     Default       0.65      0.75      0.70        60

   micro avg       0.66      0.66      0.66       114
   macro avg       0.66      0.65      0.65       114
weighted avg       0.66      0.66      0.65       114

In [91]:
# ROC chart components
fallout_lr, sensitivity_lr, thresholds_lr = roc_curve(y_test, preds[:,1])
fallout_gbt, sensitivity_gbt, thresholds_gbt = roc_curve(y_test, gbt_preds_df['prob_bad_risk'])

# ROC Chart with both
plt.plot(fallout_lr, sensitivity_lr, color = 'blue', label='%s' % 'Logistic Regression')
plt.plot(fallout_gbt, sensitivity_gbt, color = 'green', label='%s' % 'GBT')
plt.plot([0, 1], [0, 1], linestyle='--', label='%s' % 'Random Prediction')
plt.title("ROC Chart for LR and GBT on the Probability of Default")
plt.xlabel('Fall-out')
plt.ylabel('Sensitivity')
plt.legend()
plt.show()

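The lift is greater for the Logistic Regression model, i.e. its ROC curve sits further above the random-prediction line, which implies that it separates defaults from non-defaults more accurately.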

In [92]:
# Print the logistic regression AUC with formatting
print("Logistic Regression AUC Score: %0.2f" % roc_auc_score(y_test, preds[:,1]))

# Print the gradient boosted tree AUC with formatting
print("Gradient Boosted Tree AUC Score: %0.2f" % roc_auc_score(y_test, gbt_preds_df['prob_bad_risk']))
Logistic Regression AUC Score: 0.78
Gradient Boosted Tree AUC Score: 0.70
In [93]:
from sklearn.calibration import calibration_curve
fraction_of_positives_lr, mean_predicted_value_lr = calibration_curve(y_test, preds[:,1], n_bins=10)
fraction_of_positives_gbt, mean_predicted_value_gbt = calibration_curve(y_test, gbt_preds_df['prob_bad_risk'], n_bins=10)
In [94]:
# Create the calibration curve plot with the guideline
plt.plot([0, 1], [0, 1], 'k:', label='Perfectly calibrated')    
plt.plot(mean_predicted_value_lr, fraction_of_positives_lr,
         's-', label='Logistic Regression')
plt.plot(mean_predicted_value_gbt, fraction_of_positives_gbt,
         's-', label='Gradient Boosted Tree')
plt.ylabel('Fraction of positives')
plt.xlabel('Average Predicted Probability')
plt.legend()
plt.title('Calibration Curve')
plt.show()

The Logistic Regression model tracks the perfect-calibration line much more closely than the GBT model, especially at lower probabilities. Note that it is more costly for a model to be above the calibration line than below it. Above the line, the observed fraction of defaults is greater than the model's predicted probability of default, so we underestimate risk and approve loans that will be defaulted on. Below the line, the model overestimates risk: we approve too few loans and miss out on profit, but we do not take on any extra risk.
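The "above or below the line" reading can be made concrete with a small binning sketch. This is a simplified stand-in for the kind of equal-width binning behind a calibration curve, written in plain Python with made-up data; the helper name calibration_bins is hypothetical and it is not scikit-learn's exact implementation.

```python
def calibration_bins(y_true, probs, n_bins=10):
    """Return (mean predicted probability, observed fraction of positives) per non-empty bin."""
    sums = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of probs, positives, count]
    for y, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)   # equal-width bins on [0, 1]
        sums[idx][0] += p
        sums[idx][1] += y
        sums[idx][2] += 1
    return [(s / c, pos / c) for s, pos, c in sums if c > 0]

# Illustrative data only (1 = default, 0 = non-default)
y_demo = [0, 0, 1, 0, 1, 1, 1, 1]
probs_demo = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

for mean_pred, frac_pos in calibration_bins(y_demo, probs_demo, n_bins=2):
    # Small tolerance so floating-point noise does not flip the comparison
    side = "above the line (risk underestimated)" if frac_pos > mean_pred + 1e-9 else "on or below the line"
    print(f"predicted {mean_pred:.2f}, observed {frac_pos:.2f}: {side}")
```

In this toy data the upper bin predicts about 0.75 on average while every loan in it defaults, so that bin lies above the diagonal.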

In [95]:
test_pred_df_gbt = pd.concat([y_test.reset_index(drop=True), gbt_preds_df], axis=1)
test_pred_df_lr = pd.concat([y_test.reset_index(drop=True), preds_df], axis=1)
In [96]:
# Calculate the bad rate for each model: the percentage of accepted loans that defaulted
accepted_loans_gbt = test_pred_df_gbt[test_pred_df_gbt['loan_status']==0]
accepted_loans_lr = test_pred_df_lr[test_pred_df_lr['loan_status']==0]
print("GBT Model Bad Rate:")
print(np.sum(accepted_loans_gbt['bad_credit_risk']) / accepted_loans_gbt['bad_credit_risk'].count())
print("LR Model Bad Rate:")
print(np.sum(accepted_loans_lr['bad_credit_risk']) / accepted_loans_lr['bad_credit_risk'].count())
GBT Model Bad Rate:
0.3333333333333333
LR Model Bad Rate:
0.3

This tells us that 30% of the loans the Logistic Regression model approves on the test set are defaulted on. This is not very good, but it is better than the GBT model's 33% bad rate, so we continue with just the LR model.
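As a small worked example of the bad-rate definition (defaults among accepted loans divided by the number of accepted loans), here is a sketch in plain Python. The values are illustrative only and the helper name bad_rate is hypothetical.

```python
def bad_rate(actual_default, predicted_status):
    """Share of accepted loans (predicted_status == 0) that actually defaulted."""
    accepted = [a for a, s in zip(actual_default, predicted_status) if s == 0]
    if not accepted:
        raise ValueError("no loans were accepted")
    return sum(accepted) / len(accepted)

# Illustrative values only: 1 = default / predicted default, 0 = otherwise
actual = [0, 0, 1, 0, 1]
status = [0, 0, 0, 0, 1]
print(bad_rate(actual, status))  # 0.25 (one default among four accepted loans)
```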

In [97]:
test_pred_df_lr.head()
Out[97]:
bad_credit_risk prob_bad_risk loan_status
0 0 0.112300 0
1 0 0.369631 0
2 1 0.415696 0
3 0 0.405789 0
4 1 0.762825 1
In [98]:
accept_rates = [1.0,0.95,0.9,0.85,0.8,0.75,0.7,0.65,0.6,0.55,0.5,0.45,0.4,0.35,0.3,0.25,0.2,0.15,0.1,0.05]
thresholds = []
bad_rates = []
num_accepted = []
# Populate the lists for the strategy table with a for loop
for rate in accept_rates:
    # Calculate the threshold for the acceptance rate
    thresh = np.quantile(preds_df['prob_bad_risk'], rate).round(3)
    # Add the threshold value to the list of thresholds
    thresholds.append(thresh)
    # Reassign the loan_status value using the threshold (1 = rejected, 0 = accepted)
    test_pred_df_lr['loan_status'] = test_pred_df_lr['prob_bad_risk'].apply(lambda x: 1 if x > thresh else 0)
    # Create a set of accepted loans using this acceptance rate
    accepted_loans_lr = test_pred_df_lr[test_pred_df_lr['loan_status'] == 0]
    # Calculate and append the number of accepted loans for the chosen threshold
    # (strict '<' here, while loan_status above accepts x <= thresh, so ties at the threshold are not counted)
    num_accepted.append(len(test_pred_df_lr[test_pred_df_lr['prob_bad_risk'] < thresh]))
    # Calculate and append the bad rate: defaults among accepted loans / number of accepted loans
    bad_rates.append((np.sum(accepted_loans_lr['bad_credit_risk']) / len(accepted_loans_lr)).round(3))
In [99]:
# Reload the raw data to compute the average loan amount
df1 = pd.read_csv("C:/Users/camer/OneDrive/Desktop/Python_Fin_Projects/SouthGermanCredit.asc",sep=' ')
df1.columns = list1
# Drop customers with status == 1 and housing == 3
df1.drop(df1[df1.status == 1].index, inplace=True)
df1.drop(df1[df1.housing == 3].index, inplace=True)
avg_loan_amnt = np.mean(df1['amount'])

# Create a data frame of the strategy table
strat_df = pd.DataFrame(list(zip(accept_rates, thresholds, bad_rates, num_accepted)), 
                        columns = ['Acceptance Rate','Threshold','Bad Rate','# of Loans Accepted'])
strat_df['Avg Loan Amount'] = avg_loan_amnt
# Estimated value = value of loans repaid minus value of loans defaulted on
strat_df['Estimated Value'] = ((strat_df['# of Loans Accepted'] * (1 - strat_df['Bad Rate'])) * strat_df['Avg Loan Amount']) - (strat_df['# of Loans Accepted'] * strat_df['Bad Rate'] * strat_df['Avg Loan Amount'])

# Print the entire table
print(strat_df)
    Acceptance Rate  Threshold  Bad Rate  # of Loans Accepted  Avg Loan Amount  Estimated Value
0              1.00      0.924     0.522                  113      3123.655015    -15530.812736
1              0.95      0.805     0.500                  108      3123.655015         0.000000
2              0.90      0.748     0.471                  102      3123.655015     18479.543070
3              0.85      0.696     0.443                   97      3123.655015     34541.377158
4              0.80      0.667     0.424                   92      3123.655015     43681.191733
5              0.75      0.635     0.424                   85      3123.655015     40357.622796
6              0.70      0.613     0.425                   80      3123.655015     37483.860182
7              0.65      0.569     0.392                   74      3123.655015     49928.501763
8              0.60      0.544     0.397                   68      3123.655015     43756.159453
9              0.55      0.523     0.371                   62      3123.655015     49965.985623
10             0.50      0.486     0.351                   57      3123.655015     53058.404088
11             0.45      0.445     0.294                   51      3123.655015     65634.239179
12             0.40      0.408     0.261                   46      3123.655015     68682.926474
13             0.35      0.370     0.275                   40      3123.655015     56225.790274
14             0.30      0.339     0.229                   35      3123.655015     59255.735638
15             0.25      0.331     0.214                   28      3123.655015     50028.458723
16             0.20      0.268     0.130                   23      3123.655015     53164.608359
17             0.15      0.214     0.111                   18      3123.655015     43743.664833
18             0.10      0.165     0.083                   12      3123.655015     31261.539392
19             0.05      0.116     0.167                    6      3123.655015     12482.125441
In [100]:
# Plot the strategy curve
plt.plot(strat_df['Acceptance Rate'], strat_df['Bad Rate'])
plt.xlabel('Acceptance Rate')
plt.ylabel('Bad Rate')
plt.title('Acceptance and Bad Rates')
# Use the current axes; plt.axes() would add a new, empty axes on recent matplotlib versions
plt.gca().yaxis.grid()
plt.gca().xaxis.grid()
plt.show()
In [101]:
# Create a line plot of estimated value
plt.plot(strat_df['Acceptance Rate'],strat_df['Estimated Value'])
plt.title('Estimated Value by Acceptance Rate')
plt.xlabel('Acceptance Rate')
plt.ylabel('Estimated Value')
plt.gca().yaxis.grid()
plt.show()
In [102]:
# Print the row with the max estimated value
print(strat_df.loc[strat_df['Estimated Value'] == np.max(strat_df['Estimated Value'])])
    Acceptance Rate  Threshold  Bad Rate  # of Loans Accepted  Avg Loan Amount  Estimated Value
12              0.4      0.408     0.261                   46      3123.655015     68682.926474

This shows us the most profitable configuration of the loan model. We should use a 40% acceptance rate, which implies a default-risk threshold of 0.408 and means that 26% of the accepted loans will default. This produces an estimated value of 68,683 DM.
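The estimated value of the selected row can be reproduced by hand. A minimal sketch, assuming (as the strategy table's formula does) that every accepted loan has the average loan amount, that a repaid loan is worth its full amount, and that a defaulted loan loses its full amount. The function name estimated_value is hypothetical.

```python
def estimated_value(num_accepted, bad_rate, avg_loan_amount):
    """Value of repaid loans minus value of defaulted loans, at the average loan amount."""
    repaid = num_accepted * (1 - bad_rate) * avg_loan_amount
    defaulted = num_accepted * bad_rate * avg_loan_amount
    return repaid - defaulted

# Row 12 of the strategy table: 46 accepted loans, bad rate 0.261, average amount 3123.655015 DM
value = estimated_value(46, 0.261, 3123.655015)
print(round(value, 2))  # about 68682.93, in line with the table
```

Under this formula the value is zero at a 50% bad rate, which is why the 0.95 acceptance-rate row shows an estimated value of 0.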