In this project I use a modified version of a popular credit dataset, South German Credit, sourced from the UCI Machine Learning Repository. The dataset contains 1,000 observations with 20 predictor variables. I use scikit-learn and XGBoost to build and evaluate two methods (Logistic Regression and Gradient Boosted Decision Trees) for predicting loan defaults. The attribute coding in this dataset makes some of the predictor variables difficult to work with or irrelevant, so I do not use all 20 of them.
The predictor variables used in my models are:
amount : This is a continuous variable representing the loan amount in Deutsche Marks
age : This is the age of the customer applying for the loan
status : This is a categorical variable representing the status of the customer's checking account.
- When status = 1: The customer does not have a checking account
- status = 2: The customer has a checking account balance less than 0 DM
- status = 3: The customer has a checking account balance between 0 and 200 DM
- status = 4: The customer has a checking account balance greater than 200 DM
housing : This is a categorical variable representing the customer's housing situation.
- When housing = 1: The customer _rents_ their primary residence
- housing = 2: The customer _owns_ their primary residence
- housing = 3: The customer neither pays rent on nor owns their primary residence
credit_history : This is a categorical variable representing the customer's credit history.
- When credit_history = 0: The customer has a history of delayed payments
- credit_history = 1: The customer has a critical account/other credits elsewhere
- credit_history = 2: The customer has not taken credit in the past, or all credit has been paid back
- credit_history = 3: The customer's existing credits have been paid back duly until now
- credit_history = 4: The customer has paid back all previous credits on time with this bank
job : This is a categorical variable representing the type of job held by the customer.
- When job = 1: The customer is unemployed or an unskilled non-resident
- job = 2: The customer is an unskilled resident
- job = 3: The customer is a skilled employee/official
- job = 4: The customer is a manager/self-employed/highly qualified employee
employment_duration : This is a categorical variable representing the customer's employment duration.
- When employment_duration = 1: The customer is unemployed
- employment_duration = 2: The customer has been employed for less than 1 year
- employment_duration = 3: The customer has been employed between 1 and 4 years
- employment_duration = 4: The customer has been employed between 4 and 7 years
- employment_duration = 5: The customer has been employed for more than 7 years
savings : This is a categorical variable representing the customer's savings account balance.
- When savings = 1: The customer does not have a savings account (or it is unknown)
- savings = 2: The customer has less than 100DM in their savings
- savings = 3: The customer has between 100DM and 500DM in their savings
- savings = 4: The customer has between 500DM and 1000DM in their savings
- savings = 5: The customer has more than 1000DM in their savings
credit_risk : This is a binary variable representing whether or not the customer is considered a good risk.
- When credit_risk = 0: The customer is a bad credit risk (default)
- credit_risk = 1: The customer is a good credit risk (non-default)
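The code tables above can be kept as plain dictionaries so that integer codes can be turned back into readable labels when inspecting individual rows. The sketch below is illustrative only: `STATUS_LABELS`, `HOUSING_LABELS`, and `describe_customer` are hypothetical names that do not appear in the rest of this notebook, and the labels are copied from the descriptions above.

```python
# Hypothetical lookup tables built from the code descriptions above.
STATUS_LABELS = {
    1: "no checking account",
    2: "checking balance < 0 DM",
    3: "checking balance between 0 and 200 DM",
    4: "checking balance > 200 DM",
}
HOUSING_LABELS = {
    1: "rents primary residence",
    2: "owns primary residence",
    3: "neither rents nor owns primary residence",
}


def describe_customer(status, housing):
    """Return a readable summary of one customer's coded attributes."""
    return f"{STATUS_LABELS[status]}; {HOUSING_LABELS[housing]}"


example_summary = describe_customer(status=2, housing=1)
print(example_summary)
```

The same pattern would extend to the other categorical variables if readable labels are wanted in plots or tables.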
# Importing necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import warnings
warnings.filterwarnings("ignore")
# Importing the data
df = pd.read_csv("C:/Users/camer/OneDrive/Desktop/Python_Fin_Projects/SouthGermanCredit.asc",sep=' ')
# Begin inspecting the data
print(df.dtypes)
df.head()
# All the data is represented as int datatype but headers are in German - I will convert to English using guide on UCI website
list1 = ['status','duration','credit_history','purpose','amount','savings','employment_duration','installment_rate',
'personal_status_sex','other_debtors','present_residence','property','age','other_installment_plans'
,'housing','number_credits','job','people_liable','telephone','foreign_worker','credit_risk']
df.columns = list1
df.head()
# Checking for null values - looks like there are none
sns.heatmap(df.isnull(),cbar=False)
# Plot the distribution of loan amounts
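sns.histplot(df.amount, bins='auto', color='b', kde=True)  # distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement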
# Visualize distribution of loan amount vs age
sns.jointplot(x='amount',y='age',data=df)
sns.histplot(df.age, bins='auto', kde=True)  # distplot is deprecated since seaborn 0.11
# Create a cross table of housing situation, credit risk, and current account balance (status)
print(pd.crosstab(df['housing'], df['credit_risk'], values=df['status'], aggfunc='mean'))
print(df['status'].value_counts())
sns.countplot(x='status',hue='credit_risk',data=df)
# Modify the features so they are all binary
df.drop(df[df.status == 1].index, inplace=True) # Analysing only people with checking accounts
df.drop(df[df.housing == 3].index, inplace=True) # Analysing only people who rent or own a property.
dummy_col = ['status','housing','credit_history','job','employment_duration','savings']
for col in dummy_col:
    df = df.merge(pd.get_dummies(df[col], prefix=col, drop_first=True), left_index=True, right_index=True)
df = df.merge(pd.get_dummies(df.credit_risk, prefix='credit_risk'), left_index=True, right_index=True)
df.rename(columns={'credit_risk_0':'bad_credit_risk'}, inplace=True)
df.describe().transpose()
# Keeping only the dummy variables & continuous (age/amount)
del_col = dummy_col + ['duration','purpose','installment_rate','personal_status_sex','other_debtors','present_residence'
,'property','other_installment_plans','number_credits','people_liable','telephone','foreign_worker'
,'credit_risk','credit_risk_1']
for col in del_col:
    del df[col]
df.describe().transpose()
# Visualize the relationships between the variables
plt.figure(figsize=(14,12))
sns.heatmap(df.astype(float).corr(),linewidths=0.1,vmax=1.0,
square=True, linecolor='white', annot=True)
plt.show()
# Check how many defaults/non-defaults
df['bad_credit_risk'].value_counts()
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X = np.array(df.drop('bad_credit_risk',axis=1)) # Features
y = np.array(df['bad_credit_risk']) # Target Variable
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=43) # 60/40 split train/test data
logreg = LogisticRegression(solver='lbfgs',max_iter=5000).fit(X_train, np.ravel(y_train))
print(logreg.coef_)
preds = logreg.predict_proba(X_test)
preds_df = pd.DataFrame(preds[:,1][0:5], columns = ['prob_bad_risk'])
true_df = pd.DataFrame(y_test).head()
# Compare true default status vs predicted probability of default
print(pd.concat([true_df.reset_index(drop=True),preds_df],axis=1))
# Import functions to evaluate model accuracy
from sklearn.metrics import precision_recall_fscore_support, precision_recall_curve, confusion_matrix, classification_report, roc_curve, roc_auc_score
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_bad_risk'])
preds_df['loan_status'] = preds_df['prob_bad_risk'].apply(lambda x: 1 if x > 0.5 else 0)
print("LogReg Model Predicted loan status given threshold of 0.5")
print(preds_df['loan_status'].value_counts())
target_names = ['Non-Default','Default']
print("Classification Report")
print(classification_report(y_test, preds_df['loan_status'], target_names = target_names))
# Plot the roc curve to illustrate model's diagnostic ability given varying threshold
prob_default = preds[:,1]
fallout, sensitivity, thresholds = roc_curve(y_test, prob_default)
plt.plot(fallout, sensitivity, c='r')
plt.plot([0,1],[0,1],ls='--')
plt.xlabel("Fall-out (1 - Specificity)")  # roc_curve returns the false positive rate on the x-axis
plt.ylabel("Sensitivity")
plt.show()
# AUC summarizes how well the model ranks defaults above non-defaults across all thresholds; it can be used to compare competing models
auc = roc_auc_score(y_test, prob_default)
print("AUC Score: ")
print(auc)
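One way to read the AUC: it is the probability that a randomly chosen default receives a higher predicted probability than a randomly chosen non-default, with ties counted as half. The sketch below computes that quantity directly on made-up labels and scores. `pairwise_auc` is a hypothetical helper written for illustration, and the numbers are toy values, not results from this dataset.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (default, non-default) pairs that are ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # default scored above non-default
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


auc_toy_labels = [0, 0, 1, 1]
auc_toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(auc_toy_labels, auc_toy_scores)
print(toy_auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the diagonal on the ROC chart serves as the baseline.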
# Confusion Matrix
conf_mat = sns.heatmap(pd.DataFrame(confusion_matrix(y_test,preds_df['loan_status'])), annot=True, cmap='YlGnBu',fmt='g')
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()
preds_df['loan_status'] = preds_df['prob_bad_risk'].apply(lambda x: 1 if x > 0.4 else 0) # lower the threshold value
# After lowering default threshold
conf_mat = sns.heatmap(pd.DataFrame(confusion_matrix(y_test,preds_df['loan_status'])), annot=True, cmap='YlGnBu',fmt='g')
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()
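Lowering the threshold flags more loans as defaults, which trades false negatives for false positives. The sketch below shows that trade-off on a handful of made-up labels and probabilities. `confusion_counts` is a hypothetical helper and the values are illustrative only; it uses the same rule as above (flag a default when the probability exceeds the threshold).

```python
def confusion_counts(labels, probs, threshold):
    """Return (tn, fp, fn, tp), predicting default when prob > threshold."""
    tn = fp = fn = tp = 0
    for y, p in zip(labels, probs):
        pred = 1 if p > threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1 and pred == 0:
            fn += 1
        elif y == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


cm_toy_labels = [0, 0, 0, 1, 1, 1]
cm_toy_probs = [0.10, 0.30, 0.45, 0.42, 0.55, 0.90]
counts_at_50 = confusion_counts(cm_toy_labels, cm_toy_probs, 0.5)
counts_at_40 = confusion_counts(cm_toy_labels, cm_toy_probs, 0.4)
print("threshold 0.5:", counts_at_50)
print("threshold 0.4:", counts_at_40)
```

In this toy case the lower threshold catches the missed default at the cost of rejecting one good customer.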
# Plot precision and recall against the threshold to help choose a cutoff
precision, recall, thresholds = precision_recall_curve(y_test, prob_default)
plt.title("Precision-Recall vs. Threshold")
plt.plot(thresholds, precision[:-1],"b--",label='Precision')
plt.plot(thresholds, recall[:-1],"r--",label="Recall")
plt.ylabel("Precision, Recall")
plt.xlabel("Threshold")
plt.legend(loc='lower left')
plt.ylim([0,1])
plt.show()
# Undersampling to see if correcting the imbalance in target variable data will improve model
minority_class_len = len(df[df['bad_credit_risk']==1])
majority_class_indices = df[df['bad_credit_risk']==0].index
print(minority_class_len)
print(len(majority_class_indices))
random_majority_indices = np.random.choice(majority_class_indices, minority_class_len, replace=False)
print(len(random_majority_indices))
minority_class_indices = df[df['bad_credit_risk']==1].index
print(len(minority_class_indices))
undersample_indices = np.concatenate([minority_class_indices,random_majority_indices])
undersample = df.loc[undersample_indices]
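The random draw above is not seeded, so the undersampled set (and every metric computed from it) can change from run to run. The sketch below shows the same idea with a fixed seed so the sample is repeatable. It uses the standard library's `random` module rather than NumPy, `undersample_indices` is a hypothetical helper, and it assumes class 1 is the smaller class, as it is for `bad_credit_risk` here.

```python
import random


def undersample_indices(labels, seed=0):
    """Keep every minority (1) row plus an equal-sized random sample of majority (0) rows."""
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    kept_majority = rng.sample(majority, len(minority))  # sampling without replacement
    return sorted(minority + kept_majority)


toy_target = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
balanced_idx = undersample_indices(toy_target, seed=42)
balanced_target = [toy_target[i] for i in balanced_idx]
print(balanced_idx)
print(balanced_target)
```

With NumPy, a comparable effect could be had by drawing from a seeded generator instead of the global random state.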
# Re-split the new balanced dataset and fit Log Reg model to new training set
X = undersample.loc[:, df.columns!='bad_credit_risk']
y = undersample.loc[:, df.columns=='bad_credit_risk']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=99)
logreg = LogisticRegression(solver='lbfgs',max_iter=5000).fit(X_train, np.ravel(y_train))
preds = logreg.predict_proba(X_test)
preds_df = pd.DataFrame(preds[:,1], columns = ['prob_bad_risk'])
preds_df['loan_status'] = preds_df['prob_bad_risk'].apply(lambda x: 1 if x > 0.44 else 0)
print(preds_df['loan_status'].value_counts())
print(classification_report(y_test, preds_df['loan_status'], target_names = target_names))
# Plot new roc curve and compare auc values
prob_default = preds_df['prob_bad_risk']
fallout, sensitivity, thresholds = roc_curve(y_test, prob_default)
plt.plot(fallout, sensitivity, c='r')
plt.plot([0,1],[0,1],ls='--')
plt.show()
auc = roc_auc_score(y_test, prob_default)
print(auc)
precision, recall, thresholds = precision_recall_curve(y_test, prob_default)
plt.title("Precision-Recall vs. Threshold")
plt.plot(thresholds, precision[:-1],"b--",label='Precision')
plt.plot(thresholds, recall[:-1],"r--",label="Recall")
plt.ylabel("Precision, Recall")
plt.xlabel("Threshold")
plt.legend(loc='lower left')
plt.ylim([0,1])
plt.show()
conf_mat = sns.heatmap(pd.DataFrame(confusion_matrix(y_test,preds_df['loan_status'])), annot=True, cmap='YlGnBu',fmt='g')
plt.xlabel("Predicted Class")
plt.ylabel("Actual Class")
plt.show()
import xgboost as xgb
gbt = xgb.XGBClassifier().fit(X_train,np.ravel(y_train))
gbt_preds = gbt.predict_proba(X_test)
gbt_preds_df = pd.DataFrame(gbt_preds[:,1], columns = ['prob_bad_risk'])
gbt_preds_df['loan_status'] = gbt_preds_df['prob_bad_risk'].apply(lambda x: 1 if x > 0.43 else 0)
print(gbt_preds_df['prob_bad_risk'].describe())
print(gbt_preds_df['loan_status'].value_counts())
# Use the loan amounts of the test rows so each amount lines up with its predicted probability
portfolio = pd.concat([pd.DataFrame(gbt_preds[:,1],columns=['gbt_prob_default']), pd.DataFrame(preds[:,1],columns=['lr_prob_default']) , X_test['amount'].reset_index(drop=True)], axis=1)
portfolio.head()
# Assuming Loss Given Default is 20% -> Calculate the Expected Loss for each method
portfolio['gbt_exp_loss'] = portfolio['gbt_prob_default'] * portfolio['amount'] * 0.2
portfolio['lr_exp_loss'] = portfolio['lr_prob_default'] * portfolio['amount'] * 0.2
print("GBT Expected Loss: ", np.sum(portfolio['gbt_exp_loss']))
print("LR Expected Loss: ", np.sum(portfolio['lr_exp_loss']))
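The expected loss used above is the product of three terms per loan: probability of default, loss given default, and the exposure (here the loan amount). The sketch below works through that arithmetic on two made-up loans. `expected_loss` is a hypothetical helper, the 20% loss given default is the same assumption stated above, and the probabilities and amounts are illustrative.

```python
import math


def expected_loss(prob_default, amount, loss_given_default=0.2):
    """Expected loss for one loan: PD x exposure x LGD."""
    return prob_default * amount * loss_given_default


toy_probs = [0.1, 0.5]          # made-up probabilities of default
toy_amounts = [1000.0, 2000.0]  # made-up loan amounts in DM
toy_losses = [expected_loss(p, a) for p, a in zip(toy_probs, toy_amounts)]
toy_total_loss = sum(toy_losses)
print([round(x, 2) for x in toy_losses], round(toy_total_loss, 2))
```

Because each loan's loss scales with its own amount, the probabilities and amounts must refer to the same rows for the portfolio total to be meaningful.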
gbt_preds = gbt.predict(X_test)
print(classification_report(y_test, gbt_preds, target_names=target_names))
print(gbt.get_booster().get_score(importance_type = 'weight'))
xgb.plot_importance(gbt, importance_type = 'weight')
X2 = np.array(X.drop(['savings_5','savings_3','credit_history_3','employment_duration_5'
,'status_3'],axis=1)) # Remove features w/ low weighting (<=5); names must match the lowercase dummy columns
y = undersample.loc[:, df.columns=='bad_credit_risk']
X2_train, X2_test, y_train, y_test = train_test_split(X2,y,test_size=0.4,random_state=99)
gbt2 = xgb.XGBClassifier().fit(X2_train,np.ravel(y_train))
gbt2_preds = gbt2.predict(X2_test)
print(classification_report(y_test, gbt_preds, target_names=target_names))
print(classification_report(y_test, gbt2_preds, target_names=target_names))
# Cross Validation
n_folds = 4
early_stop = 10
params = {'objective':'binary:logistic', 'seed':43,'eval_metric':'auc'}
DTrain = xgb.DMatrix(X_train, label= y_train)
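gbt2_cv_score = xgb.cv(params, DTrain, num_boost_round=500, nfold=n_folds, early_stopping_rounds=early_stop)  # stop once test AUC has not improved for early_stop rounds
print(gbt2_cv_score)
plt.plot(gbt2_cv_score['test-auc-mean'])
plt.title('Test AUC Score by Boosting Round (max 500)')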
plt.xlabel('Iteration Number')
plt.ylabel('Test AUC Score')
plt.show()
from sklearn.model_selection import cross_val_score
learning_rates = [1,0.5,0.1,0.05,0.01,0.005]
cv_score_avg =[]
for num in learning_rates:
    # cross_val_score clones and fits the estimator on each fold, so no prior .fit() is needed
    gbt2 = xgb.XGBClassifier(learning_rate=num, max_depth=7)
    cv_scores = cross_val_score(gbt2, X_train, np.ravel(y_train), cv=4)
    cv_score_avg.append(cv_scores.mean())
plt.plot(cv_score_avg)
plt.xticks(np.arange(6), learning_rates)
plt.xlabel('Learning Rate')
plt.ylabel('Average Accuracy')
plt.show()
# Refit model with new learning rate
gbt = xgb.XGBClassifier(learning_rate=0.05,max_depth=7).fit(X_train,np.ravel(y_train))
gbt_preds = gbt.predict_proba(X_test)
gbt_preds_df = pd.DataFrame(gbt_preds[:,1], columns = ['prob_bad_risk'])
gbt_preds_df['loan_status'] = gbt_preds_df['prob_bad_risk'].apply(lambda x: 1 if x > 0.43 else 0)
print(gbt_preds_df['prob_bad_risk'].describe())
print(gbt_preds_df['loan_status'].value_counts())
# Model Evaluation and Method Comparison
# Compare LogReg and Gradient Boosted Tree Classification reports
print(classification_report(y_test, preds_df['loan_status'], target_names=target_names))
print(classification_report(y_test, gbt_preds_df['loan_status'], target_names=target_names))
# ROC chart components
fallout_lr, sensitivity_lr, thresholds_lr = roc_curve(y_test, preds[:,1])
fallout_gbt, sensitivity_gbt, thresholds_gbt = roc_curve(y_test, gbt_preds_df['prob_bad_risk'])
# ROC Chart with both
plt.plot(fallout_lr, sensitivity_lr, color = 'blue', label='%s' % 'Logistic Regression')
plt.plot(fallout_gbt, sensitivity_gbt, color = 'green', label='%s' % 'GBT')
plt.plot([0, 1], [0, 1], linestyle='--', label='%s' % 'Random Prediction')
plt.title("ROC Chart for LR and GBT on the Probability of Default")
plt.xlabel('Fall-out')
plt.ylabel('Sensitivity')
plt.legend()
plt.show()
# Print the logistic regression AUC with formatting
print("Logistic Regression AUC Score: %0.2f" % roc_auc_score(y_test, preds[:,1]))
# Print the gradient boosted tree AUC with formatting
print("Gradient Boosted Tree AUC Score: %0.2f" % roc_auc_score(y_test, gbt_preds_df['prob_bad_risk']))
from sklearn.calibration import calibration_curve
fraction_of_positives_lr, mean_predicted_value_lr = calibration_curve(y_test, preds[:,1], n_bins=10)
fraction_of_positives_gbt, mean_predicted_value_gbt = calibration_curve(y_test, gbt_preds_df['prob_bad_risk'], n_bins=10)
# Create the calibration curve plot with the guideline
plt.plot([0, 1], [0, 1], 'k:', label='Perfectly calibrated')
plt.plot(mean_predicted_value_lr, fraction_of_positives_lr,
's-', label='%s' % 'Logistic Regression')
plt.plot(mean_predicted_value_gbt, fraction_of_positives_gbt,
's-', label='%s' % 'Gradient Boosted tree')
plt.ylabel('Fraction of positives')
plt.xlabel('Average Predicted Probability')
plt.legend()
plt.title('Calibration Curve')
plt.show()
test_pred_df_gbt = pd.concat([y_test.reset_index(drop=True), gbt_preds_df], axis=1)
test_pred_df_lr = pd.concat([y_test.reset_index(drop=True), preds_df], axis=1)
# Calculate bad rate for each method - the bad rate is the percentage of accepted loans that went on to default
accepted_loans_gbt = test_pred_df_gbt[test_pred_df_gbt['loan_status']==0]
accepted_loans_lr = test_pred_df_lr[test_pred_df_lr['loan_status']==0]
print("GBT Model Bad Rate:")
print(np.sum(accepted_loans_gbt['bad_credit_risk']) / accepted_loans_gbt['bad_credit_risk'].count())
print("LR Model Bad Rate:")
print(np.sum(accepted_loans_lr['bad_credit_risk']) / accepted_loans_lr['bad_credit_risk'].count())
test_pred_df_lr.head()
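The bad rate depends on where the acceptance cutoff sits: a stricter cutoff accepts fewer loans but should let fewer defaults through. The sketch below computes the bad rate at two cutoffs on made-up outcomes and probabilities. `bad_rate` is a hypothetical helper and the values are illustrative; a loan counts as accepted when its predicted probability is at or below the threshold, matching the `loan_status` rule above.

```python
def bad_rate(defaulted, probs, threshold):
    """Share of accepted loans (prob <= threshold) that actually defaulted."""
    accepted = [d for d, p in zip(defaulted, probs) if p <= threshold]
    if not accepted:
        return 0.0  # nothing accepted, so no accepted loan can default
    return sum(accepted) / len(accepted)


br_toy_defaulted = [0, 0, 1, 0, 1, 1]
br_toy_probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.9]
loose_bad_rate = bad_rate(br_toy_defaulted, br_toy_probs, 0.5)
strict_bad_rate = bad_rate(br_toy_defaulted, br_toy_probs, 0.25)
print(loose_bad_rate, strict_bad_rate)
```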
accept_rates = [1.0,0.95,0.9,0.85,0.8,0.75,0.7,0.65,0.6,0.55,0.5,0.45,0.4,0.35,0.3,0.25,0.2,0.15,0.1,0.05]
thresholds = []
bad_rates = []
num_accepted = []
# Populate the lists for the strategy table with a for loop
for rate in accept_rates:
    # Threshold = quantile of predicted default probability matching this acceptance rate
    thresh = np.quantile(preds_df['prob_bad_risk'], rate).round(3)
    thresholds.append(thresh)
    # Reassign the loan_status value using the threshold (1 = rejected, 0 = accepted)
    test_pred_df_lr['loan_status'] = test_pred_df_lr['prob_bad_risk'].apply(lambda x: 1 if x > thresh else 0)
    # Create a set of accepted loans using this acceptance rate
    accepted_loans_lr = test_pred_df_lr[test_pred_df_lr['loan_status'] == 0]
    # Count accepted loans with the same rule used for loan_status (prob <= thresh)
    num_accepted.append(len(accepted_loans_lr))
    # Bad rate = share of accepted loans that actually defaulted
    bad_rates.append(np.round(accepted_loans_lr['bad_credit_risk'].mean(), 3))
# Create a data frame of the strategy table
df1 = pd.read_csv("C:/Users/camer/OneDrive/Desktop/Python_Fin_Projects/SouthGermanCredit.asc",sep=' ')
df1.columns = list1
df1.drop(df1[df1.status == 1].index, inplace=True)
df1.drop(df1[df1.housing == 3].index, inplace=True)
avg_loan_amnt = np.mean(df1['amount'])
strat_df = pd.DataFrame(list(zip(accept_rates, thresholds, bad_rates, num_accepted)),
columns = ['Acceptance Rate','Threshold','Bad Rate','# of Loans Accepted'])
strat_df['Avg Loan Amount'] = avg_loan_amnt
strat_df['Estimated Value'] = ((strat_df['# of Loans Accepted'] * (1 - strat_df['Bad Rate'])) * strat_df['Avg Loan Amount']) - (strat_df['# of Loans Accepted'] * strat_df['Bad Rate'] * strat_df['Avg Loan Amount'])
# Print the entire table
print(strat_df)
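The estimated value in the table treats every repaid loan as a gain of the average loan amount and every defaulted loan as a loss of the same amount, so it simplifies to accepted loans × average amount × (1 − 2 × bad rate). The sketch below checks that arithmetic on made-up inputs. `estimated_value` is a hypothetical helper and the numbers are illustrative, not taken from the strategy table.

```python
import math


def estimated_value(num_accepted, bad_rate, avg_loan_amount):
    """Value of good loans minus value of bad loans, as in the strategy table formula."""
    good_value = num_accepted * (1 - bad_rate) * avg_loan_amount
    bad_value = num_accepted * bad_rate * avg_loan_amount
    return good_value - bad_value


toy_value = estimated_value(num_accepted=100, bad_rate=0.25, avg_loan_amount=1000.0)
breakeven_value = estimated_value(num_accepted=100, bad_rate=0.5, avg_loan_amount=1000.0)
print(toy_value, breakeven_value)
```

Under this simplification the portfolio breaks even at a 50% bad rate, which is a property of the formula rather than of real lending economics.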
# Plot the strategy curve
plt.plot(strat_df['Acceptance Rate'], strat_df['Bad Rate'])
plt.xlabel('Acceptance Rate')
plt.ylabel('Bad Rate')
plt.title('Acceptance and Bad Rates')
plt.grid()  # grid on both axes of the current plot; plt.axes() with no arguments would add a new empty axes on top
plt.show()
# Create a line plot of estimated value
plt.plot(strat_df['Acceptance Rate'],strat_df['Estimated Value'])
plt.title('Estimated Value by Acceptance Rate')
plt.xlabel('Acceptance Rate')
plt.ylabel('Estimated Value')
plt.gca().yaxis.grid()  # gca() returns the current axes; plt.axes() with no arguments would add a new empty axes on top
plt.show()
# Print the row with the max estimated value
print(strat_df.loc[strat_df['Estimated Value'] == np.max(strat_df['Estimated Value'])])