Predicting Loan Defaults Using Logistic Regression
We used data from a loan company and logistic regression to predict the odds of loan defaults with several loan characteristics as predictor variables. Different models were evaluated and cross-validated using AIC, AUC, and predicted accuracy. Weighted accuracy was also measured because the loan dataset was a stratified sample. We concluded that the interest rate most accurately predicted the odds of a loan default and that the most useful model was both simple and accurate. Research was limited by the variables that were not analyzed, the limited set of variables the loan dataset contained, and the modeling technique used.
Introduction
Traditionally, lending has been built on trust. Although credit reports existed before 1989, when the FICO Score was created, the lending process was fairly subjective, and potential borrowers were often judged by how trustworthy their character seemed. Today, lenders are able to use tools like FICO Scores to quantify how trustworthy potential borrowers are, reducing subjectivity. All of this is done for one purpose: to determine how likely it is that a given borrower will default on a loan.
Predicting default rates is a significant part of moneylending because lenders must predict whether giving out a loan will result in profit or loss. Normally, loans are profitable because of interest, but sometimes a borrower will default, which is both a betrayal of the moneylender's trust and a hazard to the moneylender's business. Thus, it is important that the lender is able to gauge the likelihood of a borrower defaulting before making the loan.
Given the high number of factors that might affect borrower default rate, it may be infeasible to come up with good estimates heuristically or by hand. The goal of this project is to explore whether or not we can employ statistical and machine learning models to better predict the risk of borrower default. By analyzing variables that describe loans and the financial situations of their borrowers, we aim to identify key relationships between the chance of default, loan characteristics, and borrower behaviors.
Data Description
For this project, we use anonymized data from a lending company. The data contains historical information on details of the loan itself and characteristics of the borrower. Some feature names are also anonymized to protect sensitive information. Of the variables in the original data file, we will target the following variables as points of interest:
- Default: This binary variable represents whether or not the borrower defaulted on the loan. Default rates will be the focus of this project because we want to analyze how they could be related to other variables. The data set contains 1,000 loans that were defaulted on and 2,000 that were not. In reality, only around 7% of loans were defaulted on, but we oversample this group to better extract signals on what might lead to loan default.
- Reason: This categorical variable represents the reason the loan was taken out. Reasons include funding a business, paying off credit cards, and consolidating an existing debt.
- Amount: This continuous variable represents the amount of money that was taken out as the loan.
- Annual Income: This continuous variable represents the amount of money that the borrower earned last year.
- Interest: This variable represents the interest rate charged on the loan.
- Term: This variable represents the length of time the loan lasts. In this data set, loan terms are either 3 or 5 years.
- Employment: This variable represents the length of time the borrower has been employed. In this data set, this variable is categorical, ranging from < 1 year to 10+ years.
- Credit Balance: This continuous variable represents the amount of money that the borrower spent on credit last year.
- Credit Ratio: This continuous variable is the proportion of credit the borrower has used up to the credit line. Values are expressed as percentages, so the ratio is multiplied by 100. Although credit used up should not surpass the credit line, a few borrowers have credit ratios greater than 100.
- v5: This is an anonymized continuous variable.
Methods
We want to focus on the impact of different loan/borrower characteristics on the probability of default. Since default is a binary variable (loans are either defaulted or not defaulted), we will use logistic regression to build a model. The formula for logistic regression is
$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$
where p is the probability that the target variable is 1 (loan defaulted), and the variables x_1, ..., x_k on the right side are predictor variables with coefficients \beta_1, ..., \beta_k. Each continuous predictor contributes one independent variable to the equation, while categorical variables are slightly more complicated. For example, a variable with four categories is encoded by treating one category as the base, while the other three contribute three binary, mutually exclusive independent variables.
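The encoding and the probability calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the category names, the choice of base category, and every coefficient value are hypothetical and were chosen only to make the example run.

```python
import math

def dummy_encode(value, categories, base):
    """Encode one categorical value as binary indicators, omitting the base category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories if c != base]

def default_probability(intercept, coefficients, features):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical four-category variable; "business" is treated as the base.
reasons = ["business", "credit_card", "debt", "other"]
encoded = dummy_encode("debt", reasons, base="business")  # three indicators

# Hypothetical coefficients: one for the interest rate, three for the dummies.
features = [12.5] + encoded
p = default_probability(-3.0, [0.15, 0.2, 0.4, -0.1], features)
```

A loan in the base category gets all three indicators set to zero, so its log-odds depend only on the intercept and the continuous predictors.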
To evaluate these logistic regression models, we will analyze AUC, AIC, predicted accuracy, and weighted accuracy. AUC measures the area under the ROC curve, which plots the true positive rate against the false positive rate across classification thresholds; a model that more consistently ranks defaulted loans above non-defaulted loans has a higher AUC, and an AUC of 0.5 corresponds to random guessing. The Akaike information criterion (AIC) estimates the information lost when a model is used to approximate the true process that generated the data, and it penalizes additional parameters, so a lower AIC suggests a better trade-off between fit and complexity.
We will also compare predicted accuracy by calculating the proportion of loans that were correctly predicted to have been defaulted/not defaulted. However, the data set does not reflect the real distribution of defaulted loans: the proportion of defaulted loans in the data set is approximately 33%, while the proportion in reality is approximately 7%. Weighted accuracy accounts for this imbalance by putting more value on defaulted loans that are predicted accurately.
We will also cross-validate our models to check that they generalize to loan data they were not trained on. A train-test split at an 80:20 ratio gives the model enough data to train on while still leaving some to test on. We will also compare the models we build against a null, or “coin toss,” model. This model randomly predicts defaults for loans based on the proportion of defaulted loans in the data set. Comparing the null model with other models will help us gauge the impact of predictor variables.
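As a rough illustration of the two metrics, the sketch below computes AUC as the fraction of (defaulted, non-defaulted) pairs in which the defaulted loan receives the higher score, which is equivalent to the area under the ROC curve, and AIC as 2k - 2 ln(L). The function names and the tiny example data are assumptions made for illustration; the project's own tooling is not shown in the text.

```python
import math

def auc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def log_likelihood(labels, probabilities):
    """Bernoulli log-likelihood of the observed labels under predicted probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for y, p in zip(labels, probabilities))

def aic(log_lik, n_parameters):
    """Akaike information criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_parameters - 2 * log_lik

# Hypothetical labels (1 = defaulted) and predicted default probabilities.
example_labels = [1, 0, 1, 0]
example_scores = [0.8, 0.3, 0.4, 0.5]
example_auc = auc(example_labels, example_scores)
example_aic = aic(log_likelihood(example_labels, example_scores), n_parameters=2)
```

The pairwise loop is quadratic in the number of loans, which is acceptable for a few thousand rows but would be replaced by a rank-based calculation on larger data.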
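The difference between plain and weighted accuracy can be sketched as follows. The text does not state the exact weighting scheme, so this sketch simply takes per-class weights as an input; the weights and labels in the example are hypothetical.

```python
def accuracy(labels, predictions):
    """Proportion of loans whose default status was predicted correctly."""
    correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return correct / len(labels)

def weighted_accuracy(labels, predictions, class_weights):
    """Accuracy in which each loan counts according to the weight of its true class."""
    total = sum(class_weights[y] for y in labels)
    correct = sum(class_weights[y]
                  for y, y_hat in zip(labels, predictions) if y == y_hat)
    return correct / total

# Hypothetical data: 1 = defaulted, 0 = not defaulted.
true_labels = [1, 1, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 1]

plain = accuracy(true_labels, predicted)
# Hypothetical weights that count each defaulted loan twice as much.
weighted = weighted_accuracy(true_labels, predicted, {1: 2.0, 0: 1.0})
```

With equal weights for both classes, `weighted_accuracy` reduces to plain accuracy, which makes it easy to check the effect of any chosen weighting.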
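A minimal sketch of the 80:20 split and the null model is shown below, using only the Python standard library. The helper names, the fixed random seeds, and the use of bare labels in place of full loan records are assumptions made for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def null_model_predictions(train_labels, n_predictions, seed=0):
    """Predict default at random, at the default rate observed in the training labels."""
    rng = random.Random(seed)
    default_rate = sum(train_labels) / len(train_labels)
    return [1 if rng.random() < default_rate else 0 for _ in range(n_predictions)]

# Stand-in for the data set: 1,000 defaulted loans and 2,000 that were not.
all_labels = [1] * 1000 + [0] * 2000
train, test = train_test_split(all_labels)
baseline = null_model_predictions(train, len(test))
```

Scoring `baseline` against `test` with the same metrics used for the fitted models gives the reference point that any useful model should beat.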
After evaluating different models that used different predictor variables, I noticed that of all the independent variables, interest predicted default rates most accurately. Thus, interest rate was used to predict default rates for all the models included in the results. Other characteristics of loans or borrowers of loans that proved to be useful for predicting default were annual income and loan amount.
Discussion
In this project, I was only able to examine basic predictor variables such as interest and amount. I did not find many more patterns between other variables, but I would be interested in studying the other variables more in-depth. Anonymized variables specifically were mostly skipped over, so future steps could include researching those. I would also be interested in re-analyzing the variables that I did use by splitting them into different categories. Perhaps this would help identify special patterns in the data that were not clear in previous models. Exploring how demographics and cultural background tie in to loan defaulting would also be an interesting extension of this project. Perhaps some cultures emphasize the importance of honor more than others would, thus discouraging using loans and especially defaulting on loans.
Looking at the coefficients for Model 4, I was surprised that a larger loan amount is generally associated with lower odds of default. This may be because borrowers who take out larger loans are more cautious or plan more carefully. For example, amortization schedules for mortgages would tend to be better planned than repayment for a loan taken out on impulse. However, if interest is raised as well, then the lowering effect of higher loan amounts on probability of default diminishes. Thus, Model 4 implies that the ideal loan with minimal probability of default would have a large amount and a low interest rate.
It is also interesting that a longer term is associated with lower odds of defaulting; perhaps this is because borrowers have more time to pull themselves out of debt. Aside from those two predictor variables, the effects of interest, income, and the interaction between amount and interest are all as expected, because wealthier borrowers are more likely to be able to pay back a loan, and high-interest loans are less likely to be paid back.
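The role of the interaction term can be made explicit with a short derivation. The symbols below are notation introduced here for illustration (A for amount, I for interest, T for term, Y for income), and the signs of the coefficients are assumed from the description above rather than taken from the fitted values.

```latex
\log\frac{p}{1-p}
  = \beta_0 + \beta_A A + \beta_I I + \beta_T T + \beta_Y Y + \beta_{AI}\,(A \cdot I)
\quad\Longrightarrow\quad
\frac{\partial}{\partial A}\,\log\frac{p}{1-p} = \beta_A + \beta_{AI}\, I
```

If \beta_A < 0 and \beta_{AI} > 0, the change in log-odds per unit of loan amount is negative at low interest rates and shrinks toward zero as I rises, reaching zero at I = -\beta_A / \beta_{AI}. This matches the observation that the lowering effect of a larger amount diminishes as interest increases.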
Generally, I found Model 4, which used the predictor variables of amount, interest, term, income, and an interaction between amount and interest, to perform the best because it balanced simplicity and performance. It was one of the most accurate models I was able to build, but it was not overly complicated, and every predictor variable still had a significant effect on the probability of default. In terms of future research, combining some predictor variables from Model 4 with other predictor variables that were left relatively unexplored could yield a better model.
If possible, using different types of models would also allow for different interpretations of the same variables. Logistic regression assumes that each predictor variable has a linear, and therefore one-directional, relationship with the log-odds of the outcome. Interest worked particularly well with the logistic regression models because its relationship with default rates was close to linear. However, the effects of variables such as credit balance or loan amount are often more complicated than that. I would be interested in exploring other types of models that could reflect the more complex nature of predictor variables.
Conclusion
We were able to conclude that the probability of a loan default may be predicted by loan interest rates, loan amount, and borrower income, among other factors. We also supported the credibility of our models with evaluation metrics that measured accuracy and error. The predictor variable that best suited logistic regression was interest because of its approximately linear relationship with default. In order to further improve on this research, different predictor variables or types of models may be examined.