We downloaded the 2016 Q1 Accepted data from Lending Club’s website. This data contained information about all the accepted loans in the first quarter of 2016 with information on an applicant’s demographics and financial background, such as income, zip code, and occupation. The dataset also contained information about the loan amount and the interest rate assigned by Lending Club.
Our first step was to go through the dataset and eliminate columns representing information that would not be available to investors choosing which applications to fund. These included columns such as whether the loan was Fully Paid or Charged Off and the total payment received on the loan. Removing these columns let us view the data as investors looking at loan applications would, and base our investment strategy only on those observations.
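The column-removal step can be sketched as follows. This is a minimal illustration, not the code we ran: the column names `loan_status` and `total_pymnt` and the toy rows are assumptions chosen for the example, and the real export may name these fields differently.

```python
import pandas as pd

# Hypothetical names for columns that are only known after a loan has run its course.
POST_OUTCOME_COLUMNS = ["loan_status", "total_pymnt"]


def split_features_and_outcome(loans, outcome_col="loan_status",
                               leak_cols=POST_OUTCOME_COLUMNS):
    """Separate the outcome label from the columns an investor could see up front."""
    outcome = loans[outcome_col]
    # Drop every post-outcome column that is present in the frame.
    present = [col for col in leak_cols if col in loans.columns]
    features = loans.drop(columns=present)
    return features, outcome


# Toy data standing in for the downloaded file (values are made up).
loans = pd.DataFrame({
    "annual_inc": [52000, 87000, 41000],
    "int_rate": [0.11, 0.07, 0.19],
    "loan_status": ["Fully Paid", "Fully Paid", "Charged Off"],
    "total_pymnt": [11200.0, 21400.0, 3100.0],
})

features, outcome = split_features_and_outcome(loans)
```

Keeping the outcome in a separate series means it can still serve as the training label while staying out of the predictor matrix.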
We made several visualizations of the data, splitting between loans that were Fully Paid or Charged Off. These visualizations allowed us to see differences between successful and unsuccessful loans.
To gauge the extent of racial discrimination in the model, we calculated the average racial demographics of the test set by averaging the proportion of each demographic across all observations. Then, after selecting the best $n$ loans, we calculated the average racial demographics again.
For every racial demographic, the share among the $n$ “best loans” was within 2 percent of its share in the greater test set. We therefore conclude with reasonable confidence that our models choose loans that reflect the data fed into them and are not statistically discriminating.
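The comparison described above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions: each loan carries a row of demographic proportions (for example, those of its zip code), the model produces one score per loan, and "within 2 percent" is read as 2 percentage points. The function name, the toy numbers, and the tolerance interpretation are all assumptions, not taken from our notebook.

```python
import numpy as np


def demographic_shift(proportions, scores, n):
    """Compare average demographics of the top-n scored loans with the full set.

    proportions: (num_loans, num_groups) array of demographic shares per loan.
    scores: one model score per loan, higher meaning a more attractive loan.
    """
    proportions = np.asarray(proportions, dtype=float)
    scores = np.asarray(scores, dtype=float)
    overall = proportions.mean(axis=0)
    # Indices of the n highest-scoring loans.
    top_n = np.argsort(scores)[::-1][:n]
    selected = proportions[top_n].mean(axis=0)
    return overall, selected, selected - overall


# Toy example with four loans and three demographic groups (made-up values).
proportions = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.7, 0.2, 0.1],
    [0.4, 0.4, 0.2],
]
scores = [0.9, 0.2, 0.8, 0.1]

overall, selected, diff = demographic_shift(proportions, scores, n=2)
# Assumed reading of the tolerance: every group within 2 percentage points.
within_tolerance = bool(np.all(np.abs(diff) <= 0.02))
```

In this toy case the selected loans drift well outside the tolerance, which is the kind of gap the check is meant to surface.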
That being said, selection bias may still lurk in this dataset. Certain racial groups may well use Lending Club at a rate that exceeds their share of the population in their local zip code. Unfortunately, our models cannot correct for this possible selection bias.
In order to build our models, we first had to clean our dataset completely. This involved imputing any missing data within our data frame and selecting important predictors from the thousands of features available to us. Using a Random Decision Forest Regressor, we were able to select the 120 most significant predictors in the dataset.
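One way to carry out the imputation and predictor selection described above is sketched below with scikit-learn. The median imputation strategy, the number of trees, the helper name `select_top_features`, and the synthetic data are assumptions made for illustration; they are not a record of the exact settings we used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer


def select_top_features(X, y, feature_names, k, random_state=0):
    """Return the k feature names with the highest random-forest importances."""
    # Fill missing values with each column's median (an assumed strategy).
    X_filled = SimpleImputer(strategy="median").fit_transform(X)
    forest = RandomForestRegressor(n_estimators=100, random_state=random_state)
    forest.fit(X_filled, y)
    # Sort features from most to least important and keep the first k.
    order = np.argsort(forest.feature_importances_)[::-1][:k]
    return [feature_names[i] for i in order]


# Synthetic data: only the first feature drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)
# Knock out about 5% of the entries to mimic missing data.
X[rng.random(X.shape) < 0.05] = np.nan
names = [f"f{i}" for i in range(5)]

top = select_top_features(X, y, names, k=2)
```

On the real data, `k` would be set to 120 to match the number of predictors we kept.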
Afterwards, we built five models based on the major prediction schemes we learned in class: Logistic, Unlimited Depth Decision Forest, Limited Depth Decision Forest, K Nearest Neighbors, and Neural Network. After optimizing our models, we conclude that the Logistic Model yielded the best results.
With this approach, we were able to greatly exceed Lending Club’s existing standards for investment interest returns. Our Logistic Model achieved an average return of $12.58\%$ on selected investments, compared to the average returns of $4\%$ to $5\%$ that Lending Club reported on its website.
display(model_summary)
Model | Test Accuracy | Investment Returns | ROC AUC |
---|---|---|---|
Logistic | 0.735294 | 0.126066 | 0.662668 |
Decision Forest (Unlimited) | 0.700163 | 0.126803 | 0.650907 |
Decision Forest (Limited) | 0.745915 | 0.119016 | 0.663640 |
kNN | 0.684641 | 0.133525 | 0.551456 |
Neural Network | 0.424837 | 0.112951 | 0.554948 |
There are several potential areas in which future work can be done.
One area that we can explore further is whether there are other forms of discrimination besides racial discrimination. Since the data contains information on the applicants’ housing situations, occupations, and reasons for borrowing, we can perform a similar analysis to see whether our investment strategy discriminates against certain types of people.
Another area we can explore further is time series analysis on the loans. We can see how the interest rates assigned by Lending Club change over time. These changes will inevitably affect the efficiency of our models and the strategies that we choose to implement.
We could also see whether we can improve our models by using the ensemble methods introduced towards the end of lecture in CS 109. Because we built many diverse models that predict well on different subsets of the data, ensemble methods could not only significantly improve our test accuracy but also provide a significant boost in the efficacy and returns of our model-powered investment strategies.
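As a rough illustration of what such an ensemble might look like, the sketch below averages the predicted probabilities of several models (often called soft voting). This is only one of many possible ensemble schemes and is not something we implemented; the function name, the threshold, and the probabilities are hypothetical.

```python
import numpy as np


def soft_vote(probabilities, threshold=0.5):
    """Average per-model probabilities and threshold the mean.

    probabilities: (num_models, num_loans) array-like, where each row holds one
    model's predicted probability that each loan is fully paid.
    """
    stacked = np.asarray(probabilities, dtype=float)
    mean_prob = stacked.mean(axis=0)
    labels = (mean_prob >= threshold).astype(int)
    return mean_prob, labels


# Made-up probabilities from three models for three loans.
model_probs = [
    [0.9, 0.4, 0.2],
    [0.7, 0.6, 0.1],
    [0.8, 0.3, 0.6],
]

mean_prob, labels = soft_vote(model_probs)
```

The averaged probabilities could then be used to rank loans in the same way as a single model's scores.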