Exploratory Data Analysis

Preliminary EDA: Total Dataset
Investing Deepdive: Fully Paid vs. Charged Off
Race-Based Statistical Discrimination Analysis

Preliminary EDA: Total Dataset

Our population of interest for the models discussed herein is the combined set of Fully Paid and Charged Off borrowers. Lending club defines these categories as follows:

Charged Off = A loan that has no reasonable expectation of further payments.

Fully Paid = A loan that has fully repaid the principal, plus any interest payments per the agreed interest rate.

To explore the data and train our models, we examine the 2016 Q1 Lending Club data set, with only observations classified as Fully Paid and Charged Off included. The data set contains 60,430 “Fully Paid” observations and 21,114 “Charged Off” observations, for a total of 81,544 observations.

This section focuses on understanding some basic distributional properties of the lending club dataset.

master = pd.read_csv('LC2016Q1_v2_copy.csv', skiprows=0)
master = master.sample(frac=0.1)

We first create a separate CSV file to prepare the data containing racial information about the loans.

races = pd.read_csv('zips_correct_v1.csv')

races['Zip3'] = races['Zip3'].astype(str)
races.iloc[0:3750,0] = '0' + races.iloc[0:3750]['Zip3']

zip_list = list(races['Zip3'])
unique_zip = list(set(zip_list))

white_props = []
black_props = []
asian_props = []
other_props = []
hisp_props = []
nhisp_props = []

for code in unique_zip:
    total_pop = sum([int(num) for num in races.groupby('Zip3')['Population'].apply(list)[code]])
    white_props.append(sum([int(num) for num in races.groupby('Zip3')['popwhite'].apply(list)[code]])/total_pop)
    black_props.append(sum([int(num) for num in races.groupby('Zip3')['popblack'].apply(list)[code]])/total_pop)
    asian_props.append(sum([int(num) for num in races.groupby('Zip3')['popasian'].apply(list)[code]])/total_pop)
    other_props.append(sum([int(num) for num in races.groupby('Zip3')['popother'].apply(list)[code]])/total_pop)
    hisp_props.append(sum([int(num) for num in races.groupby('Zip3')['pophisp'].apply(list)[code]])/total_pop)
    nhisp_props.append(sum([int(num) for num in races.groupby('Zip3')['popnothisp'].apply(list)[code]])/total_pop)

df_ziprace = pd.DataFrame()
df_ziprace['Zip Code'] = unique_zip
df_ziprace['White Proportion'] = white_props
df_ziprace['Black Proportion'] = black_props
df_ziprace['Asian Proportion'] = asian_props
df_ziprace['Other Proportion'] = other_props
df_ziprace['Hispanic Proportion'] = hisp_props
df_ziprace['Non-Hispanic Proportion'] = nhisp_props

import csv
with open("zip_proportions_correct.csv","w") as out:
    out.write(df_ziprace.to_csv())

race = pd.read_csv('zip_proportions_correct.csv')
race = race.drop("Unnamed: 0",axis=1)

warnings.simplefilter(action='ignore', category=FutureWarning)

f, ax = plt.subplots(1,1,figsize=(15,5))
fig = sns.distplot(master["int_rate"],norm_hist=True)
plt.xlabel("Interest Rate")
plt.ylabel('Density')
plt.title("Interest Rates Charged by Lending Club")
plt.show(fig)

png

np.mean(master['int_rate']), np.min(master['int_rate']), np.max(master['int_rate'])

(0.12703458425312952, 0.05, 0.29)

The mean interest rate for this dataset is 12.75%, though it ranges from a minimum of 5% to a maximum of 29%.

f, ax = plt.subplots(1,1,figsize=(15,5))
fig = sns.distplot(master["loan_amnt"],norm_hist=True)
plt.xlabel("Loan Amount")
plt.ylabel('Density')
plt.title("Loan Amounts of Lending Club Users")
plt.show(fig)

png

np.mean(master['loan_amnt']), np.min(master['loan_amnt']), np.max(master['loan_amnt'])

(15315.707014961981, 1000, 40000)

The mean requested loan amount for this dataset is $$$15,325, though it ranges from a minimum of $$$1000 to a maximum of $$$40,000.

f, ax = plt.subplots(1,1,figsize=(15,5))
fig = sns.distplot(master["annual_inc"],bins=1000,norm_hist=True)
plt.xlim(0,250000)
plt.xlabel("Annual Income")
plt.ylabel('Density')
plt.title("Annual Incomes of Lending Club Users")
plt.show(fig)

png

np.mean(master['annual_inc']), np.median(master['annual_inc'])

(79505.07938557764, 68000.0)

The mean income of a lending club applicant for this dataset is $$79,249$, though this mean is upwardly biased by an extremely right-skewed distribution of applicant incomes. The median income is $$68,000$.

Investing Deepdive: Fully Paid vs. Charged Off

From the perspective of the lender, the response variable of interest here is whether a borrower will fully pay his or her loan, or have it considered “charged off”. Lending Club assigns interest rates according to subgrades, which they choose and which borrowers (and lenders) have no say over. Hence, for a given interest rate, the only “edge” an investor can obtain is ascertaining the probability that a borrower fully pays back his loan. This is what our models will focus on.

As a result, the most illustrative data analysis comes in outlining the differences between the fully paid and charged off classes. Our models will exploit these differences to learn which loans are “good” (i.e. will pay fully) and which loans are “bad” (i.e. will be charged off).

fp_num = sum(master['loan_status'] == 'Fully Paid')
cf_num = sum(master['loan_status'] == 'Charged Off')

fp_prop = fp_num/len(master)
cf_prop = cf_num/len(master)

fp_cf_list = [fp_prop, cf_prop]

f = plt.figure(figsize=(5,5))

plt.bar('Loan Status',fp_prop, color = 'green', alpha=0.5)
plt.bar('Loan Status 2',cf_prop, color = 'purple', alpha=0.5)
plt.ylabel('Proportion (%)')

plt.legend(['Fully Paid', 'Charged Off'],bbox_to_anchor=(1.45, 1))

frame = plt.gca()

frame.axes.get_xaxis().set_ticks([])

plt.show()

png

Our dataset contains 74.1$\%$ fully paid observations and 25.9$\%$ charged off observations. The accuracy of our models should be interpreted in this context; any good model must, at minimum, perform better than 74.1$\%$ accuracy on this data - that is, better than a strategy of picking the most frequent result.

df = pd.read_csv('EDA_sample.csv')

for i in df['revol_util'].index:
    if pd.isnull(df['revol_util'][i]):
        df.iloc[i,21] = '0'

df['revol_util'] = df['revol_util'].apply(lambda x: x.strip('%'))

fp = df[df['loan_status'] == 'Fully Paid']
cf = df[df['loan_status'] == 'Charged Off']

fpsg = fp.iloc[:,6]
cfsg = cf.iloc[:,6]

dict_fpsg = fpsg.value_counts()
keys_fpsg = sorted(dict_fpsg.keys())
dict_fpsg = OrderedDict((key,dict_fpsg[key]) for key in keys_fpsg)
dict_fpsg.update((x, y/20000) for x, y in dict_fpsg.items())

dict_cfsg = cfsg.value_counts()
keys_cfsg = sorted(dict_cfsg.keys())
dict_cfsg = OrderedDict((key,dict_cfsg[key]) for key in keys_cfsg)
dict_cfsg.update((x, y/20000) for x, y in dict_cfsg.items())

f, sp = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(20,7))

sp[0].bar(dict_fpsg.keys(), dict_fpsg.values(), color = 'g', alpha = 0.5)
sp[0].legend(['Fully Paid'],prop={'size': 14})
sp[1].bar(dict_cfsg.keys(), dict_cfsg.values(), color = 'purple', alpha=0.5)
sp[1].legend(['Charged Off'],prop={'size': 14})

for tick in sp[0].get_xticklabels():
    tick.set_rotation(60)
for tick in sp[1].get_xticklabels():
    tick.set_rotation(60)

f.text(0.5, 0, 'Subgrade (sub_grade)', ha='center',size=15)
f.text(0.08, 0.5, 'Proportion (%)', va='center', rotation='vertical',size=15)

plt.show()

png

The distribution of subgrades for charged off is centered slightly farther right (i.e. at worse subgrades) than the distribution of fully paid loans. This is counterintuitive; if lending club’s algorithm for assigning subgrades were truly efficient, we would expect to see very little difference in these two distributions.

from collections import OrderedDict

fpdti = fp.iloc[:]['dti'].dropna()
fpdti = fpdti[fpdti<70]
cfdti = cf.iloc[:]['dti'].dropna()
cfdti = cfdti[cfdti<70]

f, sp = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(20,7))

sp[0].hist(fpdti, color = 'g', alpha = 0.5,bins=20,density=True)
sp[0].legend(['Fully Paid'],prop={'size': 14})
sp[1].hist(cfdti, color = 'purple', alpha=0.5,bins=20,density=True)
sp[1].legend(['Charged Off'],prop={'size': 14})

f.text(0.5, 0, 'Monthly-Debt-Payment-to-Income Ratio (dti)', ha='center',size=15)
f.text(0.08, 0.5, 'Proportion (%)', va='center', rotation='vertical',size=15)

plt.show()

png

np.mean(fpdti), np.mean(cfdti)

(18.57223773075185, 21.657858894170648)

Intuition suggests that borrowers with lower debt as a percentage of income are more likely to make good on their loans than borrowers with a higher debt as a percentage of income. Thus, it does make sense that the distribution of charged off applicants is centered slightly to the right of that of fully paid applicants. The mean DTI of fully paid applicants is 18.57$\%$, whereas that of charged off applicants is 21.65$\%$.

fpta = fp.iloc[:]['total_acc'].dropna()
fpta = fpta[fpta<100]
cfta = cf.iloc[:]['total_acc'].dropna()
cfta = cfta[cfta<100]

f, sp = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(20,7))

sp[0].hist(fpta, color = 'g', alpha = 0.5,bins=20,density=True)
sp[0].legend(['Fully Paid'],prop={'size': 14})
sp[1].hist(cfta, color = 'purple', alpha=0.5,bins=20,density=True)
sp[1].legend(['Charged Off'],prop={'size': 14})

f.text(0.5, 0, "Credit Lines in Borrower's Credit File (total_acc)", ha='center',size=15)
f.text(0.08, 0.5, 'Proportion (%)', va='center', rotation='vertical',size=15)

plt.show()

png

We had two possible hypotheses about the relationship between credit lines and repayment prospects for a borrower. On one hand, having more credit lines suggests that a borrower may currently carry a greater debt burden or carry an even greater one in the future, thereby impairing his ability to make payments. On the other hand, a borrower with more credit lines demonstrates that other institutions and individuals are willing to offer him credit, thereby signalling his credit-worthiness. Furthermore, this borrower will be able to refinance more conveniently than comparable borrowers with fewer credit lines.

fpru = fp.iloc[:]['revol_util'].astype(float)
cfru = cf.iloc[:]['revol_util'].astype(float)

f, sp = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(20,7))

sp[0].hist(fpru, color = 'g', alpha = 0.5,bins=20,density=True)
sp[0].legend(['Fully Paid'],prop={'size': 14})
sp[1].hist(cfru, color = 'purple', alpha=0.5,bins=20,density=True)
sp[1].legend(['Charged Off'],prop={'size': 14})

f.text(0.5, 0, "Revolving Line Utilization Rate (revol_util)", ha='center',size=15)
f.text(0.08, 0.5, 'Proportion (%)', va='center', rotation='vertical',size=15)

plt.show()

png

np.mean(fpru), np.mean(cfru)

(48.9725650000003, 53.88610499999992)

Lending club defines “revolving line utilization rate” as the amount of credit the borrower is using relative to all available revolving credit.

Investors might invoke similar logic as to that used in the discussion on credit lines above. However, in this plot, we gain a better sense of how much of available credit borrowers under either class was using before they fully paid their loans or stopped making payments. Naturally, one would expect borrowers with greater overall debt burdens to struggle more to make payments than those with lower burdens.

Indeed, the utilization rate is 48.97$\%$ for fully paid borrowers and 53.89$\%$ for charged off borrowers.

fphome = fp.iloc[:,8]
cfhome = cf.iloc[:,8]

dict_fphome = fphome.value_counts()
keys_fphome = sorted(dict_fphome.keys())
dict_fphome = OrderedDict((key,dict_fphome[key]) for key in keys_fphome)
dict_fphome.update((x, y/20000) for x, y in dict_fphome.items())

dict_cfhome = cfhome.value_counts()
keys_cfhome = sorted(dict_cfhome.keys())
dict_cfhome = OrderedDict((key,dict_cfhome[key]) for key in keys_cfhome)
dict_cfhome.update((x, y/20000) for x, y in dict_cfhome.items())

f, sp = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(20,7))

sp[0].bar(dict_fphome.keys(), dict_fphome.values(), color = 'g', alpha = 0.5)
sp[0].legend(['Fully Paid'],prop={'size': 14})
sp[1].bar(dict_cfhome.keys(), dict_cfhome.values(), color = 'purple', alpha=0.5)
sp[1].legend(['Charged Off'],prop={'size': 14})

for tick in sp[0].get_xticklabels():
    tick.set_rotation(60)
for tick in sp[1].get_xticklabels():
    tick.set_rotation(60)

f.text(0.5, -0.1, 'Home Ownership (home_ownership)', ha='center',size=15)
f.text(0.08, 0.5, 'Proportion (%)', va='center', rotation='vertical',size=15)

plt.show()

png

dict_fphome.keys(), dict_fphome.values(), dict_cfhome.keys(), dict_cfhome.values()

(odict_keys(['MORTGAGE', 'OWN', 'RENT']),
 odict_values([0.5262, 0.11855, 0.35525]),
 odict_keys(['MORTGAGE', 'OWN', 'RENT']),
 odict_values([0.41795, 0.12065, 0.4614]))

The data suggests that mortgages are positively correlated with loan repayment and renting is negatively correlated with loan repayment. It is possible that those with mortgages have more experience with borrowing and paying back debt in general.

Approximately 52.62$\%$ of fully paid applicants borrowed through a mortgage, compared to 41.8$\%$ of charged off applicants.

Conversely, approximately 35.53$\%$ of fully paid applicants rented, compared to 46.14$\%$ of charged off applicants.

fppurp = fp.iloc[:,12]
cfpurp = cf.iloc[:,12]

dict_fppurp = fppurp.value_counts()
keys_fppurp = sorted(dict_fppurp.keys())
dict_fppurp = OrderedDict((key,dict_fppurp[key]) for key in keys_fppurp)
dict_fppurp.update((x, y/20000) for x, y in dict_fppurp.items())

dict_cfpurp = cfpurp.value_counts()
keys_cfpurp = sorted(dict_cfpurp.keys())
dict_cfpurp = OrderedDict((key,dict_cfpurp[key]) for key in keys_cfpurp)
dict_cfpurp.update((x, y/20000) for x, y in dict_cfpurp.items())

f, sp = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(20,7))

sp[0].bar(dict_fppurp.keys(), dict_fppurp.values(), color = 'g', alpha = 0.5)
sp[0].legend(['Fully Paid'],prop={'size': 14})
sp[1].bar(dict_cfpurp.keys(), dict_cfpurp.values(), color = 'purple', alpha=0.5)
sp[1].legend(['Charged Off'],prop={'size': 14})

for tick in sp[0].get_xticklabels():
    tick.set_rotation(60)
for tick in sp[1].get_xticklabels():
    tick.set_rotation(60)

f.text(0.5, -0.15, 'Loan Purpose (loan_purpose)', ha='center',size=15)
f.text(0.08, 0.5, 'Proportion (%)', va='center', rotation='vertical',size=15)

plt.show()

png

Surprisingly, there does not seem to be a substantial difference in the loan purpose between the two classes.

fpterm = fp.iloc[:,3]
cfterm = cf.iloc[:,3]

dict_fpterm = fpterm.value_counts()
keys_fpterm = sorted(dict_fpterm.keys())
dict_fpterm = OrderedDict((key,dict_fpterm[key]) for key in keys_fpterm)
dict_fpterm.update((x, y/20000) for x, y in dict_fpterm.items())

dict_cfterm = cfterm.value_counts()
keys_cfterm = sorted(dict_cfterm.keys())
dict_cfterm = OrderedDict((key,dict_cfterm[key]) for key in keys_cfterm)
dict_cfterm.update((x, y/20000) for x, y in dict_cfterm.items())

f, sp = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(20,7))

sp[0].bar(dict_fpterm.keys(), dict_fpterm.values(), color = 'g', alpha = 0.5)
sp[0].legend(['Fully Paid'],prop={'size': 14})
sp[1].bar(dict_cfterm.keys(), dict_cfterm.values(), color = 'purple', alpha=0.5)
sp[1].legend(['Charged Off'],prop={'size': 14})

for tick in sp[0].get_xticklabels():
    tick.set_rotation(60)
for tick in sp[1].get_xticklabels():
    tick.set_rotation(60)

f.text(0.5, -0.1, 'Loan Term (term)', ha='center',size=15)
f.text(0.08, 0.5, 'Proportion (%)', va='center', rotation='vertical',size=15)

plt.show()

png

Applicants taking longer term loans appear significantly more likely to have their loans charged off. Perhaps there is a selection bias; some individuals who cannot afford to borrow significant amounts may take out longer loans hoping that they will eventually gather the money to pay back the principal.

There is the additional influence of longer-term loans requiring a larger gross sum of interest payment, ceteris paribus, than shorter-term loans. This may result in cash-strapped individuals having less money in the future to repay their principals.

Race-Based Statistical Discrimination Analysis

Does Lending Club statistically discriminate by race? Is a model that seeks to maximize returns and returns alone fated to violate racial parity? These are lofty questions that necessitate as granular a view of the data as one can obtain.

Fortunately for lenders of color, LC does not inquire about race (and is probably legally prohibited from doing so). Unfortunately for equal opportunity investors, however, determining without supplementary data whether Lending Club is racist may very well be impossible.

So, we turned to the 2010 US Census. Thanks to data provided by Proximity One, we accessed the racial demographics for nearly 70,000 area codes in the United States. These zip codes were formatted as five-digit integers. First, we organized the demographic data for each zip code into two sets: (1) proportion white vs. black vs. Asian vs. other (summing to one), and (2) proportion Hispanic vs. non-Hispanic (also summing to one).

Note: Hispanic and non-Hispanic are mutually exclusive, but Hispanic and white (or any of black, Asian, or other) are not mutually exclusive.

Lending Club only provides the first three digits of a borrower’s zip code in its released financials. Because zip codes are organized by order of digits, we were able to drop the trailing two digits of every zip code in the census data. This resulted in duplicate three-digit zip codes (formerly representing unique five-digit zip codes), each with its own demographic proportions. These proportions were normalized by population and averaged across all duplicate three-digit zip codes in order to produce an overarching proportion of each demographic for each unique three-digit zip code.

Using .groupby() and other Pandas functions, each of the 80,000 observations in the Lending Club dataset could then be paired with the respective demographic information of its zip code. The proportion of each race in a zip code could be thought of as the probability that the observation belongs to that race, with all demographic proportions forming a probability space that sums to one. Of course, these probabilities represent only estimates – but we believe they are the most accurate estimates possible with public data.

This is not a perfect analysis; there may very well be selection biases at play that make the true probability an observation belongs to a certain race very different from the proportion of that race present in the observation’s zip code. For example, there is a wage gap currently present between racial minority groups and whites in the United States; as a result of these lower incomes, certain racial minority groups may be less aware of the lending options available to them, such as Lending Club.

subgrades = list(set(df['sub_grade']))

mhz = list(race[race['Hispanic Proportion'] > 50]['Zip Code'])
mhsg = master[master['zip_code'].isin(mhz)]['sub_grade']

mwz = list(race[race['White Proportion'] > 50]['Zip Code'])
mwsg = master[master['zip_code'].isin(mwz)]['sub_grade']

dict_mhsg = mhsg.value_counts()

for i in range(0,len(subgrades)):
    if subgrades[i] in dict_mhsg.keys():
        continue
    else:
        dict_mhsg[subgrades[i]] = 0

keys_mhsg = sorted(dict_mhsg.keys())
dict_mhsg = OrderedDict((key,dict_mhsg[key]) for key in keys_mhsg)
dict_mhsg.update((x, y/len(mhsg)) for x, y in dict_mhsg.items())

dict_mwsg = mwsg.value_counts()
keys_mwsg = sorted(dict_mwsg.keys())
dict_mwsg = OrderedDict((key,dict_mwsg[key]) for key in keys_mwsg)
dict_mwsg.update((x, y/len(mwsg)) for x, y in dict_mwsg.items())

f, sp = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(20,7))

sp[0].bar(dict_mhsg.keys(), dict_mhsg.values(), color = 'blue', alpha = 0.5)
sp[0].legend(['Majority Hispanic Communities'],prop={'size': 14})
sp[1].bar(dict_mwsg.keys(), dict_mwsg.values(), color = 'grey', alpha=0.5)
sp[1].legend(['Majority White Communities'],prop={'size': 14})

for tick in sp[0].get_xticklabels():
    tick.set_rotation(60)
for tick in sp[1].get_xticklabels():
    tick.set_rotation(60)

f.text(0.5, 0, 'Subgrade (sub_grade)', ha='center',size=15)
f.text(0.08, 0.5, 'Proportion (%)', va='center', rotation='vertical',size=15)

plt.show()

png

Majority Hispanic communities are defined here as those where the proportion of individuals identifying as Hispanic exceeds 50$\%$. Majority White communities are defined similarly.

The distributions of subgrades for the two groups are very similar. This may be in part due to the fact that the census dataset does not consider the proportion of white individuals in a community as mutually exclusive from the proportion of Hispanic individuals. “White” and “black,” however, are considered mutually exclusive.

mbz = list(race[race['Black Proportion'] > 50]['Zip Code'])
mbsg = master[master['zip_code'].isin(mbz)]['sub_grade']

dict_mbsg = mbsg.value_counts()

for i in range(0,len(subgrades)):
    if subgrades[i] in dict_mbsg.keys():
        continue
    else:
        dict_mbsg[subgrades[i]] = 0

keys_mbsg = sorted(dict_mbsg.keys())
dict_mbsg = OrderedDict((key,dict_mbsg[key]) for key in keys_mbsg)
dict_mbsg.update((x, y/len(mbsg)) for x, y in dict_mbsg.items())

f, sp = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(20,7))

sp[0].bar(dict_mbsg.keys(), dict_mbsg.values(), color = 'orange', alpha = 0.5)
sp[0].legend(['Majority Black Communities'],prop={'size': 14})
sp[1].bar(dict_mwsg.keys(), dict_mwsg.values(), color = 'grey', alpha=0.5)
sp[1].legend(['Majority White Communities'],prop={'size': 14})

for tick in sp[0].get_xticklabels():
    tick.set_rotation(60)
for tick in sp[1].get_xticklabels():
    tick.set_rotation(60)

f.text(0.5, 0, 'Subgrade (sub_grade)', ha='center',size=15)
f.text(0.08, 0.5, 'Proportion (%)', va='center', rotation='vertical',size=15)

plt.show()

png

The distribution of subgrades for borrowers from majority-black communities has skinnier tails than the distribution for majority-white communities, though both distributions appear to be roughly centered between B and C ratings.

maz = list(race[race['Asian Proportion'] > 50]['Zip Code'])
masg = master[master['zip_code'].isin(maz)]['sub_grade']

dict_masg = masg.value_counts()

for i in range(0,len(subgrades)):
    if subgrades[i] in dict_masg.keys():
        continue
    else:
        dict_masg[subgrades[i]] = 0

keys_masg = sorted(dict_masg.keys())
dict_masg = OrderedDict((key,dict_masg[key]) for key in keys_masg)
dict_masg.update((x, y/len(masg)) for x, y in dict_masg.items())

f, sp = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(20,7))

sp[0].bar(dict_masg.keys(), dict_masg.values(), color = 'red', alpha = 0.5)
sp[0].legend(['Majority Asian Communities'],prop={'size': 14})
sp[1].bar(dict_mwsg.keys(), dict_mwsg.values(), color = 'grey', alpha=0.5)
sp[1].legend(['Majority White Communities'],prop={'size': 14})

for tick in sp[0].get_xticklabels():
    tick.set_rotation(60)
for tick in sp[1].get_xticklabels():
    tick.set_rotation(60)

f.text(0.5, 0, 'Subgrade (sub_grade)', ha='center',size=15)
f.text(0.08, 0.5, 'Proportion (%)', va='center', rotation='vertical',size=15)

plt.show()

png

There are not a sufficient amount of data points in the distribution of applicants from majority-Asian communities (there are only two such communities), but it appears that the distribution is centered in more or less the same place as that of applicants from majority-white communities.

from sklearn.neighbors.kde import KernelDensity

maj_asian_df = master[master['zip_code'].isin(maz)]
maj_hisp_df = master[master['zip_code'].isin(mhz)]
maj_black_df = master[master['zip_code'].isin(mbz)]
maj_white_df = master[master['zip_code'].isin(mwz)]

X_asian = (np.array(maj_asian_df['int_rate'])*100).reshape(-1,1)
kde_asian = KernelDensity(kernel='gaussian', bandwidth=3).fit(X_asian)

X_hispanic = (np.array(maj_hisp_df['int_rate'])*100).reshape(-1,1)
kde_hispanic = KernelDensity(kernel='gaussian', bandwidth=3).fit(X_hispanic)

X_black = (np.array(maj_black_df['int_rate'])*100).reshape(-1,1)
kde_black = KernelDensity(kernel='gaussian', bandwidth=3).fit(X_black)

X_white = (np.array(maj_white_df['int_rate'])*100).reshape(-1,1)
kde_white = KernelDensity(kernel='gaussian', bandwidth=3).fit(X_white)

X_plot = np.linspace(0, 50, 1000)[:, np.newaxis]
log_dens_asian = kde_asian.score_samples(X_plot)
log_dens_hispanic = kde_hispanic.score_samples(X_plot)
log_dens_black = kde_black.score_samples(X_plot)
log_dens_white = kde_white.score_samples(X_plot)

X_asian_dti = (np.array(maj_asian_df['dti']))
X_asian_dti = X_asian_dti[~np.isnan(X_asian_dti)].reshape(-1,1)
kde_asian_dti = KernelDensity(kernel='gaussian', bandwidth=4).fit(X_asian_dti)

X_hispanic_dti = (np.array(maj_hisp_df['dti']))
X_hispanic_dti = X_hispanic_dti[~np.isnan(X_hispanic_dti)].reshape(-1,1)
kde_hispanic_dti = KernelDensity(kernel='gaussian', bandwidth=4).fit(X_hispanic_dti)

X_black_dti = (np.array(maj_black_df['dti']))
X_black_dti = X_black_dti[~np.isnan(X_black_dti)].reshape(-1,1)
kde_black_dti = KernelDensity(kernel='gaussian', bandwidth=4).fit(X_black_dti)

X_white_dti = (np.array(maj_white_df['dti']))
X_white_dti = X_white_dti[~np.isnan(X_white_dti)].reshape(-1,1)
kde_white_dti = KernelDensity(kernel='gaussian', bandwidth=4).fit(X_white_dti)

X_plot_2 = np.linspace(0, 70, 5000)[:, np.newaxis]
log_dens_asian_dti = kde_asian_dti.score_samples(X_plot_2)
log_dens_hispanic_dti = kde_hispanic_dti.score_samples(X_plot_2)
log_dens_black_dti = kde_black_dti.score_samples(X_plot_2)
log_dens_white_dti = kde_white_dti.score_samples(X_plot_2)

f, (ax1,ax2) = plt.subplots(2, 1, figsize=(15,10))

ax1.plot(X_plot[:, 0], np.exp(log_dens_asian), '-', color = 'red')
ax1.plot(X_plot[:, 0], np.exp(log_dens_hispanic), '-', color = 'blue')
ax1.plot(X_plot[:, 0], np.exp(log_dens_black), '-', color = 'orange')
ax1.plot(X_plot[:, 0], np.exp(log_dens_white), '-', color = 'grey')
ax1.set_xlabel("Interest Rate (%)",size=12)
ax1.set_xlim(0, 30)
ax1.legend(['Majority Asian Communities','Majority Hispanic Communities',
            'Majority Black Communities','Majority White Communities'],loc='upper left')


ax2.plot(X_plot_2[:, 0], np.exp(log_dens_asian_dti), '--', color = 'red')
ax2.plot(X_plot_2[:, 0], np.exp(log_dens_hispanic_dti), '--', color = 'blue')
ax2.plot(X_plot_2[:, 0], np.exp(log_dens_black_dti), '--', color = 'orange')
ax2.plot(X_plot_2[:, 0], np.exp(log_dens_white_dti), '--', color = 'grey')
ax2.set_xlabel("Monthly-Debt-Payment-to-Income Ratio (%)",size=12)
ax2.set_xlim(0, 50)


plt.show()

png

DTI ratios and interest rate appear to follow similar distributions across racial groups.

Contents

Preliminary EDA: Total Dataset

Investing Deepdive: Fully Paid vs. Charged Off

Race-Based Statistical Discrimination Analysis