Bureau and Bureau Balance data¶
*bureau.csv* data concerns client's earlier credits from other financial institutions. Some of the credits may be active and some are closed. Each previous (or ongoing) credit has its own row (only one row per credit) in *bureau* dataset. As a single client might have taken other loans from other financial institutions, for each row in the *application_train* data (ie *application_train.csv*) we can have multiple rows in this table. Feature explanations for this dataset are as below.
Feature explanations¶
Bureau table¶
SK_ID_CURR: ID of loan in our sample - one loan in our sample can have 0,1,2 or more related previous credits in credit bureau
SK_BUREAU_ID: Recoded ID of previous Credit Bureau credit related to our loan (unique coding for each loan application)
CREDIT_ACTIVE: Status of the Credit Bureau (CB) reported credits
CREDIT_CURRENCY: Recoded currency of the Credit Bureau credit
DAYS_CREDIT: How many days before current application did client apply for Credit Bureau credit
CREDIT_DAY_OVERDUE: Number of days past due on CB credit at the time of application for related loan in our sample
DAYS_CREDIT_ENDDATE: Remaining duration of CB credit (in days) at the time of application in Home Credit
DAYS_ENDDATE_FACT: Days since CB credit ended at the time of application in Home Credit (only for closed credit)
AMT_CREDIT_MAX_OVERDUE: Maximal amount overdue on the Credit Bureau credit so far (at application date of loan in our sample)
CNT_CREDIT_PROLONG: How many times was the Credit Bureau credit prolonged
AMT_CREDIT_SUM: Current credit amount for the Credit Bureau credit
AMT_CREDIT_SUM_DEBT: Current debt on Credit Bureau credit
AMT_CREDIT_SUM_LIMIT: Current credit limit of credit card reported in Credit Bureau
AMT_CREDIT_SUM_OVERDUE: Current amount overdue on Credit Bureau credit
CREDIT_TYPE: Type of Credit Bureau credit (Car, cash,...)
DAYS_CREDIT_UPDATE: How many days before loan application did last information about the Credit Bureau credit come
AMT_ANNUITY: Annuity of the Credit Bureau credit
Bureau Balance table¶
SK_BUREAU_ID: Recoded ID of Credit Bureau credit (unique coding for each application) - use this to join to CREDIT_BUREAU table
MONTHS_BALANCE: Month of balance relative to application date (-1 means the freshest balance date) time only relative to the application
STATUS: Status of Credit Bureau loan during the month
# Last amended: 21st October, 2020
# Myfolder: C:\Users\Administrator\OneDrive\Documents\home_credit_default_risk
# Objective:
# Solving Kaggle problem: Home Credit Default Risk
# Processing bureau and bureau_balance datasets.
#
# Data Source: https://www.kaggle.com/c/home-credit-default-risk/data
# Ref: https://www.kaggle.com/jsaguiar/lightgbm-with-simple-features
# 1.0 Libraries
# (Some of these may not be needed here.)
%reset -f
import numpy as np
import pandas as pd
import gc
# 1.1 Reduce read data size
# There is a file reducing.py
# in this folder. A class
# in it is used to reduce
# dataframe size
# (Code modified to
# exclude 'category' dtype)
import reducing
# 1.2 Misc
import warnings
import os
warnings.simplefilter(action='ignore', category=FutureWarning)
# 1.3
pd.set_option('display.max_colwidth', -1)
# 1.4 Display multiple commands outputs from a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# 2.0 One-hot encoding function. Uses pd.get_dummies()
# i) To transform 'object' columns to dummies.
# ii) Treat NaN as one of the categories
# iii) Returns transformed-data and new-columns created
def one_hot_encoder(df, nan_as_category = True):
original_columns = list(df.columns)
categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
df = pd.get_dummies(df,
columns= categorical_columns,
dummy_na= nan_as_category # Treat NaNs as category
)
new_columns = [c for c in df.columns if c not in original_columns]
return df, new_columns
# 3.0 Prepare to read data
pathToData = "C:\\Users\\Administrator\\OneDrive\\Documents\\home_credit_default_risk"
os.chdir(pathToData)
# 3.1 Some constants
num_rows=None # Implies read all rows
nan_as_category = True # While transforming
# 'object' columns to dummies
# 3.2 Read bureau data first
bureau = pd.read_csv(
'bureau.csv.zip',
nrows = None # Read all rows
)
# 3.2.1 Reduce memory usage by appropriately
# changing data-types per feature:
bureau = reducing.Reducer().reduce(bureau)
reduced df from 222.6203 MB to 165.3284 MB in 1.56 seconds
# 3.2.2 Explore data now
bureau.head(5)
bureau.shape # (rows:17,16,428, cols: 17)
bureau.dtypes
SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
(1716428, 17)
SK_ID_CURR uint32
SK_ID_BUREAU uint32
CREDIT_ACTIVE object
CREDIT_CURRENCY object
DAYS_CREDIT int16
CREDIT_DAY_OVERDUE uint16
DAYS_CREDIT_ENDDATE float64
DAYS_ENDDATE_FACT float64
AMT_CREDIT_MAX_OVERDUE float64
CNT_CREDIT_PROLONG uint8
AMT_CREDIT_SUM float64
AMT_CREDIT_SUM_DEBT float64
AMT_CREDIT_SUM_LIMIT float64
AMT_CREDIT_SUM_OVERDUE float32
CREDIT_TYPE object
DAYS_CREDIT_UPDATE int32
AMT_ANNUITY float64
dtype: object
# 3.2.3 In all, how many are categoricals?
bureau.dtypes.value_counts()
float64 7
object 3
uint32 2
float32 1
int32 1
uint16 1
uint8 1
int16 1
dtype: int64
# 3.3
bureau.shape # (1716428, 17)
# 3.3.1
# What is the actual number of persons
# who might have taken multiple loans?
bureau['SK_ID_CURR'].nunique() # 305811 -- Many duplicate values exist
# Consider SK_ID_CURR as Foreign Key
# Primary key exists in application_train data
# Primary key: SK_ID_BUREAU
# 3.3.2
# As expected, there are no duplicate values here
bureau['SK_ID_BUREAU'].nunique() # 1716428 -- Unique id for each row
(1716428, 17)
305811
1716428
# 3.4 Summary of active/closed cases from bureau
# We aggregate on these also
bureau['CREDIT_ACTIVE'].value_counts()
Closed 1079273
Active 630607
Sold 6527
Bad debt 21
Name: CREDIT_ACTIVE, dtype: int64
Aggregation¶
bureau_balance will be aggregated and merged with bureau. bureau will then be aggregated and merged with 'application_train' data. bureau will be aggregated in three different ways. This aggregation will be by SK_ID_CURR. Finally, aggregated bureau, called bureau_agg, will be merged with 'application_train' over (SK_ID_CURR).
Aggregation over time is one way to extract behaviour of client. All categorical data is first OneHotEncoded (OHE). What is unique about this OHE is that NaN values are treated as categories.# 4.0 OneHotEncode 'object' types in bureau bureau, bureau_cat = one_hot_encoder(bureau, nan_as_category)# 4.1 bureau.head() bureau.shape # (1716428, 40); 17-->40 print(bureau_cat) # List of added columns
SK_ID_CURR SK_ID_BUREAU DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT ... CREDIT_TYPE_Loan for business development CREDIT_TYPE_Loan for purchase of shares (margin lending) CREDIT_TYPE_Loan for the purchase of equipment CREDIT_TYPE_Loan for working capital replenishment CREDIT_TYPE_Microloan CREDIT_TYPE_Mobile operator loan CREDIT_TYPE_Mortgage CREDIT_TYPE_Real estate loan CREDIT_TYPE_Unknown type of loan CREDIT_TYPE_nan 0 215354 5714462 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 1 215354 5714463 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 ... 0 0 0 0 0 0 0 0 0 0 2 215354 5714464 -203 0 528.0 NaN NaN 0 464323.5 NaN ... 0 0 0 0 0 0 0 0 0 0 3 215354 5714465 -203 0 NaN NaN NaN 0 90000.0 NaN ... 0 0 0 0 0 0 0 0 0 0 4 215354 5714466 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN ... 0 0 0 0 0 0 0 0 0 0 5 rows × 40 columns
(1716428, 40)['CREDIT_ACTIVE_Active', 'CREDIT_ACTIVE_Bad debt', 'CREDIT_ACTIVE_Closed', 'CREDIT_ACTIVE_Sold', 'CREDIT_ACTIVE_nan', 'CREDIT_CURRENCY_currency 1', 'CREDIT_CURRENCY_currency 2', 'CREDIT_CURRENCY_currency 3', 'CREDIT_CURRENCY_currency 4', 'CREDIT_CURRENCY_nan', 'CREDIT_TYPE_Another type of loan', 'CREDIT_TYPE_Car loan', 'CREDIT_TYPE_Cash loan (non-earmarked)', 'CREDIT_TYPE_Consumer credit', 'CREDIT_TYPE_Credit card', 'CREDIT_TYPE_Interbank credit', 'CREDIT_TYPE_Loan for business development', 'CREDIT_TYPE_Loan for purchase of shares (margin lending)', 'CREDIT_TYPE_Loan for the purchase of equipment', 'CREDIT_TYPE_Loan for working capital replenishment', 'CREDIT_TYPE_Microloan', 'CREDIT_TYPE_Mobile operator loan', 'CREDIT_TYPE_Mortgage', 'CREDIT_TYPE_Real estate loan', 'CREDIT_TYPE_Unknown type of loan', 'CREDIT_TYPE_nan']
bureau_balance¶
It is monthly data about the remaining balance of each one of the previous credits of clients that exist in dataset bureau. Each previous credit is identified by a unique ID, SK_ID_BUREAU, in dataset bureau. Each row in bureau_balance is one month of credit-due (from previous credit), and a single previous credit can have multiple rows, one for each month of the credit length.
In my personal view, it should be in decreasing order. That is, for every person identified by SK_ID_BUREAU, credits should be decreasing each passing month.
# 5.0 Read over bureau_balance data
# and reduce memory usage through
# conversion of data-types:
bb = pd.read_csv('bureau_balance.csv.zip', nrows = None)
bb = reducing.Reducer().reduce(bb)
reduced df from 624.8458 MB to 338.4582 MB in 4.78 seconds
# 5.0.1 Display few rows
bb.head(10)
# 5.0.2 & Compare
bb.shape # (27299925, 3)
bureau.shape # (1716428, 17)
SK_ID_BUREAU | MONTHS_BALANCE | STATUS | |
---|---|---|---|
0 | 5715448 | 0 | C |
1 | 5715448 | -1 | C |
2 | 5715448 | -2 | C |
3 | 5715448 | -3 | C |
4 | 5715448 | -4 | C |
5 | 5715448 | -5 | C |
6 | 5715448 | -6 | C |
7 | 5715448 | -7 | C |
8 | 5715448 | -8 | C |
9 | 5715448 | -9 | 0 |
(27299925, 3)
(1716428, 40)
# 5.1 There is just one 'object' column
bb.dtypes.value_counts()
object 1
int8 1
uint32 1
dtype: int64
# 5.2 Is the data about all bureau cases?
# No, it appears it is not for all cases
bb['SK_ID_BUREAU'].nunique() # 817395 << 1716428
817395
# 5.3 Just which cases are present in 'bureau' but absent
# in 'bb'
bb_id_set = set(bb['SK_ID_BUREAU']) # Set of IDs in bb
bureau_id_set = set(bureau['SK_ID_BUREAU']) # Set of IDs in bureau
# 5.4 And here is the difference list.
# How many of them?
list(bureau_id_set - bb_id_set)[:5] # sample [6292791,6292792,6292793,6292795,6292796,6292797,6292798,6292799]
len(bureau_id_set - bb_id_set) # 942074
[6292791, 6292792, 6292793, 6292795, 6292796]
942074
# 5.5 OK. So let us OneHotEncode bb
bb, bb_cat = one_hot_encoder(bb, nan_as_category)
# 5.6 Examine the results
bb.head()
bb.shape # (27299925, 11) ; 3-->11
# 1 (ID) + 1 (numeric) + 9 (dummy)
bb_cat # New columns added
SK_ID_BUREAU | MONTHS_BALANCE | STATUS_0 | STATUS_1 | STATUS_2 | STATUS_3 | STATUS_4 | STATUS_5 | STATUS_C | STATUS_X | STATUS_nan | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5715448 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 5715448 | -1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 5715448 | -2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 5715448 | -3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 5715448 | -4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
(27299925, 11)
['STATUS_0',
'STATUS_1',
'STATUS_2',
'STATUS_3',
'STATUS_4',
'STATUS_5',
'STATUS_C',
'STATUS_X',
'STATUS_nan']
Performing aggregations in bb¶
There is one numeric feature: 'MONTHS_BALANCE'. On this feature we will perform ['min', 'max', 'size']. And on the rest of the features,dummy features, we will perform [mean]. Aggregation is by unique bureau ID, SK_ID_BUREAU. Resulting dataset is called bureau_agg.
# 6.0 Bureau balance: Perform aggregations and merge with bureau.csv
# First prepare a dictionary listing operations to be performed
# on various features:
bb_aggregations = {'MONTHS_BALANCE': ['min', 'max', 'size']}
for col in bb_cat:
bb_aggregations[col] = ['mean']
# 6.0.1
len(bb_aggregations) # 10
10
# 6.1 So what all aggregations to perform column-wise
bb_aggregations
{'MONTHS_BALANCE': ['min', 'max', 'size'],
'STATUS_0': ['mean'],
'STATUS_1': ['mean'],
'STATUS_2': ['mean'],
'STATUS_3': ['mean'],
'STATUS_4': ['mean'],
'STATUS_5': ['mean'],
'STATUS_C': ['mean'],
'STATUS_X': ['mean'],
'STATUS_nan': ['mean']}
# 6.2 Perform aggregations now in bb:
grouped = bb.groupby('SK_ID_BUREAU')
bb_agg = bb.groupby('SK_ID_BUREAU').agg(bb_aggregations)
# 6.3
bb_agg.shape # (817395, 12)
bb_agg.columns
# 6.3.1 Note that 'SK_ID_BUREAU'
# the grouping column is
# now table-index
bb_agg.head()
(817395, 12)
MultiIndex([('MONTHS_BALANCE', 'min'),
('MONTHS_BALANCE', 'max'),
('MONTHS_BALANCE', 'size'),
( 'STATUS_0', 'mean'),
( 'STATUS_1', 'mean'),
( 'STATUS_2', 'mean'),
( 'STATUS_3', 'mean'),
( 'STATUS_4', 'mean'),
( 'STATUS_5', 'mean'),
( 'STATUS_C', 'mean'),
( 'STATUS_X', 'mean'),
( 'STATUS_nan', 'mean')],
)
MONTHS_BALANCE | STATUS_0 | STATUS_1 | STATUS_2 | STATUS_3 | STATUS_4 | STATUS_5 | STATUS_C | STATUS_X | STATUS_nan | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|
min | max | size | mean | mean | mean | mean | mean | mean | mean | mean | mean | |
SK_ID_BUREAU | ||||||||||||
5001709 | -96 | 0 | 97 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.886598 | 0.113402 | 0 |
5001710 | -82 | 0 | 83 | 0.060241 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.578313 | 0.361446 | 0 |
5001711 | -3 | 0 | 4 | 0.750000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.250000 | 0 |
5001712 | -18 | 0 | 19 | 0.526316 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.473684 | 0.000000 | 0 |
5001713 | -21 | 0 | 22 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.000000 | 0 |
# 6.4 Rename bb_agg columns
bb_agg.columns = pd.Index([e[0] + "_" + e[1].upper() for e in bb_agg.columns.tolist()])
# 6.4.1
bb_agg.columns.tolist()
bb_agg.head()
['MONTHS_BALANCE_MIN',
'MONTHS_BALANCE_MAX',
'MONTHS_BALANCE_SIZE',
'STATUS_0_MEAN',
'STATUS_1_MEAN',
'STATUS_2_MEAN',
'STATUS_3_MEAN',
'STATUS_4_MEAN',
'STATUS_5_MEAN',
'STATUS_C_MEAN',
'STATUS_X_MEAN',
'STATUS_nan_MEAN']
MONTHS_BALANCE_MIN | MONTHS_BALANCE_MAX | MONTHS_BALANCE_SIZE | STATUS_0_MEAN | STATUS_1_MEAN | STATUS_2_MEAN | STATUS_3_MEAN | STATUS_4_MEAN | STATUS_5_MEAN | STATUS_C_MEAN | STATUS_X_MEAN | STATUS_nan_MEAN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SK_ID_BUREAU | ||||||||||||
5001709 | -96 | 0 | 97 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.886598 | 0.113402 | 0 |
5001710 | -82 | 0 | 83 | 0.060241 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.578313 | 0.361446 | 0 |
5001711 | -3 | 0 | 4 | 0.750000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.250000 | 0 |
5001712 | -18 | 0 | 19 | 0.526316 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.473684 | 0.000000 | 0 |
5001713 | -21 | 0 | 22 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.000000 | 0 |
# 6.5 Merge aggregated bb with bureau
bureau = bureau.join(
bb_agg,
how='left',
on='SK_ID_BUREAU'
)
# 6.5.1
bureau.head()
bureau.shape # (1716428, 52)
bureau.dtypes
SK_ID_CURR | SK_ID_BUREAU | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | ... | MONTHS_BALANCE_SIZE | STATUS_0_MEAN | STATUS_1_MEAN | STATUS_2_MEAN | STATUS_3_MEAN | STATUS_4_MEAN | STATUS_5_MEAN | STATUS_C_MEAN | STATUS_X_MEAN | STATUS_nan_MEAN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 215354 | 5714462 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 215354 | 5714463 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 215354 | 5714464 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 215354 | 5714465 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 215354 | 5714466 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 52 columns
(1716428, 52)
SK_ID_CURR uint32
SK_ID_BUREAU uint32
DAYS_CREDIT int16
CREDIT_DAY_OVERDUE uint16
DAYS_CREDIT_ENDDATE float64
DAYS_ENDDATE_FACT float64
AMT_CREDIT_MAX_OVERDUE float64
CNT_CREDIT_PROLONG uint8
AMT_CREDIT_SUM float64
AMT_CREDIT_SUM_DEBT float64
AMT_CREDIT_SUM_LIMIT float64
AMT_CREDIT_SUM_OVERDUE float32
DAYS_CREDIT_UPDATE int32
AMT_ANNUITY float64
CREDIT_ACTIVE_Active uint8
CREDIT_ACTIVE_Bad debt uint8
CREDIT_ACTIVE_Closed uint8
CREDIT_ACTIVE_Sold uint8
CREDIT_ACTIVE_nan uint8
CREDIT_CURRENCY_currency 1 uint8
CREDIT_CURRENCY_currency 2 uint8
CREDIT_CURRENCY_currency 3 uint8
CREDIT_CURRENCY_currency 4 uint8
CREDIT_CURRENCY_nan uint8
CREDIT_TYPE_Another type of loan uint8
CREDIT_TYPE_Car loan uint8
CREDIT_TYPE_Cash loan (non-earmarked) uint8
CREDIT_TYPE_Consumer credit uint8
CREDIT_TYPE_Credit card uint8
CREDIT_TYPE_Interbank credit uint8
CREDIT_TYPE_Loan for business development uint8
CREDIT_TYPE_Loan for purchase of shares (margin lending) uint8
CREDIT_TYPE_Loan for the purchase of equipment uint8
CREDIT_TYPE_Loan for working capital replenishment uint8
CREDIT_TYPE_Microloan uint8
CREDIT_TYPE_Mobile operator loan uint8
CREDIT_TYPE_Mortgage uint8
CREDIT_TYPE_Real estate loan uint8
CREDIT_TYPE_Unknown type of loan uint8
CREDIT_TYPE_nan uint8
MONTHS_BALANCE_MIN float64
MONTHS_BALANCE_MAX float64
MONTHS_BALANCE_SIZE float64
STATUS_0_MEAN float64
STATUS_1_MEAN float64
STATUS_2_MEAN float64
STATUS_3_MEAN float64
STATUS_4_MEAN float64
STATUS_5_MEAN float64
STATUS_C_MEAN float64
STATUS_X_MEAN float64
STATUS_nan_MEAN float64
dtype: object
# 6.5.2 Just for curiosity, what happened
# to those rows in 'bureau' where there
# was no matching record in bb_agg. The list
# of such IDs is:
# [6292791,6292792,6292793,6292795,6292796,6292797,6292798,6292799]
bureau[bureau['SK_ID_BUREAU'] ==6292791]
SK_ID_CURR | SK_ID_BUREAU | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | ... | MONTHS_BALANCE_SIZE | STATUS_0_MEAN | STATUS_1_MEAN | STATUS_2_MEAN | STATUS_3_MEAN | STATUS_4_MEAN | STATUS_5_MEAN | STATUS_C_MEAN | STATUS_X_MEAN | STATUS_nan_MEAN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
941854 | 190757 | 6292791 | -707 | 0 | -300.0 | -300.0 | 0.0 | 0 | 2250000.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 rows × 52 columns
# 6.6 Drop SK_ID_BUREAU as bb has finally merged.
bureau.drop(['SK_ID_BUREAU'],
axis=1,
inplace= True
)
# We have three types of columns
# Categorical columns generated from bureau
# Categorical columns generated from bb
# Numerical columns
Performing aggregations in bureau¶
Aggregate 14 original numeric columns, as: ['min', 'max', 'mean', 'var']
Aggregate rest of the columns that is dummy columns as: [mean].
This constitutes one of the three aggretaions. Aggregation is by SK_ID_CURR. Resulting dataset is called bureau_agg
# 7.0 Have a look at bureau again.
# SK_ID_CURR repeats for many cases.
# So, there is a case for aggregation
bureau.shape # (1716428, 51)
bureau.head()
(1716428, 51)
SK_ID_CURR | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | ... | MONTHS_BALANCE_SIZE | STATUS_0_MEAN | STATUS_1_MEAN | STATUS_2_MEAN | STATUS_3_MEAN | STATUS_4_MEAN | STATUS_5_MEAN | STATUS_C_MEAN | STATUS_X_MEAN | STATUS_nan_MEAN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 215354 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 215354 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 215354 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 215354 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 215354 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 51 columns
## Aggregation strategy
# 7.1 Numeric features
# Columns: Bureau + bureau_balance numeric features
# Last three columns are from bureau_balance
# Total: 11 + 3 = 14
num_aggregations = {
'DAYS_CREDIT': ['min', 'max', 'mean', 'var'],
'DAYS_CREDIT_ENDDATE': ['min', 'max', 'mean'],
'DAYS_CREDIT_UPDATE': ['mean'],
'CREDIT_DAY_OVERDUE': ['max', 'mean'],
'AMT_CREDIT_MAX_OVERDUE': ['mean'],
'AMT_CREDIT_SUM': ['max', 'mean', 'sum'],
'AMT_CREDIT_SUM_DEBT': ['max', 'mean', 'sum'],
'AMT_CREDIT_SUM_OVERDUE': ['mean'],
'AMT_CREDIT_SUM_LIMIT': ['mean', 'sum'],
'AMT_ANNUITY': ['max', 'mean'],
'CNT_CREDIT_PROLONG': ['sum'],
'MONTHS_BALANCE_MIN': ['min'],
'MONTHS_BALANCE_MAX': ['max'],
'MONTHS_BALANCE_SIZE': ['mean', 'sum']
}
len(num_aggregations) # 14
14
# 7.2 Bureau categorical features. Derived from:
# 'CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE',
# Total:
cat_aggregations = {}
bureau_cat # bureau_cat are newly created dummy columns
# but all are numerical columns
# 7.2.1
len(bureau_cat) # 26
['CREDIT_ACTIVE_Active',
'CREDIT_ACTIVE_Bad debt',
'CREDIT_ACTIVE_Closed',
'CREDIT_ACTIVE_Sold',
'CREDIT_ACTIVE_nan',
'CREDIT_CURRENCY_currency 1',
'CREDIT_CURRENCY_currency 2',
'CREDIT_CURRENCY_currency 3',
'CREDIT_CURRENCY_currency 4',
'CREDIT_CURRENCY_nan',
'CREDIT_TYPE_Another type of loan',
'CREDIT_TYPE_Car loan',
'CREDIT_TYPE_Cash loan (non-earmarked)',
'CREDIT_TYPE_Consumer credit',
'CREDIT_TYPE_Credit card',
'CREDIT_TYPE_Interbank credit',
'CREDIT_TYPE_Loan for business development',
'CREDIT_TYPE_Loan for purchase of shares (margin lending)',
'CREDIT_TYPE_Loan for the purchase of equipment',
'CREDIT_TYPE_Loan for working capital replenishment',
'CREDIT_TYPE_Microloan',
'CREDIT_TYPE_Mobile operator loan',
'CREDIT_TYPE_Mortgage',
'CREDIT_TYPE_Real estate loan',
'CREDIT_TYPE_Unknown type of loan',
'CREDIT_TYPE_nan']
26
# 7.2.2 For all these new dummy columns in bureau, we will
# take mean
for cat in bureau_cat: cat_aggregations[cat] = ['mean']
cat_aggregations
len(cat_aggregations) # 26
{'CREDIT_ACTIVE_Active': ['mean'],
'CREDIT_ACTIVE_Bad debt': ['mean'],
'CREDIT_ACTIVE_Closed': ['mean'],
'CREDIT_ACTIVE_Sold': ['mean'],
'CREDIT_ACTIVE_nan': ['mean'],
'CREDIT_CURRENCY_currency 1': ['mean'],
'CREDIT_CURRENCY_currency 2': ['mean'],
'CREDIT_CURRENCY_currency 3': ['mean'],
'CREDIT_CURRENCY_currency 4': ['mean'],
'CREDIT_CURRENCY_nan': ['mean'],
'CREDIT_TYPE_Another type of loan': ['mean'],
'CREDIT_TYPE_Car loan': ['mean'],
'CREDIT_TYPE_Cash loan (non-earmarked)': ['mean'],
'CREDIT_TYPE_Consumer credit': ['mean'],
'CREDIT_TYPE_Credit card': ['mean'],
'CREDIT_TYPE_Interbank credit': ['mean'],
'CREDIT_TYPE_Loan for business development': ['mean'],
'CREDIT_TYPE_Loan for purchase of shares (margin lending)': ['mean'],
'CREDIT_TYPE_Loan for the purchase of equipment': ['mean'],
'CREDIT_TYPE_Loan for working capital replenishment': ['mean'],
'CREDIT_TYPE_Microloan': ['mean'],
'CREDIT_TYPE_Mobile operator loan': ['mean'],
'CREDIT_TYPE_Mortgage': ['mean'],
'CREDIT_TYPE_Real estate loan': ['mean'],
'CREDIT_TYPE_Unknown type of loan': ['mean'],
'CREDIT_TYPE_nan': ['mean']}
26
# 7.3.1 In addition, we have in bureau. columns that merged
# from 'bb' ie bb_cat
# So here is our full list
bb_cat
len(bb_cat) # 9
# 7.3.2
for cat in bb_cat: cat_aggregations[cat + "_MEAN"] = ['mean']
cat_aggregations
len(cat_aggregations) # 26 + 9 = 35
['STATUS_0',
'STATUS_1',
'STATUS_2',
'STATUS_3',
'STATUS_4',
'STATUS_5',
'STATUS_C',
'STATUS_X',
'STATUS_nan']
9
{'CREDIT_ACTIVE_Active': ['mean'],
'CREDIT_ACTIVE_Bad debt': ['mean'],
'CREDIT_ACTIVE_Closed': ['mean'],
'CREDIT_ACTIVE_Sold': ['mean'],
'CREDIT_ACTIVE_nan': ['mean'],
'CREDIT_CURRENCY_currency 1': ['mean'],
'CREDIT_CURRENCY_currency 2': ['mean'],
'CREDIT_CURRENCY_currency 3': ['mean'],
'CREDIT_CURRENCY_currency 4': ['mean'],
'CREDIT_CURRENCY_nan': ['mean'],
'CREDIT_TYPE_Another type of loan': ['mean'],
'CREDIT_TYPE_Car loan': ['mean'],
'CREDIT_TYPE_Cash loan (non-earmarked)': ['mean'],
'CREDIT_TYPE_Consumer credit': ['mean'],
'CREDIT_TYPE_Credit card': ['mean'],
'CREDIT_TYPE_Interbank credit': ['mean'],
'CREDIT_TYPE_Loan for business development': ['mean'],
'CREDIT_TYPE_Loan for purchase of shares (margin lending)': ['mean'],
'CREDIT_TYPE_Loan for the purchase of equipment': ['mean'],
'CREDIT_TYPE_Loan for working capital replenishment': ['mean'],
'CREDIT_TYPE_Microloan': ['mean'],
'CREDIT_TYPE_Mobile operator loan': ['mean'],
'CREDIT_TYPE_Mortgage': ['mean'],
'CREDIT_TYPE_Real estate loan': ['mean'],
'CREDIT_TYPE_Unknown type of loan': ['mean'],
'CREDIT_TYPE_nan': ['mean'],
'STATUS_0_MEAN': ['mean'],
'STATUS_1_MEAN': ['mean'],
'STATUS_2_MEAN': ['mean'],
'STATUS_3_MEAN': ['mean'],
'STATUS_4_MEAN': ['mean'],
'STATUS_5_MEAN': ['mean'],
'STATUS_C_MEAN': ['mean'],
'STATUS_X_MEAN': ['mean'],
'STATUS_nan_MEAN': ['mean']}
35
# 7.4 Have a look at bureau columns again
# Just to compare above results with what
# already exists
bureau.columns # 51
len(bureau.columns) # 35 (dummy) + 14 (num) + 1 (SK_ID_CURR) + 1 (DAYS_ENDDATE_FACT) = 51
Index(['SK_ID_CURR', 'DAYS_CREDIT', 'CREDIT_DAY_OVERDUE',
'DAYS_CREDIT_ENDDATE', 'DAYS_ENDDATE_FACT', 'AMT_CREDIT_MAX_OVERDUE',
'CNT_CREDIT_PROLONG', 'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT',
'AMT_CREDIT_SUM_LIMIT', 'AMT_CREDIT_SUM_OVERDUE', 'DAYS_CREDIT_UPDATE',
'AMT_ANNUITY', 'CREDIT_ACTIVE_Active', 'CREDIT_ACTIVE_Bad debt',
'CREDIT_ACTIVE_Closed', 'CREDIT_ACTIVE_Sold', 'CREDIT_ACTIVE_nan',
'CREDIT_CURRENCY_currency 1', 'CREDIT_CURRENCY_currency 2',
'CREDIT_CURRENCY_currency 3', 'CREDIT_CURRENCY_currency 4',
'CREDIT_CURRENCY_nan', 'CREDIT_TYPE_Another type of loan',
'CREDIT_TYPE_Car loan', 'CREDIT_TYPE_Cash loan (non-earmarked)',
'CREDIT_TYPE_Consumer credit', 'CREDIT_TYPE_Credit card',
'CREDIT_TYPE_Interbank credit',
'CREDIT_TYPE_Loan for business development',
'CREDIT_TYPE_Loan for purchase of shares (margin lending)',
'CREDIT_TYPE_Loan for the purchase of equipment',
'CREDIT_TYPE_Loan for working capital replenishment',
'CREDIT_TYPE_Microloan', 'CREDIT_TYPE_Mobile operator loan',
'CREDIT_TYPE_Mortgage', 'CREDIT_TYPE_Real estate loan',
'CREDIT_TYPE_Unknown type of loan', 'CREDIT_TYPE_nan',
'MONTHS_BALANCE_MIN', 'MONTHS_BALANCE_MAX', 'MONTHS_BALANCE_SIZE',
'STATUS_0_MEAN', 'STATUS_1_MEAN', 'STATUS_2_MEAN', 'STATUS_3_MEAN',
'STATUS_4_MEAN', 'STATUS_5_MEAN', 'STATUS_C_MEAN', 'STATUS_X_MEAN',
'STATUS_nan_MEAN'],
dtype='object')
51
# 7.5 Now that we have decided
# our aggregation strategy for each column
# (except 2), let us now aggregate:
# Note that SK_ID_CURR now becomes an index to data
grouped = bureau.groupby('SK_ID_CURR')
bureau_agg = grouped.agg({**num_aggregations, **cat_aggregations})
# 7.6
bureau_agg.head()
bureau_agg.shape # (305811, 62) (including newly created min, max etc columns)
DAYS_CREDIT | DAYS_CREDIT_ENDDATE | DAYS_CREDIT_UPDATE | CREDIT_DAY_OVERDUE | ... | CREDIT_TYPE_nan | STATUS_0_MEAN | STATUS_1_MEAN | STATUS_2_MEAN | STATUS_3_MEAN | STATUS_4_MEAN | STATUS_5_MEAN | STATUS_C_MEAN | STATUS_X_MEAN | STATUS_nan_MEAN | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
min | max | mean | var | min | max | mean | mean | max | mean | ... | mean | mean | mean | mean | mean | mean | mean | mean | mean | mean | |
SK_ID_CURR | |||||||||||||||||||||
100001 | -1572 | -49 | -735.000000 | 240043.666667 | -1329.0 | 1778.0 | 82.428571 | -93.142857 | 0 | 0.0 | ... | 0 | 0.336651 | 0.007519 | 0.0 | 0.0 | 0.0 | 0.0 | 0.441240 | 0.214590 | 0.0 |
100002 | -1437 | -103 | -874.000000 | 186150.000000 | -1072.0 | 780.0 | -349.000000 | -499.875000 | 0 | 0.0 | ... | 0 | 0.406960 | 0.255682 | 0.0 | 0.0 | 0.0 | 0.0 | 0.175426 | 0.161932 | 0.0 |
100003 | -2586 | -606 | -1400.750000 | 827783.583333 | -2434.0 | 1216.0 | -544.500000 | -816.000000 | 0 | 0.0 | ... | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
100004 | -1326 | -408 | -867.000000 | 421362.000000 | -595.0 | -382.0 | -488.500000 | -532.000000 | 0 | 0.0 | ... | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
100005 | -373 | -62 | -190.666667 | 26340.333333 | -128.0 | 1324.0 | 439.333333 | -54.333333 | 0 | 0.0 | ... | 0 | 0.735043 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.128205 | 0.136752 | 0.0 |
5 rows × 62 columns
(305811, 62)
# 7.7 Remove hierarchical index from bureau_agg
bureau_agg.columns # 62
bureau_agg.columns = pd.Index(['BURO_' + e[0] + "_" + e[1].upper() for e in bureau_agg.columns.tolist()])
MultiIndex([( 'DAYS_CREDIT', ...),
( 'DAYS_CREDIT', ...),
( 'DAYS_CREDIT', ...),
( 'DAYS_CREDIT', ...),
( 'DAYS_CREDIT_ENDDATE', ...),
( 'DAYS_CREDIT_ENDDATE', ...),
( 'DAYS_CREDIT_ENDDATE', ...),
( 'DAYS_CREDIT_UPDATE', ...),
( 'CREDIT_DAY_OVERDUE', ...),
( 'CREDIT_DAY_OVERDUE', ...),
( 'AMT_CREDIT_MAX_OVERDUE', ...),
( 'AMT_CREDIT_SUM', ...),
( 'AMT_CREDIT_SUM', ...),
( 'AMT_CREDIT_SUM', ...),
( 'AMT_CREDIT_SUM_DEBT', ...),
( 'AMT_CREDIT_SUM_DEBT', ...),
( 'AMT_CREDIT_SUM_DEBT', ...),
( 'AMT_CREDIT_SUM_OVERDUE', ...),
( 'AMT_CREDIT_SUM_LIMIT', ...),
( 'AMT_CREDIT_SUM_LIMIT', ...),
( 'AMT_ANNUITY', ...),
( 'AMT_ANNUITY', ...),
( 'CNT_CREDIT_PROLONG', ...),
( 'MONTHS_BALANCE_MIN', ...),
( 'MONTHS_BALANCE_MAX', ...),
( 'MONTHS_BALANCE_SIZE', ...),
( 'MONTHS_BALANCE_SIZE', ...),
( 'CREDIT_ACTIVE_Active', ...),
( 'CREDIT_ACTIVE_Bad debt', ...),
( 'CREDIT_ACTIVE_Closed', ...),
( 'CREDIT_ACTIVE_Sold', ...),
( 'CREDIT_ACTIVE_nan', ...),
( 'CREDIT_CURRENCY_currency 1', ...),
( 'CREDIT_CURRENCY_currency 2', ...),
( 'CREDIT_CURRENCY_currency 3', ...),
( 'CREDIT_CURRENCY_currency 4', ...),
( 'CREDIT_CURRENCY_nan', ...),
( 'CREDIT_TYPE_Another type of loan', ...),
( 'CREDIT_TYPE_Car loan', ...),
( 'CREDIT_TYPE_Cash loan (non-earmarked)', ...),
( 'CREDIT_TYPE_Consumer credit', ...),
( 'CREDIT_TYPE_Credit card', ...),
( 'CREDIT_TYPE_Interbank credit', ...),
( 'CREDIT_TYPE_Loan for business development', ...),
('CREDIT_TYPE_Loan for purchase of shares (margin lending)', ...),
( 'CREDIT_TYPE_Loan for the purchase of equipment', ...),
( 'CREDIT_TYPE_Loan for working capital replenishment', ...),
( 'CREDIT_TYPE_Microloan', ...),
( 'CREDIT_TYPE_Mobile operator loan', ...),
( 'CREDIT_TYPE_Mortgage', ...),
( 'CREDIT_TYPE_Real estate loan', ...),
( 'CREDIT_TYPE_Unknown type of loan', ...),
( 'CREDIT_TYPE_nan', ...),
( 'STATUS_0_MEAN', ...),
( 'STATUS_1_MEAN', ...),
( 'STATUS_2_MEAN', ...),
( 'STATUS_3_MEAN', ...),
( 'STATUS_4_MEAN', ...),
( 'STATUS_5_MEAN', ...),
( 'STATUS_C_MEAN', ...),
( 'STATUS_X_MEAN', ...),
( 'STATUS_nan_MEAN', ...)],
)
# 7.8
bureau_agg.head()
# 7.8.1 Note that SK_ID_CURR is now an index to table
bureau_agg.columns # 62: Due to creation of min, max, var etc columns
BURO_DAYS_CREDIT_MIN | BURO_DAYS_CREDIT_MAX | BURO_DAYS_CREDIT_MEAN | BURO_DAYS_CREDIT_VAR | BURO_DAYS_CREDIT_ENDDATE_MIN | BURO_DAYS_CREDIT_ENDDATE_MAX | BURO_DAYS_CREDIT_ENDDATE_MEAN | BURO_DAYS_CREDIT_UPDATE_MEAN | BURO_CREDIT_DAY_OVERDUE_MAX | BURO_CREDIT_DAY_OVERDUE_MEAN | ... | BURO_CREDIT_TYPE_nan_MEAN | BURO_STATUS_0_MEAN_MEAN | BURO_STATUS_1_MEAN_MEAN | BURO_STATUS_2_MEAN_MEAN | BURO_STATUS_3_MEAN_MEAN | BURO_STATUS_4_MEAN_MEAN | BURO_STATUS_5_MEAN_MEAN | BURO_STATUS_C_MEAN_MEAN | BURO_STATUS_X_MEAN_MEAN | BURO_STATUS_nan_MEAN_MEAN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SK_ID_CURR | |||||||||||||||||||||
100001 | -1572 | -49 | -735.000000 | 240043.666667 | -1329.0 | 1778.0 | 82.428571 | -93.142857 | 0 | 0.0 | ... | 0 | 0.336651 | 0.007519 | 0.0 | 0.0 | 0.0 | 0.0 | 0.441240 | 0.214590 | 0.0 |
100002 | -1437 | -103 | -874.000000 | 186150.000000 | -1072.0 | 780.0 | -349.000000 | -499.875000 | 0 | 0.0 | ... | 0 | 0.406960 | 0.255682 | 0.0 | 0.0 | 0.0 | 0.0 | 0.175426 | 0.161932 | 0.0 |
100003 | -2586 | -606 | -1400.750000 | 827783.583333 | -2434.0 | 1216.0 | -544.500000 | -816.000000 | 0 | 0.0 | ... | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
100004 | -1326 | -408 | -867.000000 | 421362.000000 | -595.0 | -382.0 | -488.500000 | -532.000000 | 0 | 0.0 | ... | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
100005 | -373 | -62 | -190.666667 | 26340.333333 | -128.0 | 1324.0 | 439.333333 | -54.333333 | 0 | 0.0 | ... | 0 | 0.735043 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.128205 | 0.136752 | 0.0 |
5 rows × 62 columns
Index(['BURO_DAYS_CREDIT_MIN', 'BURO_DAYS_CREDIT_MAX', 'BURO_DAYS_CREDIT_MEAN',
'BURO_DAYS_CREDIT_VAR', 'BURO_DAYS_CREDIT_ENDDATE_MIN',
'BURO_DAYS_CREDIT_ENDDATE_MAX', 'BURO_DAYS_CREDIT_ENDDATE_MEAN',
'BURO_DAYS_CREDIT_UPDATE_MEAN', 'BURO_CREDIT_DAY_OVERDUE_MAX',
'BURO_CREDIT_DAY_OVERDUE_MEAN', 'BURO_AMT_CREDIT_MAX_OVERDUE_MEAN',
'BURO_AMT_CREDIT_SUM_MAX', 'BURO_AMT_CREDIT_SUM_MEAN',
'BURO_AMT_CREDIT_SUM_SUM', 'BURO_AMT_CREDIT_SUM_DEBT_MAX',
'BURO_AMT_CREDIT_SUM_DEBT_MEAN', 'BURO_AMT_CREDIT_SUM_DEBT_SUM',
'BURO_AMT_CREDIT_SUM_OVERDUE_MEAN', 'BURO_AMT_CREDIT_SUM_LIMIT_MEAN',
'BURO_AMT_CREDIT_SUM_LIMIT_SUM', 'BURO_AMT_ANNUITY_MAX',
'BURO_AMT_ANNUITY_MEAN', 'BURO_CNT_CREDIT_PROLONG_SUM',
'BURO_MONTHS_BALANCE_MIN_MIN', 'BURO_MONTHS_BALANCE_MAX_MAX',
'BURO_MONTHS_BALANCE_SIZE_MEAN', 'BURO_MONTHS_BALANCE_SIZE_SUM',
'BURO_CREDIT_ACTIVE_Active_MEAN', 'BURO_CREDIT_ACTIVE_Bad debt_MEAN',
'BURO_CREDIT_ACTIVE_Closed_MEAN', 'BURO_CREDIT_ACTIVE_Sold_MEAN',
'BURO_CREDIT_ACTIVE_nan_MEAN', 'BURO_CREDIT_CURRENCY_currency 1_MEAN',
'BURO_CREDIT_CURRENCY_currency 2_MEAN',
'BURO_CREDIT_CURRENCY_currency 3_MEAN',
'BURO_CREDIT_CURRENCY_currency 4_MEAN', 'BURO_CREDIT_CURRENCY_nan_MEAN',
'BURO_CREDIT_TYPE_Another type of loan_MEAN',
'BURO_CREDIT_TYPE_Car loan_MEAN',
'BURO_CREDIT_TYPE_Cash loan (non-earmarked)_MEAN',
'BURO_CREDIT_TYPE_Consumer credit_MEAN',
'BURO_CREDIT_TYPE_Credit card_MEAN',
'BURO_CREDIT_TYPE_Interbank credit_MEAN',
'BURO_CREDIT_TYPE_Loan for business development_MEAN',
'BURO_CREDIT_TYPE_Loan for purchase of shares (margin lending)_MEAN',
'BURO_CREDIT_TYPE_Loan for the purchase of equipment_MEAN',
'BURO_CREDIT_TYPE_Loan for working capital replenishment_MEAN',
'BURO_CREDIT_TYPE_Microloan_MEAN',
'BURO_CREDIT_TYPE_Mobile operator loan_MEAN',
'BURO_CREDIT_TYPE_Mortgage_MEAN',
'BURO_CREDIT_TYPE_Real estate loan_MEAN',
'BURO_CREDIT_TYPE_Unknown type of loan_MEAN',
'BURO_CREDIT_TYPE_nan_MEAN', 'BURO_STATUS_0_MEAN_MEAN',
'BURO_STATUS_1_MEAN_MEAN', 'BURO_STATUS_2_MEAN_MEAN',
'BURO_STATUS_3_MEAN_MEAN', 'BURO_STATUS_4_MEAN_MEAN',
'BURO_STATUS_5_MEAN_MEAN', 'BURO_STATUS_C_MEAN_MEAN',
'BURO_STATUS_X_MEAN_MEAN', 'BURO_STATUS_nan_MEAN_MEAN'],
dtype='object')
# 7.9 No duplicate index
bureau_agg.index.nunique() # 305811
len(set(bureau_agg.index)) # 305811
305811
305811
More Aggregation and merger¶
We now filter bureau on CREDIT_ACTIVE_Active feature. This will create two subsets of data. This feature has values of 1 and 0.
Filter data where CREDIT_ACTIVE_Active value is 1. Then aggregate(only) numeric features of this filtered data-subset by grouping on SK_ID_CURR. Next, filter, bureau, on CREDIT_ACTIVE_Closed = 1 . And again aggregate the subset on numeric features. Merge all these with bureau_agg (NOT bureau.)
It is as if we are trying to extract the behaviour of those whose credits are active and those whose credits are closed.
# 8.0 In which cases credit is active? Filter data
active = bureau[bureau['CREDIT_ACTIVE_Active'] == 1]
active.head()
active.shape # (630607, 51)
SK_ID_CURR | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | ... | MONTHS_BALANCE_SIZE | STATUS_0_MEAN | STATUS_1_MEAN | STATUS_2_MEAN | STATUS_3_MEAN | STATUS_4_MEAN | STATUS_5_MEAN | STATUS_C_MEAN | STATUS_X_MEAN | STATUS_nan_MEAN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 215354 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.00 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 215354 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 215354 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 215354 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | 215354 | -273 | 0 | 27460.0 | NaN | 0.0 | 0 | 180000.0 | 71017.38 | 108982.62 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 51 columns
(630607, 51)
# 8.1 Aggregate numercial features of the filtered subset over SK_ID_CURR
active_agg = active.groupby('SK_ID_CURR').agg(num_aggregations)
# 8.1.1
active_agg.head()
active_agg.shape # (251815, 27)
DAYS_CREDIT | DAYS_CREDIT_ENDDATE | DAYS_CREDIT_UPDATE | CREDIT_DAY_OVERDUE | ... | AMT_CREDIT_SUM_OVERDUE | AMT_CREDIT_SUM_LIMIT | AMT_ANNUITY | CNT_CREDIT_PROLONG | MONTHS_BALANCE_MIN | MONTHS_BALANCE_MAX | MONTHS_BALANCE_SIZE | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
min | max | mean | var | min | max | mean | mean | max | mean | ... | mean | mean | sum | max | mean | sum | min | max | mean | sum | |
SK_ID_CURR | |||||||||||||||||||||
100001 | -559 | -49 | -309.333333 | 65110.333333 | 411.0 | 1778.0 | 1030.333333 | -10.666667 | 0 | 0.0 | ... | 0.0 | 0.0000 | 0.000 | 10822.5 | 8272.50 | 0 | -18.0 | 0.0 | 10.666667 | 32.0 |
100002 | -1042 | -103 | -572.500000 | 440860.500000 | 780.0 | 780.0 | 780.000000 | -15.500000 | 0 | 0.0 | ... | 0.0 | 15994.2825 | 31988.565 | 0.0 | 0.00 | 0 | -34.0 | 0.0 | 10.000000 | 20.0 |
100003 | -606 | -606 | -606.000000 | NaN | 1216.0 | 1216.0 | 1216.000000 | -43.000000 | 0 | 0.0 | ... | 0.0 | 810000.0000 | 810000.000 | NaN | NaN | 0 | NaN | NaN | NaN | 0.0 |
100005 | -137 | -62 | -99.500000 | 2812.500000 | 122.0 | 1324.0 | 723.000000 | -21.000000 | 0 | 0.0 | ... | 0.0 | 0.0000 | 0.000 | 4261.5 | 2130.75 | 0 | -4.0 | 0.0 | 4.000000 | 8.0 |
100008 | -78 | -78 | -78.000000 | NaN | 471.0 | 471.0 | 471.000000 | -16.000000 | 0 | 0.0 | ... | 0.0 | 0.0000 | 0.000 | NaN | NaN | 0 | NaN | NaN | NaN | 0.0 |
5 rows × 27 columns
(251815, 27)
# 8.1.2 Rename multi-indexed columns
active_agg.columns = pd.Index(['ACTIVE_' + e[0] + "_" + e[1].upper() for e in active_agg.columns.tolist()])
active_agg.columns
Index(['ACTIVE_DAYS_CREDIT_MIN', 'ACTIVE_DAYS_CREDIT_MAX',
'ACTIVE_DAYS_CREDIT_MEAN', 'ACTIVE_DAYS_CREDIT_VAR',
'ACTIVE_DAYS_CREDIT_ENDDATE_MIN', 'ACTIVE_DAYS_CREDIT_ENDDATE_MAX',
'ACTIVE_DAYS_CREDIT_ENDDATE_MEAN', 'ACTIVE_DAYS_CREDIT_UPDATE_MEAN',
'ACTIVE_CREDIT_DAY_OVERDUE_MAX', 'ACTIVE_CREDIT_DAY_OVERDUE_MEAN',
'ACTIVE_AMT_CREDIT_MAX_OVERDUE_MEAN', 'ACTIVE_AMT_CREDIT_SUM_MAX',
'ACTIVE_AMT_CREDIT_SUM_MEAN', 'ACTIVE_AMT_CREDIT_SUM_SUM',
'ACTIVE_AMT_CREDIT_SUM_DEBT_MAX', 'ACTIVE_AMT_CREDIT_SUM_DEBT_MEAN',
'ACTIVE_AMT_CREDIT_SUM_DEBT_SUM', 'ACTIVE_AMT_CREDIT_SUM_OVERDUE_MEAN',
'ACTIVE_AMT_CREDIT_SUM_LIMIT_MEAN', 'ACTIVE_AMT_CREDIT_SUM_LIMIT_SUM',
'ACTIVE_AMT_ANNUITY_MAX', 'ACTIVE_AMT_ANNUITY_MEAN',
'ACTIVE_CNT_CREDIT_PROLONG_SUM', 'ACTIVE_MONTHS_BALANCE_MIN_MIN',
'ACTIVE_MONTHS_BALANCE_MAX_MAX', 'ACTIVE_MONTHS_BALANCE_SIZE_MEAN',
'ACTIVE_MONTHS_BALANCE_SIZE_SUM'],
dtype='object')
# 9.0 Difference between length of two datasets
active_agg_set = set(active_agg.index)
bureau_agg_set = set(bureau_agg.index)
len(bureau_agg_set) # 305811
len(active_agg_set) # 251815
list(bureau_agg_set - active_agg_set)[:4] # Few examples: {131074, 393220, 262149, 262153]
305811
251815
[131074, 393220, 262149, 262153]
# 9.1 Merge bureau_agg with active_agg over 'SK_ID_CURR'
bureau_agg = bureau_agg.join(
active_agg,
how='left',
on='SK_ID_CURR'
)
# 9.2
bureau_agg.shape # (305811, 89)
(305811, 89)
# 9.3 Obviouly some rows will hold NaN values for merged columns
bureau_agg.loc[[131074,393220,262149, 262153]]
BURO_DAYS_CREDIT_MIN | BURO_DAYS_CREDIT_MAX | BURO_DAYS_CREDIT_MEAN | BURO_DAYS_CREDIT_VAR | BURO_DAYS_CREDIT_ENDDATE_MIN | BURO_DAYS_CREDIT_ENDDATE_MAX | BURO_DAYS_CREDIT_ENDDATE_MEAN | BURO_DAYS_CREDIT_UPDATE_MEAN | BURO_CREDIT_DAY_OVERDUE_MAX | BURO_CREDIT_DAY_OVERDUE_MEAN | ... | ACTIVE_AMT_CREDIT_SUM_OVERDUE_MEAN | ACTIVE_AMT_CREDIT_SUM_LIMIT_MEAN | ACTIVE_AMT_CREDIT_SUM_LIMIT_SUM | ACTIVE_AMT_ANNUITY_MAX | ACTIVE_AMT_ANNUITY_MEAN | ACTIVE_CNT_CREDIT_PROLONG_SUM | ACTIVE_MONTHS_BALANCE_MIN_MIN | ACTIVE_MONTHS_BALANCE_MAX_MAX | ACTIVE_MONTHS_BALANCE_SIZE_MEAN | ACTIVE_MONTHS_BALANCE_SIZE_SUM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SK_ID_CURR | |||||||||||||||||||||
131074 | -2419 | -2419 | -2419.0 | NaN | -1301.0 | -1301.0 | -1301.0 | -1707.0 | 0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
393220 | -478 | -478 | -478.0 | NaN | NaN | NaN | NaN | -288.0 | 0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
262149 | -373 | -373 | -373.0 | NaN | -251.0 | -251.0 | -251.0 | -243.0 | 0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
262153 | -1677 | -1677 | -1677.0 | NaN | -1589.0 | -1589.0 | -1589.0 | -1012.0 | 0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 rows × 89 columns
# 9.4 Release memory
del active, active_agg
gc.collect()
182
# 10.0 Same steps for the CREDIT_ACTIVE_Closed =1 cases
# Bureau: Closed credits - using only numerical aggregations
closed = bureau[bureau['CREDIT_ACTIVE_Closed'] == 1]
closed_agg = closed.groupby('SK_ID_CURR').agg(num_aggregations)
closed_agg.columns = pd.Index(['CLOSED_' + e[0] + "_" + e[1].upper() for e in closed_agg.columns.tolist()])
bureau_agg = bureau_agg.join(closed_agg, how='left', on='SK_ID_CURR')
# 10.1
bureau_agg.shape # (305811, 116)
(305811, 116)
# 10.2
del closed, closed_agg, bureau
gc.collect()
68
# 10.3 SK_ID_CURR is index. Index is also saved by-default.
bureau_agg.to_csv("processed_bureau_agg.csv.zip", compression = "zip")
##################