Bureau and Bureau Balance data¶

*bureau.csv* data concerns client's earlier credits from other financial institutions. Some of the credits may be active and some are closed. Each previous (or ongoing) credit has its own row (only one row per credit) in *bureau* dataset. As a single client might have taken other loans from other financial institutions, for each row in the *application_train* data (ie *application_train.csv*) we can have multiple rows in this table. Feature explanations for this dataset are as below.

Feature explanations¶

Bureau table¶

SK_ID_CURR: ID of loan in our sample - one loan in our sample can have 0,1,2 or more related previous credits in credit bureau
SK_BUREAU_ID: Recoded ID of previous Credit Bureau credit related to our loan (unique coding for each loan application)
CREDIT_ACTIVE: Status of the Credit Bureau (CB) reported credits
CREDIT_CURRENCY: Recoded currency of the Credit Bureau credit
DAYS_CREDIT: How many days before current application did client apply for Credit Bureau credit
CREDIT_DAY_OVERDUE: Number of days past due on CB credit at the time of application for related loan in our sample
DAYS_CREDIT_ENDDATE: Remaining duration of CB credit (in days) at the time of application in Home Credit
DAYS_ENDDATE_FACT: Days since CB credit ended at the time of application in Home Credit (only for closed credit)
AMT_CREDIT_MAX_OVERDUE: Maximal amount overdue on the Credit Bureau credit so far (at application date of loan in our sample)
CNT_CREDIT_PROLONG: How many times was the Credit Bureau credit prolonged
AMT_CREDIT_SUM: Current credit amount for the Credit Bureau credit
AMT_CREDIT_SUM_DEBT: Current debt on Credit Bureau credit
AMT_CREDIT_SUM_LIMIT: Current credit limit of credit card reported in Credit Bureau
AMT_CREDIT_SUM_OVERDUE: Current amount overdue on Credit Bureau credit
CREDIT_TYPE: Type of Credit Bureau credit (Car, cash,...)
DAYS_CREDIT_UPDATE: How many days before loan application did last information about the Credit Bureau credit come
AMT_ANNUITY: Annuity of the Credit Bureau credit

Bureau Balance table¶

SK_BUREAU_ID: Recoded ID of Credit Bureau credit (unique coding for each application) - use this to join to CREDIT_BUREAU table
MONTHS_BALANCE: Month of balance relative to application date (-1 means the freshest balance date) time only relative to the application
STATUS: Status of Credit Bureau loan during the month

# Last amended: 21st October, 2020
# Myfolder: C:\Users\Administrator\OneDrive\Documents\home_credit_default_risk
# Objective: 
#           Solving Kaggle problem: Home Credit Default Risk
#           Processing bureau and bureau_balance datasets.
#
# Data Source: https://www.kaggle.com/c/home-credit-default-risk/data
# Ref: https://www.kaggle.com/jsaguiar/lightgbm-with-simple-features

# 1.0 Libraries
#     (Some of these may not be needed here.)
%reset -f
import numpy as np
import pandas as pd
import gc

# 1.1 Reduce read data size
#     There is a file reducing.py
#      in this folder. A class
#       in it is used to reduce
#        dataframe size
#     (Code modified to
#      exclude 'category' dtype)
import reducing

# 1.2 Misc
import warnings
import os
warnings.simplefilter(action='ignore', category=FutureWarning)

# 1.3
pd.set_option('display.max_colwidth', -1)

# 1.4 Display multiple commands outputs from a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# 2.0 One-hot encoding function. Uses pd.get_dummies()
#     i) To transform 'object' columns to dummies. 
#    ii) Treat NaN as one of the categories
#   iii) Returns transformed-data and new-columns created

def one_hot_encoder(df, nan_as_category = True):
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df,
                        columns= categorical_columns,
                        dummy_na= nan_as_category       # Treat NaNs as category
                       )
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns

# 3.0 Prepare to read data
pathToData = "C:\\Users\\Administrator\\OneDrive\\Documents\\home_credit_default_risk"
os.chdir(pathToData)

# 3.1 Some constants
num_rows=None                # Implies read all rows
nan_as_category = True       # While transforming 
                             #   'object' columns to dummies

# 3.2 Read bureau data first
bureau = pd.read_csv(
                     'bureau.csv.zip',
                     nrows = None    # Read all rows
                    )

# 3.2.1 Reduce memory usage by appropriately
#       changing data-types per feature:

bureau = reducing.Reducer().reduce(bureau)

reduced df from 222.6203 MB to 165.3284 MB in 1.56 seconds

# 3.2.2 Explore data now
bureau.head(5)
bureau.shape   # (rows:17,16,428, cols: 17)
bureau.dtypes

	SK_ID_CURR	SK_ID_BUREAU	CREDIT_ACTIVE	CREDIT_CURRENCY	DAYS_CREDIT	DAYS_CREDIT_ENDDATE	DAYS_ENDDATE_FACT	AMT_CREDIT_MAX_OVERDUE	AMT_CREDIT_SUM	AMT_CREDIT_SUM_DEBT	AMT_CREDIT_SUM_LIMIT	CREDIT_TYPE	DAYS_CREDIT_UPDATE	AMT_ANNUITY
0	215354	5714462	Closed	currency 1	-497	-153.0	-153.0	NaN	91323.0	0.0	NaN	Consumer credit	-131	NaN
1	215354	5714463	Active	currency 1	-208	1075.0	NaN	NaN	225000.0	171342.0	NaN	Credit card	-20	NaN
2	215354	5714464	Active	currency 1	-203	528.0	NaN	NaN	464323.5	NaN	NaN	Consumer credit	-16	NaN
3	215354	5714465	Active	currency 1	-203	NaN	NaN	NaN	90000.0	NaN	NaN	Credit card	-16	NaN
4	215354	5714466	Active	currency 1	-629	1197.0	NaN	77674.5	2700000.0	NaN	NaN	Consumer credit	-21	NaN

(1716428, 17)

SK_ID_CURR                uint32 
SK_ID_BUREAU              uint32 
CREDIT_ACTIVE             object 
CREDIT_CURRENCY           object 
DAYS_CREDIT               int16  
CREDIT_DAY_OVERDUE        uint16 
DAYS_CREDIT_ENDDATE       float64
DAYS_ENDDATE_FACT         float64
AMT_CREDIT_MAX_OVERDUE    float64
CNT_CREDIT_PROLONG        uint8  
AMT_CREDIT_SUM            float64
AMT_CREDIT_SUM_DEBT       float64
AMT_CREDIT_SUM_LIMIT      float64
AMT_CREDIT_SUM_OVERDUE    float32
CREDIT_TYPE               object 
DAYS_CREDIT_UPDATE        int32  
AMT_ANNUITY               float64
dtype: object

# 3.2.3 In all, how many are categoricals?
bureau.dtypes.value_counts()

float64    7
object     3
uint32     2
float32    1
int32      1
uint16     1
uint8      1
int16      1
dtype: int64

# 3.3
bureau.shape                       # (1716428, 17)

# 3.3.1
# What is the actual number of persons
#  who might have taken multiple loans?

bureau['SK_ID_CURR'].nunique()     # 305811  -- Many duplicate values exist
                                   #            Consider SK_ID_CURR as Foreign Key
                                   #            Primary key exists in application_train data
                                   # Primary key: SK_ID_BUREAU
            
# 3.3.2
# As expected, there are no duplicate values here
bureau['SK_ID_BUREAU'].nunique()   # 1716428 -- Unique id for each row 

(1716428, 17)

# 3.4 Summary of active/closed cases from bureau
# We aggregate on these also
bureau['CREDIT_ACTIVE'].value_counts()

Closed      1079273
Active      630607 
Sold        6527   
Bad debt    21     
Name: CREDIT_ACTIVE, dtype: int64

Aggregation¶

bureau_balance will be aggregated and merged with bureau. bureau will then be aggregated and merged with 'application_train' data. bureau will be aggregated in three different ways. This aggregation will be by SK_ID_CURR. Finally, aggregated bureau, called bureau_agg, will be merged with 'application_train' over (SK_ID_CURR).
Aggregation over time is one way to extract behaviour of client. All categorical data is first OneHotEncoded (OHE). What is unique about this OHE is that NaN values are treated as categories.
# 4.0 OneHotEncode 'object' types in bureau
bureau, bureau_cat = one_hot_encoder(bureau, nan_as_category)
# 4.1
bureau.head()
bureau.shape          # (1716428, 40); 17-->40
print(bureau_cat)     # List of added columns
SK_ID_CURR SK_ID_BUREAU DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT ... CREDIT_TYPE_Loan for business development CREDIT_TYPE_Loan for purchase of shares (margin lending) CREDIT_TYPE_Loan for the purchase of equipment CREDIT_TYPE_Loan for working capital replenishment CREDIT_TYPE_Microloan CREDIT_TYPE_Mobile operator loan CREDIT_TYPE_Mortgage CREDIT_TYPE_Real estate loan CREDIT_TYPE_Unknown type of loan CREDIT_TYPE_nan

0 215354 5714462 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 ... 0 0 0 0 0 0 0 0 0 0

1 215354 5714463 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 ... 0 0 0 0 0 0 0 0 0 0

2 215354 5714464 -203 0 528.0 NaN NaN 0 464323.5 NaN ... 0 0 0 0 0 0 0 0 0 0

3 215354 5714465 -203 0 NaN NaN NaN 0 90000.0 NaN ... 0 0 0 0 0 0 0 0 0 0

4 215354 5714466 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN ... 0 0 0 0 0 0 0 0 0 0

5 rows × 40 columns
(1716428, 40)
['CREDIT_ACTIVE_Active', 'CREDIT_ACTIVE_Bad debt', 'CREDIT_ACTIVE_Closed', 'CREDIT_ACTIVE_Sold', 'CREDIT_ACTIVE_nan', 'CREDIT_CURRENCY_currency 1', 'CREDIT_CURRENCY_currency 2', 'CREDIT_CURRENCY_currency 3', 'CREDIT_CURRENCY_currency 4', 'CREDIT_CURRENCY_nan', 'CREDIT_TYPE_Another type of loan', 'CREDIT_TYPE_Car loan', 'CREDIT_TYPE_Cash loan (non-earmarked)', 'CREDIT_TYPE_Consumer credit', 'CREDIT_TYPE_Credit card', 'CREDIT_TYPE_Interbank credit', 'CREDIT_TYPE_Loan for business development', 'CREDIT_TYPE_Loan for purchase of shares (margin lending)', 'CREDIT_TYPE_Loan for the purchase of equipment', 'CREDIT_TYPE_Loan for working capital replenishment', 'CREDIT_TYPE_Microloan', 'CREDIT_TYPE_Mobile operator loan', 'CREDIT_TYPE_Mortgage', 'CREDIT_TYPE_Real estate loan', 'CREDIT_TYPE_Unknown type of loan', 'CREDIT_TYPE_nan']

bureau_balance¶

It is monthly data about the remaining balance of each one of the previous credits of clients that exist in dataset bureau. Each previous credit is identified by a unique ID, SK_ID_BUREAU, in dataset bureau. Each row in bureau_balance is one month of credit-due (from previous credit), and a single previous credit can have multiple rows, one for each month of the credit length.
In my personal view, it should be in decreasing order. That is, for every person identified by SK_ID_BUREAU, credits should be decreasing each passing month.

# 5.0 Read over bureau_balance data
#     and reduce memory usage through
#     conversion of data-types:

bb = pd.read_csv('bureau_balance.csv.zip', nrows = None)
bb = reducing.Reducer().reduce(bb)

reduced df from 624.8458 MB to 338.4582 MB in 4.78 seconds

# 5.0.1 Display few rows 
bb.head(10)

# 5.0.2 & Compare
bb.shape      # (27299925, 3) 
bureau.shape  # (1716428, 17)

	SK_ID_BUREAU	MONTHS_BALANCE	STATUS
0	5715448	0	C
1	5715448	-1	C
2	5715448	-2	C
3	5715448	-3	C
4	5715448	-4	C
5	5715448	-5	C
6	5715448	-6	C
7	5715448	-7	C
8	5715448	-8	C
9	5715448	-9	0

(27299925, 3)

(1716428, 40)

# 5.1 There is just one 'object' column
bb.dtypes.value_counts()

object    1
int8      1
uint32    1
dtype: int64

# 5.2 Is the data about all bureau cases?
#      No, it appears it is not for all cases

bb['SK_ID_BUREAU'].nunique()    # 817395 << 1716428

# 5.3 Just which cases are present in 'bureau' but absent
#     in 'bb'
bb_id_set = set(bb['SK_ID_BUREAU'])             # Set of IDs in bb
bureau_id_set = set(bureau['SK_ID_BUREAU'])     # Set of IDs in bureau

# 5.4 And here is the difference list.
#      How many of them? 
list(bureau_id_set - bb_id_set)[:5]      # sample [6292791,6292792,6292793,6292795,6292796,6292797,6292798,6292799]
len(bureau_id_set - bb_id_set) # 942074

[6292791, 6292792, 6292793, 6292795, 6292796]

# 5.5 OK. So let us OneHotEncode bb
bb, bb_cat = one_hot_encoder(bb, nan_as_category)

# 5.6 Examine the results
bb.head()
bb.shape   # (27299925, 11) ; 3-->11
           # 1 (ID) + 1 (numeric) + 9 (dummy)
bb_cat     # New columns added

	SK_ID_BUREAU	MONTHS_BALANCE	STATUS_C
0	5715448	0	1
1	5715448	-1	1
2	5715448	-2	1
3	5715448	-3	1
4	5715448	-4	1

(27299925, 11)

['STATUS_0',
 'STATUS_1',
 'STATUS_2',
 'STATUS_3',
 'STATUS_4',
 'STATUS_5',
 'STATUS_C',
 'STATUS_X',
 'STATUS_nan']

Performing aggregations in bb¶

There is one numeric feature: 'MONTHS_BALANCE'. On this feature we will perform ['min', 'max', 'size']. And on the rest of the features,dummy features, we will perform [mean]. Aggregation is by unique bureau ID, SK_ID_BUREAU. Resulting dataset is called bureau_agg.

# 6.0 Bureau balance: Perform aggregations and merge with bureau.csv
#     First prepare a dictionary listing operations to be performed
#     on various features:

bb_aggregations = {'MONTHS_BALANCE': ['min', 'max', 'size']}
for col in bb_cat:
    bb_aggregations[col] = ['mean']

# 6.0.1    
len(bb_aggregations)     # 10  

# 6.1 So what all aggregations to perform column-wise

bb_aggregations

{'MONTHS_BALANCE': ['min', 'max', 'size'],
 'STATUS_0': ['mean'],
 'STATUS_1': ['mean'],
 'STATUS_2': ['mean'],
 'STATUS_3': ['mean'],
 'STATUS_4': ['mean'],
 'STATUS_5': ['mean'],
 'STATUS_C': ['mean'],
 'STATUS_X': ['mean'],
 'STATUS_nan': ['mean']}

# 6.2 Perform aggregations now in bb:

grouped =  bb.groupby('SK_ID_BUREAU')
bb_agg = bb.groupby('SK_ID_BUREAU').agg(bb_aggregations)

# 6.3
bb_agg.shape      # (817395, 12)
bb_agg.columns

# 6.3.1 Note that 'SK_ID_BUREAU'
#       the grouping column is
#       now table-index

bb_agg.head()

(817395, 12)

MultiIndex([('MONTHS_BALANCE',  'min'),
            ('MONTHS_BALANCE',  'max'),
            ('MONTHS_BALANCE', 'size'),
            (      'STATUS_0', 'mean'),
            (      'STATUS_1', 'mean'),
            (      'STATUS_2', 'mean'),
            (      'STATUS_3', 'mean'),
            (      'STATUS_4', 'mean'),
            (      'STATUS_5', 'mean'),
            (      'STATUS_C', 'mean'),
            (      'STATUS_X', 'mean'),
            (    'STATUS_nan', 'mean')],
           )

	MONTHS_BALANCE			STATUS_0	STATUS_1	STATUS_2	STATUS_3	STATUS_4	STATUS_5	STATUS_C	STATUS_X	STATUS_nan
	min	max	size	mean	mean	mean	mean	mean	mean	mean	mean	mean
SK_ID_BUREAU
5001709	-96	0	97	0.000000	0.0	0.0	0.0	0.0	0.0	0.886598	0.113402	0
5001710	-82	0	83	0.060241	0.0	0.0	0.0	0.0	0.0	0.578313	0.361446	0
5001711	-3	0	4	0.750000	0.0	0.0	0.0	0.0	0.0	0.000000	0.250000	0
5001712	-18	0	19	0.526316	0.0	0.0	0.0	0.0	0.0	0.473684	0.000000	0
5001713	-21	0	22	0.000000	0.0	0.0	0.0	0.0	0.0	0.000000	1.000000	0

# 6.4 Rename bb_agg columns
bb_agg.columns = pd.Index([e[0] + "_" + e[1].upper() for e in bb_agg.columns.tolist()])

# 6.4.1
bb_agg.columns.tolist()
bb_agg.head()

['MONTHS_BALANCE_MIN',
 'MONTHS_BALANCE_MAX',
 'MONTHS_BALANCE_SIZE',
 'STATUS_0_MEAN',
 'STATUS_1_MEAN',
 'STATUS_2_MEAN',
 'STATUS_3_MEAN',
 'STATUS_4_MEAN',
 'STATUS_5_MEAN',
 'STATUS_C_MEAN',
 'STATUS_X_MEAN',
 'STATUS_nan_MEAN']

	MONTHS_BALANCE_MIN	MONTHS_BALANCE_MAX	MONTHS_BALANCE_SIZE	STATUS_0_MEAN	STATUS_1_MEAN	STATUS_2_MEAN	STATUS_3_MEAN	STATUS_4_MEAN	STATUS_5_MEAN	STATUS_C_MEAN	STATUS_X_MEAN	STATUS_nan_MEAN
SK_ID_BUREAU
5001709	-96	0	97	0.000000	0.0	0.0	0.0	0.0	0.0	0.886598	0.113402	0
5001710	-82	0	83	0.060241	0.0	0.0	0.0	0.0	0.0	0.578313	0.361446	0
5001711	-3	0	4	0.750000	0.0	0.0	0.0	0.0	0.0	0.000000	0.250000	0
5001712	-18	0	19	0.526316	0.0	0.0	0.0	0.0	0.0	0.473684	0.000000	0
5001713	-21	0	22	0.000000	0.0	0.0	0.0	0.0	0.0	0.000000	1.000000	0

# 6.5 Merge aggregated bb with bureau

bureau = bureau.join(
                     bb_agg,
                     how='left',
                     on='SK_ID_BUREAU'
                    )

# 6.5.1

bureau.head()
bureau.shape   # (1716428, 52)
bureau.dtypes  

	SK_ID_CURR	SK_ID_BUREAU	DAYS_CREDIT	DAYS_CREDIT_ENDDATE	DAYS_ENDDATE_FACT	AMT_CREDIT_MAX_OVERDUE	AMT_CREDIT_SUM	AMT_CREDIT_SUM_DEBT	...	MONTHS_BALANCE_SIZE	STATUS_0_MEAN	STATUS_1_MEAN	STATUS_2_MEAN	STATUS_3_MEAN	STATUS_4_MEAN	STATUS_5_MEAN	STATUS_C_MEAN	STATUS_X_MEAN	STATUS_nan_MEAN
0	215354	5714462	-497	-153.0	-153.0	NaN	91323.0	0.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	215354	5714463	-208	1075.0	NaN	NaN	225000.0	171342.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	215354	5714464	-203	528.0	NaN	NaN	464323.5	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	215354	5714465	-203	NaN	NaN	NaN	90000.0	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	215354	5714466	-629	1197.0	NaN	77674.5	2700000.0	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 52 columns

(1716428, 52)

SK_ID_CURR                                                  uint32 
SK_ID_BUREAU                                                uint32 
DAYS_CREDIT                                                 int16  
CREDIT_DAY_OVERDUE                                          uint16 
DAYS_CREDIT_ENDDATE                                         float64
DAYS_ENDDATE_FACT                                           float64
AMT_CREDIT_MAX_OVERDUE                                      float64
CNT_CREDIT_PROLONG                                          uint8  
AMT_CREDIT_SUM                                              float64
AMT_CREDIT_SUM_DEBT                                         float64
AMT_CREDIT_SUM_LIMIT                                        float64
AMT_CREDIT_SUM_OVERDUE                                      float32
DAYS_CREDIT_UPDATE                                          int32  
AMT_ANNUITY                                                 float64
CREDIT_ACTIVE_Active                                        uint8  
CREDIT_ACTIVE_Bad debt                                      uint8  
CREDIT_ACTIVE_Closed                                        uint8  
CREDIT_ACTIVE_Sold                                          uint8  
CREDIT_ACTIVE_nan                                           uint8  
CREDIT_CURRENCY_currency 1                                  uint8  
CREDIT_CURRENCY_currency 2                                  uint8  
CREDIT_CURRENCY_currency 3                                  uint8  
CREDIT_CURRENCY_currency 4                                  uint8  
CREDIT_CURRENCY_nan                                         uint8  
CREDIT_TYPE_Another type of loan                            uint8  
CREDIT_TYPE_Car loan                                        uint8  
CREDIT_TYPE_Cash loan (non-earmarked)                       uint8  
CREDIT_TYPE_Consumer credit                                 uint8  
CREDIT_TYPE_Credit card                                     uint8  
CREDIT_TYPE_Interbank credit                                uint8  
CREDIT_TYPE_Loan for business development                   uint8  
CREDIT_TYPE_Loan for purchase of shares (margin lending)    uint8  
CREDIT_TYPE_Loan for the purchase of equipment              uint8  
CREDIT_TYPE_Loan for working capital replenishment          uint8  
CREDIT_TYPE_Microloan                                       uint8  
CREDIT_TYPE_Mobile operator loan                            uint8  
CREDIT_TYPE_Mortgage                                        uint8  
CREDIT_TYPE_Real estate loan                                uint8  
CREDIT_TYPE_Unknown type of loan                            uint8  
CREDIT_TYPE_nan                                             uint8  
MONTHS_BALANCE_MIN                                          float64
MONTHS_BALANCE_MAX                                          float64
MONTHS_BALANCE_SIZE                                         float64
STATUS_0_MEAN                                               float64
STATUS_1_MEAN                                               float64
STATUS_2_MEAN                                               float64
STATUS_3_MEAN                                               float64
STATUS_4_MEAN                                               float64
STATUS_5_MEAN                                               float64
STATUS_C_MEAN                                               float64
STATUS_X_MEAN                                               float64
STATUS_nan_MEAN                                             float64
dtype: object

# 6.5.2 Just for curiosity, what happened
#       to those rows in 'bureau' where there
#       was no matching record in bb_agg. The list
#       of such IDs is:
#       [6292791,6292792,6292793,6292795,6292796,6292797,6292798,6292799]

bureau[bureau['SK_ID_BUREAU'] ==6292791]

	SK_ID_CURR	SK_ID_BUREAU	DAYS_CREDIT	CREDIT_DAY_OVERDUE	DAYS_CREDIT_ENDDATE	DAYS_ENDDATE_FACT	AMT_CREDIT_MAX_OVERDUE	CNT_CREDIT_PROLONG	AMT_CREDIT_SUM	AMT_CREDIT_SUM_DEBT	...	MONTHS_BALANCE_SIZE	STATUS_0_MEAN	STATUS_1_MEAN	STATUS_2_MEAN	STATUS_3_MEAN	STATUS_4_MEAN	STATUS_5_MEAN	STATUS_C_MEAN	STATUS_X_MEAN	STATUS_nan_MEAN
941854	190757	6292791	-707	0	-300.0	-300.0	0.0	0	2250000.0	0.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

1 rows × 52 columns

# 6.6 Drop SK_ID_BUREAU as bb has finally merged.

bureau.drop(['SK_ID_BUREAU'],
            axis=1,
            inplace= True
           )

# We have three types of columns
# Categorical columns generated from bureau
# Categorical columns generated from bb
# Numerical columns

Performing aggregations in bureau¶

Aggregate 14 original numeric columns, as: ['min', 'max', 'mean', 'var']
Aggregate rest of the columns that is dummy columns as: [mean].
This constitutes one of the three aggretaions. Aggregation is by SK_ID_CURR. Resulting dataset is called bureau_agg

# 7.0 Have a look at bureau again.
#      SK_ID_CURR repeats for many cases.
#         So, there is a case for aggregation

bureau.shape     # (1716428, 51)
bureau.head()

(1716428, 51)

	SK_ID_CURR	DAYS_CREDIT	DAYS_CREDIT_ENDDATE	DAYS_ENDDATE_FACT	AMT_CREDIT_MAX_OVERDUE	AMT_CREDIT_SUM	AMT_CREDIT_SUM_DEBT	AMT_CREDIT_SUM_LIMIT	...	MONTHS_BALANCE_SIZE	STATUS_0_MEAN	STATUS_1_MEAN	STATUS_2_MEAN	STATUS_3_MEAN	STATUS_4_MEAN	STATUS_5_MEAN	STATUS_C_MEAN	STATUS_X_MEAN	STATUS_nan_MEAN
0	215354	-497	-153.0	-153.0	NaN	91323.0	0.0	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	215354	-208	1075.0	NaN	NaN	225000.0	171342.0	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	215354	-203	528.0	NaN	NaN	464323.5	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	215354	-203	NaN	NaN	NaN	90000.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	215354	-629	1197.0	NaN	77674.5	2700000.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 51 columns

## Aggregation strategy
# 7.1 Numeric features
#     Columns: Bureau + bureau_balance numeric features
#              Last three columns are from bureau_balance
#              Total: 11 + 3 = 14

num_aggregations = {
                     'DAYS_CREDIT':             ['min', 'max', 'mean', 'var'],
                     'DAYS_CREDIT_ENDDATE':     ['min', 'max', 'mean'],
                     'DAYS_CREDIT_UPDATE':      ['mean'],
                     'CREDIT_DAY_OVERDUE':      ['max', 'mean'],
                     'AMT_CREDIT_MAX_OVERDUE':  ['mean'],
                     'AMT_CREDIT_SUM':          ['max', 'mean', 'sum'],
                     'AMT_CREDIT_SUM_DEBT':     ['max', 'mean', 'sum'],
                     'AMT_CREDIT_SUM_OVERDUE':  ['mean'],
                     'AMT_CREDIT_SUM_LIMIT':    ['mean', 'sum'],
                     'AMT_ANNUITY':             ['max', 'mean'],
                     'CNT_CREDIT_PROLONG':      ['sum'],
                     'MONTHS_BALANCE_MIN':      ['min'],
                     'MONTHS_BALANCE_MAX':      ['max'],
                     'MONTHS_BALANCE_SIZE':     ['mean', 'sum']
                   }

len(num_aggregations)   # 14

# 7.2 Bureau categorical features. Derived from:
#       'CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE', 
#        Total: 

cat_aggregations = {}
bureau_cat      # bureau_cat are newly created dummy columns
                #  but all are numerical columns

# 7.2.1    
len(bureau_cat) # 26    

['CREDIT_ACTIVE_Active',
 'CREDIT_ACTIVE_Bad debt',
 'CREDIT_ACTIVE_Closed',
 'CREDIT_ACTIVE_Sold',
 'CREDIT_ACTIVE_nan',
 'CREDIT_CURRENCY_currency 1',
 'CREDIT_CURRENCY_currency 2',
 'CREDIT_CURRENCY_currency 3',
 'CREDIT_CURRENCY_currency 4',
 'CREDIT_CURRENCY_nan',
 'CREDIT_TYPE_Another type of loan',
 'CREDIT_TYPE_Car loan',
 'CREDIT_TYPE_Cash loan (non-earmarked)',
 'CREDIT_TYPE_Consumer credit',
 'CREDIT_TYPE_Credit card',
 'CREDIT_TYPE_Interbank credit',
 'CREDIT_TYPE_Loan for business development',
 'CREDIT_TYPE_Loan for purchase of shares (margin lending)',
 'CREDIT_TYPE_Loan for the purchase of equipment',
 'CREDIT_TYPE_Loan for working capital replenishment',
 'CREDIT_TYPE_Microloan',
 'CREDIT_TYPE_Mobile operator loan',
 'CREDIT_TYPE_Mortgage',
 'CREDIT_TYPE_Real estate loan',
 'CREDIT_TYPE_Unknown type of loan',
 'CREDIT_TYPE_nan']

# 7.2.2 For all these new dummy columns in bureau, we will
#       take mean
for cat in bureau_cat: cat_aggregations[cat] = ['mean']
cat_aggregations    

len(cat_aggregations)   # 26

{'CREDIT_ACTIVE_Active': ['mean'],
 'CREDIT_ACTIVE_Bad debt': ['mean'],
 'CREDIT_ACTIVE_Closed': ['mean'],
 'CREDIT_ACTIVE_Sold': ['mean'],
 'CREDIT_ACTIVE_nan': ['mean'],
 'CREDIT_CURRENCY_currency 1': ['mean'],
 'CREDIT_CURRENCY_currency 2': ['mean'],
 'CREDIT_CURRENCY_currency 3': ['mean'],
 'CREDIT_CURRENCY_currency 4': ['mean'],
 'CREDIT_CURRENCY_nan': ['mean'],
 'CREDIT_TYPE_Another type of loan': ['mean'],
 'CREDIT_TYPE_Car loan': ['mean'],
 'CREDIT_TYPE_Cash loan (non-earmarked)': ['mean'],
 'CREDIT_TYPE_Consumer credit': ['mean'],
 'CREDIT_TYPE_Credit card': ['mean'],
 'CREDIT_TYPE_Interbank credit': ['mean'],
 'CREDIT_TYPE_Loan for business development': ['mean'],
 'CREDIT_TYPE_Loan for purchase of shares (margin lending)': ['mean'],
 'CREDIT_TYPE_Loan for the purchase of equipment': ['mean'],
 'CREDIT_TYPE_Loan for working capital replenishment': ['mean'],
 'CREDIT_TYPE_Microloan': ['mean'],
 'CREDIT_TYPE_Mobile operator loan': ['mean'],
 'CREDIT_TYPE_Mortgage': ['mean'],
 'CREDIT_TYPE_Real estate loan': ['mean'],
 'CREDIT_TYPE_Unknown type of loan': ['mean'],
 'CREDIT_TYPE_nan': ['mean']}

# 7.3.1 In addition, we have in bureau. columns that merged
#        from 'bb' ie bb_cat
#         So here is our full list
bb_cat
len(bb_cat)             # 9

# 7.3.2
for cat in bb_cat: cat_aggregations[cat + "_MEAN"] = ['mean']
cat_aggregations 

len(cat_aggregations)   # 26 + 9 = 35

['STATUS_0',
 'STATUS_1',
 'STATUS_2',
 'STATUS_3',
 'STATUS_4',
 'STATUS_5',
 'STATUS_C',
 'STATUS_X',
 'STATUS_nan']

{'CREDIT_ACTIVE_Active': ['mean'],
 'CREDIT_ACTIVE_Bad debt': ['mean'],
 'CREDIT_ACTIVE_Closed': ['mean'],
 'CREDIT_ACTIVE_Sold': ['mean'],
 'CREDIT_ACTIVE_nan': ['mean'],
 'CREDIT_CURRENCY_currency 1': ['mean'],
 'CREDIT_CURRENCY_currency 2': ['mean'],
 'CREDIT_CURRENCY_currency 3': ['mean'],
 'CREDIT_CURRENCY_currency 4': ['mean'],
 'CREDIT_CURRENCY_nan': ['mean'],
 'CREDIT_TYPE_Another type of loan': ['mean'],
 'CREDIT_TYPE_Car loan': ['mean'],
 'CREDIT_TYPE_Cash loan (non-earmarked)': ['mean'],
 'CREDIT_TYPE_Consumer credit': ['mean'],
 'CREDIT_TYPE_Credit card': ['mean'],
 'CREDIT_TYPE_Interbank credit': ['mean'],
 'CREDIT_TYPE_Loan for business development': ['mean'],
 'CREDIT_TYPE_Loan for purchase of shares (margin lending)': ['mean'],
 'CREDIT_TYPE_Loan for the purchase of equipment': ['mean'],
 'CREDIT_TYPE_Loan for working capital replenishment': ['mean'],
 'CREDIT_TYPE_Microloan': ['mean'],
 'CREDIT_TYPE_Mobile operator loan': ['mean'],
 'CREDIT_TYPE_Mortgage': ['mean'],
 'CREDIT_TYPE_Real estate loan': ['mean'],
 'CREDIT_TYPE_Unknown type of loan': ['mean'],
 'CREDIT_TYPE_nan': ['mean'],
 'STATUS_0_MEAN': ['mean'],
 'STATUS_1_MEAN': ['mean'],
 'STATUS_2_MEAN': ['mean'],
 'STATUS_3_MEAN': ['mean'],
 'STATUS_4_MEAN': ['mean'],
 'STATUS_5_MEAN': ['mean'],
 'STATUS_C_MEAN': ['mean'],
 'STATUS_X_MEAN': ['mean'],
 'STATUS_nan_MEAN': ['mean']}

# 7.4 Have a look at bureau columns again
#      Just to compare above results with what
#       already exists

bureau.columns        # 51
len(bureau.columns)   # 35 (dummy) + 14 (num) + 1 (SK_ID_CURR) + 1 (DAYS_ENDDATE_FACT) = 51

Index(['SK_ID_CURR', 'DAYS_CREDIT', 'CREDIT_DAY_OVERDUE',
       'DAYS_CREDIT_ENDDATE', 'DAYS_ENDDATE_FACT', 'AMT_CREDIT_MAX_OVERDUE',
       'CNT_CREDIT_PROLONG', 'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT',
       'AMT_CREDIT_SUM_LIMIT', 'AMT_CREDIT_SUM_OVERDUE', 'DAYS_CREDIT_UPDATE',
       'AMT_ANNUITY', 'CREDIT_ACTIVE_Active', 'CREDIT_ACTIVE_Bad debt',
       'CREDIT_ACTIVE_Closed', 'CREDIT_ACTIVE_Sold', 'CREDIT_ACTIVE_nan',
       'CREDIT_CURRENCY_currency 1', 'CREDIT_CURRENCY_currency 2',
       'CREDIT_CURRENCY_currency 3', 'CREDIT_CURRENCY_currency 4',
       'CREDIT_CURRENCY_nan', 'CREDIT_TYPE_Another type of loan',
       'CREDIT_TYPE_Car loan', 'CREDIT_TYPE_Cash loan (non-earmarked)',
       'CREDIT_TYPE_Consumer credit', 'CREDIT_TYPE_Credit card',
       'CREDIT_TYPE_Interbank credit',
       'CREDIT_TYPE_Loan for business development',
       'CREDIT_TYPE_Loan for purchase of shares (margin lending)',
       'CREDIT_TYPE_Loan for the purchase of equipment',
       'CREDIT_TYPE_Loan for working capital replenishment',
       'CREDIT_TYPE_Microloan', 'CREDIT_TYPE_Mobile operator loan',
       'CREDIT_TYPE_Mortgage', 'CREDIT_TYPE_Real estate loan',
       'CREDIT_TYPE_Unknown type of loan', 'CREDIT_TYPE_nan',
       'MONTHS_BALANCE_MIN', 'MONTHS_BALANCE_MAX', 'MONTHS_BALANCE_SIZE',
       'STATUS_0_MEAN', 'STATUS_1_MEAN', 'STATUS_2_MEAN', 'STATUS_3_MEAN',
       'STATUS_4_MEAN', 'STATUS_5_MEAN', 'STATUS_C_MEAN', 'STATUS_X_MEAN',
       'STATUS_nan_MEAN'],
      dtype='object')

# 7.5 Now that we have decided 
#     our aggregation strategy for each column
#      (except 2), let us now aggregate:
#         Note that SK_ID_CURR now becomes an index to data

grouped = bureau.groupby('SK_ID_CURR')
bureau_agg = grouped.agg({**num_aggregations, **cat_aggregations})

# 7.6
bureau_agg.head()
bureau_agg.shape  # (305811, 62) (including newly created min, max etc columns)

	DAYS_CREDIT				DAYS_CREDIT_ENDDATE			DAYS_CREDIT_UPDATE	CREDIT_DAY_OVERDUE		...	CREDIT_TYPE_nan	STATUS_0_MEAN	STATUS_1_MEAN	STATUS_2_MEAN	STATUS_3_MEAN	STATUS_4_MEAN	STATUS_5_MEAN	STATUS_C_MEAN	STATUS_X_MEAN	STATUS_nan_MEAN
	min	max	mean	var	min	max	mean	mean	max	mean	...	mean	mean	mean	mean	mean	mean	mean	mean	mean	mean
SK_ID_CURR
100001	-1572	-49	-735.000000	240043.666667	-1329.0	1778.0	82.428571	-93.142857	0	0.0	...	0	0.336651	0.007519	0.0	0.0	0.0	0.0	0.441240	0.214590	0.0
100002	-1437	-103	-874.000000	186150.000000	-1072.0	780.0	-349.000000	-499.875000	0	0.0	...	0	0.406960	0.255682	0.0	0.0	0.0	0.0	0.175426	0.161932	0.0
100003	-2586	-606	-1400.750000	827783.583333	-2434.0	1216.0	-544.500000	-816.000000	0	0.0	...	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
100004	-1326	-408	-867.000000	421362.000000	-595.0	-382.0	-488.500000	-532.000000	0	0.0	...	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
100005	-373	-62	-190.666667	26340.333333	-128.0	1324.0	439.333333	-54.333333	0	0.0	...	0	0.735043	0.000000	0.0	0.0	0.0	0.0	0.128205	0.136752	0.0

5 rows × 62 columns

(305811, 62)

# 7.7 Remove hierarchical index from bureau_agg
bureau_agg.columns       # 62 
bureau_agg.columns = pd.Index(['BURO_' + e[0] + "_" + e[1].upper() for e in bureau_agg.columns.tolist()])

MultiIndex([(                                             'DAYS_CREDIT', ...),
            (                                             'DAYS_CREDIT', ...),
            (                                             'DAYS_CREDIT', ...),
            (                                             'DAYS_CREDIT', ...),
            (                                     'DAYS_CREDIT_ENDDATE', ...),
            (                                     'DAYS_CREDIT_ENDDATE', ...),
            (                                     'DAYS_CREDIT_ENDDATE', ...),
            (                                      'DAYS_CREDIT_UPDATE', ...),
            (                                      'CREDIT_DAY_OVERDUE', ...),
            (                                      'CREDIT_DAY_OVERDUE', ...),
            (                                  'AMT_CREDIT_MAX_OVERDUE', ...),
            (                                          'AMT_CREDIT_SUM', ...),
            (                                          'AMT_CREDIT_SUM', ...),
            (                                          'AMT_CREDIT_SUM', ...),
            (                                     'AMT_CREDIT_SUM_DEBT', ...),
            (                                     'AMT_CREDIT_SUM_DEBT', ...),
            (                                     'AMT_CREDIT_SUM_DEBT', ...),
            (                                  'AMT_CREDIT_SUM_OVERDUE', ...),
            (                                    'AMT_CREDIT_SUM_LIMIT', ...),
            (                                    'AMT_CREDIT_SUM_LIMIT', ...),
            (                                             'AMT_ANNUITY', ...),
            (                                             'AMT_ANNUITY', ...),
            (                                      'CNT_CREDIT_PROLONG', ...),
            (                                      'MONTHS_BALANCE_MIN', ...),
            (                                      'MONTHS_BALANCE_MAX', ...),
            (                                     'MONTHS_BALANCE_SIZE', ...),
            (                                     'MONTHS_BALANCE_SIZE', ...),
            (                                    'CREDIT_ACTIVE_Active', ...),
            (                                  'CREDIT_ACTIVE_Bad debt', ...),
            (                                    'CREDIT_ACTIVE_Closed', ...),
            (                                      'CREDIT_ACTIVE_Sold', ...),
            (                                       'CREDIT_ACTIVE_nan', ...),
            (                              'CREDIT_CURRENCY_currency 1', ...),
            (                              'CREDIT_CURRENCY_currency 2', ...),
            (                              'CREDIT_CURRENCY_currency 3', ...),
            (                              'CREDIT_CURRENCY_currency 4', ...),
            (                                     'CREDIT_CURRENCY_nan', ...),
            (                        'CREDIT_TYPE_Another type of loan', ...),
            (                                    'CREDIT_TYPE_Car loan', ...),
            (                   'CREDIT_TYPE_Cash loan (non-earmarked)', ...),
            (                             'CREDIT_TYPE_Consumer credit', ...),
            (                                 'CREDIT_TYPE_Credit card', ...),
            (                            'CREDIT_TYPE_Interbank credit', ...),
            (               'CREDIT_TYPE_Loan for business development', ...),
            ('CREDIT_TYPE_Loan for purchase of shares (margin lending)', ...),
            (          'CREDIT_TYPE_Loan for the purchase of equipment', ...),
            (      'CREDIT_TYPE_Loan for working capital replenishment', ...),
            (                                   'CREDIT_TYPE_Microloan', ...),
            (                        'CREDIT_TYPE_Mobile operator loan', ...),
            (                                    'CREDIT_TYPE_Mortgage', ...),
            (                            'CREDIT_TYPE_Real estate loan', ...),
            (                        'CREDIT_TYPE_Unknown type of loan', ...),
            (                                         'CREDIT_TYPE_nan', ...),
            (                                           'STATUS_0_MEAN', ...),
            (                                           'STATUS_1_MEAN', ...),
            (                                           'STATUS_2_MEAN', ...),
            (                                           'STATUS_3_MEAN', ...),
            (                                           'STATUS_4_MEAN', ...),
            (                                           'STATUS_5_MEAN', ...),
            (                                           'STATUS_C_MEAN', ...),
            (                                           'STATUS_X_MEAN', ...),
            (                                         'STATUS_nan_MEAN', ...)],
           )

# 7.8
bureau_agg.head()
# 7.8.1 Note that SK_ID_CURR is now an index to table
bureau_agg.columns   # 62: Due to creation of min, max, var etc columns

	BURO_DAYS_CREDIT_MIN	BURO_DAYS_CREDIT_MAX	BURO_DAYS_CREDIT_MEAN	BURO_DAYS_CREDIT_VAR	BURO_DAYS_CREDIT_ENDDATE_MIN	BURO_DAYS_CREDIT_ENDDATE_MAX	BURO_DAYS_CREDIT_ENDDATE_MEAN	BURO_DAYS_CREDIT_UPDATE_MEAN	BURO_CREDIT_DAY_OVERDUE_MAX	BURO_CREDIT_DAY_OVERDUE_MEAN	...	BURO_CREDIT_TYPE_nan_MEAN	BURO_STATUS_0_MEAN_MEAN	BURO_STATUS_1_MEAN_MEAN	BURO_STATUS_2_MEAN_MEAN	BURO_STATUS_3_MEAN_MEAN	BURO_STATUS_4_MEAN_MEAN	BURO_STATUS_5_MEAN_MEAN	BURO_STATUS_C_MEAN_MEAN	BURO_STATUS_X_MEAN_MEAN	BURO_STATUS_nan_MEAN_MEAN
SK_ID_CURR
100001	-1572	-49	-735.000000	240043.666667	-1329.0	1778.0	82.428571	-93.142857	0	0.0	...	0	0.336651	0.007519	0.0	0.0	0.0	0.0	0.441240	0.214590	0.0
100002	-1437	-103	-874.000000	186150.000000	-1072.0	780.0	-349.000000	-499.875000	0	0.0	...	0	0.406960	0.255682	0.0	0.0	0.0	0.0	0.175426	0.161932	0.0
100003	-2586	-606	-1400.750000	827783.583333	-2434.0	1216.0	-544.500000	-816.000000	0	0.0	...	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
100004	-1326	-408	-867.000000	421362.000000	-595.0	-382.0	-488.500000	-532.000000	0	0.0	...	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
100005	-373	-62	-190.666667	26340.333333	-128.0	1324.0	439.333333	-54.333333	0	0.0	...	0	0.735043	0.000000	0.0	0.0	0.0	0.0	0.128205	0.136752	0.0

5 rows × 62 columns

Index(['BURO_DAYS_CREDIT_MIN', 'BURO_DAYS_CREDIT_MAX', 'BURO_DAYS_CREDIT_MEAN',
       'BURO_DAYS_CREDIT_VAR', 'BURO_DAYS_CREDIT_ENDDATE_MIN',
       'BURO_DAYS_CREDIT_ENDDATE_MAX', 'BURO_DAYS_CREDIT_ENDDATE_MEAN',
       'BURO_DAYS_CREDIT_UPDATE_MEAN', 'BURO_CREDIT_DAY_OVERDUE_MAX',
       'BURO_CREDIT_DAY_OVERDUE_MEAN', 'BURO_AMT_CREDIT_MAX_OVERDUE_MEAN',
       'BURO_AMT_CREDIT_SUM_MAX', 'BURO_AMT_CREDIT_SUM_MEAN',
       'BURO_AMT_CREDIT_SUM_SUM', 'BURO_AMT_CREDIT_SUM_DEBT_MAX',
       'BURO_AMT_CREDIT_SUM_DEBT_MEAN', 'BURO_AMT_CREDIT_SUM_DEBT_SUM',
       'BURO_AMT_CREDIT_SUM_OVERDUE_MEAN', 'BURO_AMT_CREDIT_SUM_LIMIT_MEAN',
       'BURO_AMT_CREDIT_SUM_LIMIT_SUM', 'BURO_AMT_ANNUITY_MAX',
       'BURO_AMT_ANNUITY_MEAN', 'BURO_CNT_CREDIT_PROLONG_SUM',
       'BURO_MONTHS_BALANCE_MIN_MIN', 'BURO_MONTHS_BALANCE_MAX_MAX',
       'BURO_MONTHS_BALANCE_SIZE_MEAN', 'BURO_MONTHS_BALANCE_SIZE_SUM',
       'BURO_CREDIT_ACTIVE_Active_MEAN', 'BURO_CREDIT_ACTIVE_Bad debt_MEAN',
       'BURO_CREDIT_ACTIVE_Closed_MEAN', 'BURO_CREDIT_ACTIVE_Sold_MEAN',
       'BURO_CREDIT_ACTIVE_nan_MEAN', 'BURO_CREDIT_CURRENCY_currency 1_MEAN',
       'BURO_CREDIT_CURRENCY_currency 2_MEAN',
       'BURO_CREDIT_CURRENCY_currency 3_MEAN',
       'BURO_CREDIT_CURRENCY_currency 4_MEAN', 'BURO_CREDIT_CURRENCY_nan_MEAN',
       'BURO_CREDIT_TYPE_Another type of loan_MEAN',
       'BURO_CREDIT_TYPE_Car loan_MEAN',
       'BURO_CREDIT_TYPE_Cash loan (non-earmarked)_MEAN',
       'BURO_CREDIT_TYPE_Consumer credit_MEAN',
       'BURO_CREDIT_TYPE_Credit card_MEAN',
       'BURO_CREDIT_TYPE_Interbank credit_MEAN',
       'BURO_CREDIT_TYPE_Loan for business development_MEAN',
       'BURO_CREDIT_TYPE_Loan for purchase of shares (margin lending)_MEAN',
       'BURO_CREDIT_TYPE_Loan for the purchase of equipment_MEAN',
       'BURO_CREDIT_TYPE_Loan for working capital replenishment_MEAN',
       'BURO_CREDIT_TYPE_Microloan_MEAN',
       'BURO_CREDIT_TYPE_Mobile operator loan_MEAN',
       'BURO_CREDIT_TYPE_Mortgage_MEAN',
       'BURO_CREDIT_TYPE_Real estate loan_MEAN',
       'BURO_CREDIT_TYPE_Unknown type of loan_MEAN',
       'BURO_CREDIT_TYPE_nan_MEAN', 'BURO_STATUS_0_MEAN_MEAN',
       'BURO_STATUS_1_MEAN_MEAN', 'BURO_STATUS_2_MEAN_MEAN',
       'BURO_STATUS_3_MEAN_MEAN', 'BURO_STATUS_4_MEAN_MEAN',
       'BURO_STATUS_5_MEAN_MEAN', 'BURO_STATUS_C_MEAN_MEAN',
       'BURO_STATUS_X_MEAN_MEAN', 'BURO_STATUS_nan_MEAN_MEAN'],
      dtype='object')

# 7.9 No duplicate index
bureau_agg.index.nunique()   # 305811
len(set(bureau_agg.index))   # 305811

More Aggregation and merger¶

We now filter bureau on CREDIT_ACTIVE_Active feature. This will create two subsets of data. This feature has values of 1 and 0.
Filter data where CREDIT_ACTIVE_Active value is 1. Then aggregate(only) numeric features of this filtered data-subset by grouping on SK_ID_CURR. Next, filter, bureau, on CREDIT_ACTIVE_Closed = 1 . And again aggregate the subset on numeric features. Merge all these with bureau_agg (NOT bureau.)

It is as if we are trying to extract the behaviour of those whose credits are active and those whose credits are closed.

# 8.0 In which cases credit is active? Filter data
active = bureau[bureau['CREDIT_ACTIVE_Active'] == 1]
active.head()
active.shape   # (630607, 51)

	SK_ID_CURR	DAYS_CREDIT	DAYS_CREDIT_ENDDATE	DAYS_ENDDATE_FACT	AMT_CREDIT_MAX_OVERDUE	AMT_CREDIT_SUM	AMT_CREDIT_SUM_DEBT	AMT_CREDIT_SUM_LIMIT	...	MONTHS_BALANCE_SIZE	STATUS_0_MEAN	STATUS_1_MEAN	STATUS_2_MEAN	STATUS_3_MEAN	STATUS_4_MEAN	STATUS_5_MEAN	STATUS_C_MEAN	STATUS_X_MEAN	STATUS_nan_MEAN
1	215354	-208	1075.0	NaN	NaN	225000.0	171342.00	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	215354	-203	528.0	NaN	NaN	464323.5	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	215354	-203	NaN	NaN	NaN	90000.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	215354	-629	1197.0	NaN	77674.5	2700000.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	215354	-273	27460.0	NaN	0.0	180000.0	71017.38	108982.62	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 51 columns

(630607, 51)

# 8.1 Aggregate numercial features of the filtered subset over SK_ID_CURR
active_agg = active.groupby('SK_ID_CURR').agg(num_aggregations)

# 8.1.1
active_agg.head()
active_agg.shape   # (251815, 27)

	DAYS_CREDIT				DAYS_CREDIT_ENDDATE			DAYS_CREDIT_UPDATE	CREDIT_DAY_OVERDUE		...	AMT_CREDIT_SUM_OVERDUE	AMT_CREDIT_SUM_LIMIT		AMT_ANNUITY		CNT_CREDIT_PROLONG	MONTHS_BALANCE_MIN	MONTHS_BALANCE_MAX	MONTHS_BALANCE_SIZE
	min	max	mean	var	min	max	mean	mean	max	mean	...	mean	mean	sum	max	mean	sum	min	max	mean	sum
SK_ID_CURR
100001	-559	-49	-309.333333	65110.333333	411.0	1778.0	1030.333333	-10.666667	0	0.0	...	0.0	0.0000	0.000	10822.5	8272.50	0	-18.0	0.0	10.666667	32.0
100002	-1042	-103	-572.500000	440860.500000	780.0	780.0	780.000000	-15.500000	0	0.0	...	0.0	15994.2825	31988.565	0.0	0.00	0	-34.0	0.0	10.000000	20.0
100003	-606	-606	-606.000000	NaN	1216.0	1216.0	1216.000000	-43.000000	0	0.0	...	0.0	810000.0000	810000.000	NaN	NaN	0	NaN	NaN	NaN	0.0
100005	-137	-62	-99.500000	2812.500000	122.0	1324.0	723.000000	-21.000000	0	0.0	...	0.0	0.0000	0.000	4261.5	2130.75	0	-4.0	0.0	4.000000	8.0
100008	-78	-78	-78.000000	NaN	471.0	471.0	471.000000	-16.000000	0	0.0	...	0.0	0.0000	0.000	NaN	NaN	0	NaN	NaN	NaN	0.0

5 rows × 27 columns

(251815, 27)

# 8.1.2 Rename multi-indexed columns
active_agg.columns = pd.Index(['ACTIVE_' + e[0] + "_" + e[1].upper() for e in active_agg.columns.tolist()])
active_agg.columns

Index(['ACTIVE_DAYS_CREDIT_MIN', 'ACTIVE_DAYS_CREDIT_MAX',
       'ACTIVE_DAYS_CREDIT_MEAN', 'ACTIVE_DAYS_CREDIT_VAR',
       'ACTIVE_DAYS_CREDIT_ENDDATE_MIN', 'ACTIVE_DAYS_CREDIT_ENDDATE_MAX',
       'ACTIVE_DAYS_CREDIT_ENDDATE_MEAN', 'ACTIVE_DAYS_CREDIT_UPDATE_MEAN',
       'ACTIVE_CREDIT_DAY_OVERDUE_MAX', 'ACTIVE_CREDIT_DAY_OVERDUE_MEAN',
       'ACTIVE_AMT_CREDIT_MAX_OVERDUE_MEAN', 'ACTIVE_AMT_CREDIT_SUM_MAX',
       'ACTIVE_AMT_CREDIT_SUM_MEAN', 'ACTIVE_AMT_CREDIT_SUM_SUM',
       'ACTIVE_AMT_CREDIT_SUM_DEBT_MAX', 'ACTIVE_AMT_CREDIT_SUM_DEBT_MEAN',
       'ACTIVE_AMT_CREDIT_SUM_DEBT_SUM', 'ACTIVE_AMT_CREDIT_SUM_OVERDUE_MEAN',
       'ACTIVE_AMT_CREDIT_SUM_LIMIT_MEAN', 'ACTIVE_AMT_CREDIT_SUM_LIMIT_SUM',
       'ACTIVE_AMT_ANNUITY_MAX', 'ACTIVE_AMT_ANNUITY_MEAN',
       'ACTIVE_CNT_CREDIT_PROLONG_SUM', 'ACTIVE_MONTHS_BALANCE_MIN_MIN',
       'ACTIVE_MONTHS_BALANCE_MAX_MAX', 'ACTIVE_MONTHS_BALANCE_SIZE_MEAN',
       'ACTIVE_MONTHS_BALANCE_SIZE_SUM'],
      dtype='object')

# 9.0 Difference between length of two datasets
active_agg_set = set(active_agg.index)
bureau_agg_set = set(bureau_agg.index)
len(bureau_agg_set)     # 305811
len(active_agg_set)     # 251815
list(bureau_agg_set - active_agg_set)[:4]   # Few examples: {131074, 393220, 262149, 262153]

[131074, 393220, 262149, 262153]

# 9.1 Merge bureau_agg with active_agg over 'SK_ID_CURR'
bureau_agg = bureau_agg.join(
                             active_agg,
                             how='left',
                             on='SK_ID_CURR'
                             )

# 9.2 
bureau_agg.shape    # (305811, 89)

(305811, 89)

# 9.3 Obviouly some rows will hold NaN values for merged columns
bureau_agg.loc[[131074,393220,262149, 262153]]

	BURO_DAYS_CREDIT_MIN	BURO_DAYS_CREDIT_MAX	BURO_DAYS_CREDIT_MEAN	BURO_DAYS_CREDIT_VAR	BURO_DAYS_CREDIT_ENDDATE_MIN	BURO_DAYS_CREDIT_ENDDATE_MAX	BURO_DAYS_CREDIT_ENDDATE_MEAN	BURO_DAYS_CREDIT_UPDATE_MEAN	BURO_CREDIT_DAY_OVERDUE_MAX	BURO_CREDIT_DAY_OVERDUE_MEAN	...	ACTIVE_AMT_CREDIT_SUM_OVERDUE_MEAN	ACTIVE_AMT_CREDIT_SUM_LIMIT_MEAN	ACTIVE_AMT_CREDIT_SUM_LIMIT_SUM	ACTIVE_AMT_ANNUITY_MAX	ACTIVE_AMT_ANNUITY_MEAN	ACTIVE_CNT_CREDIT_PROLONG_SUM	ACTIVE_MONTHS_BALANCE_MIN_MIN	ACTIVE_MONTHS_BALANCE_MAX_MAX	ACTIVE_MONTHS_BALANCE_SIZE_MEAN	ACTIVE_MONTHS_BALANCE_SIZE_SUM
SK_ID_CURR
131074	-2419	-2419	-2419.0	NaN	-1301.0	-1301.0	-1301.0	-1707.0	0	0.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
393220	-478	-478	-478.0	NaN	NaN	NaN	NaN	-288.0	0	0.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
262149	-373	-373	-373.0	NaN	-251.0	-251.0	-251.0	-243.0	0	0.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
262153	-1677	-1677	-1677.0	NaN	-1589.0	-1589.0	-1589.0	-1012.0	0	0.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

4 rows × 89 columns

# 9.4 Release memory
del active, active_agg
gc.collect()

# 10.0 Same steps for the  CREDIT_ACTIVE_Closed =1 cases
#     Bureau: Closed credits - using only numerical aggregations
closed = bureau[bureau['CREDIT_ACTIVE_Closed'] == 1]
closed_agg = closed.groupby('SK_ID_CURR').agg(num_aggregations)
closed_agg.columns = pd.Index(['CLOSED_' + e[0] + "_" + e[1].upper() for e in closed_agg.columns.tolist()])
bureau_agg = bureau_agg.join(closed_agg, how='left', on='SK_ID_CURR')

# 10.1
bureau_agg.shape   # (305811, 116)

(305811, 116)

# 10.2
del closed, closed_agg, bureau
gc.collect()

# 10.3 SK_ID_CURR is index. Index is also saved by-default.
bureau_agg.to_csv("processed_bureau_agg.csv.zip", compression = "zip")

##################

Home Credit Default Risk Assessment

Bureau and Bureau Balance data¶

Feature explanations¶

Bureau table¶

Bureau Balance table¶

Aggregation¶

bureau_balance¶

Performing aggregations in bb¶

Performing aggregations in bureau¶

More Aggregation and merger¶