Bureau and Bureau Balance data

*bureau.csv* data concerns client's earlier credits from other financial institutions. Some of the credits may be active and some are closed. Each previous (or ongoing) credit has its own row (only one row per credit) in *bureau* dataset. As a single client might have taken other loans from other financial institutions, for each row in the *application_train* data (ie *application_train.csv*) we can have multiple rows in this table. Feature explanations for this dataset are as below.

Feature explanations

Bureau table

SK_ID_CURR: ID of loan in our sample - one loan in our sample can have 0,1,2 or more related previous credits in credit bureau
SK_BUREAU_ID: Recoded ID of previous Credit Bureau credit related to our loan (unique coding for each loan application)
CREDIT_ACTIVE: Status of the Credit Bureau (CB) reported credits
CREDIT_CURRENCY: Recoded currency of the Credit Bureau credit
DAYS_CREDIT: How many days before current application did client apply for Credit Bureau credit
CREDIT_DAY_OVERDUE: Number of days past due on CB credit at the time of application for related loan in our sample
DAYS_CREDIT_ENDDATE: Remaining duration of CB credit (in days) at the time of application in Home Credit
DAYS_ENDDATE_FACT: Days since CB credit ended at the time of application in Home Credit (only for closed credit)
AMT_CREDIT_MAX_OVERDUE: Maximal amount overdue on the Credit Bureau credit so far (at application date of loan in our sample)
CNT_CREDIT_PROLONG: How many times was the Credit Bureau credit prolonged
AMT_CREDIT_SUM: Current credit amount for the Credit Bureau credit
AMT_CREDIT_SUM_DEBT: Current debt on Credit Bureau credit
AMT_CREDIT_SUM_LIMIT: Current credit limit of credit card reported in Credit Bureau
AMT_CREDIT_SUM_OVERDUE: Current amount overdue on Credit Bureau credit
CREDIT_TYPE: Type of Credit Bureau credit (Car, cash,...)
DAYS_CREDIT_UPDATE: How many days before loan application did last information about the Credit Bureau credit come
AMT_ANNUITY: Annuity of the Credit Bureau credit

Bureau Balance table

SK_BUREAU_ID: Recoded ID of Credit Bureau credit (unique coding for each application) - use this to join to CREDIT_BUREAU table
MONTHS_BALANCE: Month of balance relative to application date (-1 means the freshest balance date) time only relative to the application
STATUS: Status of Credit Bureau loan during the month
# Last amended: 21st October, 2020
# Myfolder: C:\Users\Administrator\OneDrive\Documents\home_credit_default_risk
# Objective: 
#           Solving Kaggle problem: Home Credit Default Risk
#           Processing bureau and bureau_balance datasets.
#
# Data Source: https://www.kaggle.com/c/home-credit-default-risk/data
# Ref: https://www.kaggle.com/jsaguiar/lightgbm-with-simple-features
# 1.0 Libraries
#     (Some of these may not be needed here.)
%reset -f
import numpy as np
import pandas as pd
import gc

# 1.1 Reduce read data size
#     There is a file reducing.py
#      in this folder. A class
#       in it is used to reduce
#        dataframe size
#     (Code modified to
#      exclude 'category' dtype)
import reducing

# 1.2 Misc
import warnings
import os
warnings.simplefilter(action='ignore', category=FutureWarning)
# 1.3
pd.set_option('display.max_colwidth', -1)
# 1.4 Display multiple commands outputs from a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# 2.0 One-hot encoding function. Uses pd.get_dummies()
#     i) To transform 'object' columns to dummies. 
#    ii) Treat NaN as one of the categories
#   iii) Returns transformed-data and new-columns created

def one_hot_encoder(df, nan_as_category = True):
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df,
                        columns= categorical_columns,
                        dummy_na= nan_as_category       # Treat NaNs as category
                       )
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns
# 3.0 Prepare to read data
pathToData = "C:\\Users\\Administrator\\OneDrive\\Documents\\home_credit_default_risk"
os.chdir(pathToData)
# 3.1 Some constants
num_rows=None                # Implies read all rows
nan_as_category = True       # While transforming 
                             #   'object' columns to dummies
# 3.2 Read bureau data first
bureau = pd.read_csv(
                     'bureau.csv.zip',
                     nrows = None    # Read all rows
                    )

# 3.2.1 Reduce memory usage by appropriately
#       changing data-types per feature:

bureau = reducing.Reducer().reduce(bureau)
reduced df from 222.6203 MB to 165.3284 MB in 1.56 seconds
# 3.2.2 Explore data now
bureau.head(5)
bureau.shape   # (rows:17,16,428, cols: 17)
bureau.dtypes
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 215354 5714462 Closed currency 1 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 NaN 0.0 Consumer credit -131 NaN
1 215354 5714463 Active currency 1 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 NaN 0.0 Credit card -20 NaN
2 215354 5714464 Active currency 1 -203 0 528.0 NaN NaN 0 464323.5 NaN NaN 0.0 Consumer credit -16 NaN
3 215354 5714465 Active currency 1 -203 0 NaN NaN NaN 0 90000.0 NaN NaN 0.0 Credit card -16 NaN
4 215354 5714466 Active currency 1 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN NaN 0.0 Consumer credit -21 NaN
(1716428, 17)
SK_ID_CURR                uint32 
SK_ID_BUREAU              uint32 
CREDIT_ACTIVE             object 
CREDIT_CURRENCY           object 
DAYS_CREDIT               int16  
CREDIT_DAY_OVERDUE        uint16 
DAYS_CREDIT_ENDDATE       float64
DAYS_ENDDATE_FACT         float64
AMT_CREDIT_MAX_OVERDUE    float64
CNT_CREDIT_PROLONG        uint8  
AMT_CREDIT_SUM            float64
AMT_CREDIT_SUM_DEBT       float64
AMT_CREDIT_SUM_LIMIT      float64
AMT_CREDIT_SUM_OVERDUE    float32
CREDIT_TYPE               object 
DAYS_CREDIT_UPDATE        int32  
AMT_ANNUITY               float64
dtype: object
# 3.2.3 In all, how many are categoricals?
bureau.dtypes.value_counts()
float64    7
object     3
uint32     2
float32    1
int32      1
uint16     1
uint8      1
int16      1
dtype: int64
# 3.3
bureau.shape                       # (1716428, 17)

# 3.3.1
# What is the actual number of persons
#  who might have taken multiple loans?

bureau['SK_ID_CURR'].nunique()     # 305811  -- Many duplicate values exist
                                   #            Consider SK_ID_CURR as Foreign Key
                                   #            Primary key exists in application_train data
                                   # Primary key: SK_ID_BUREAU
            
# 3.3.2
# As expected, there are no duplicate values here
bureau['SK_ID_BUREAU'].nunique()   # 1716428 -- Unique id for each row 
(1716428, 17)
305811
1716428
# 3.4 Summary of active/closed cases from bureau
# We aggregate on these also
bureau['CREDIT_ACTIVE'].value_counts()
Closed      1079273
Active      630607 
Sold        6527   
Bad debt    21     
Name: CREDIT_ACTIVE, dtype: int64

Aggregation

bureau_balance will be aggregated and merged with bureau. bureau will then be aggregated and merged with 'application_train' data. bureau will be aggregated in three different ways. This aggregation will be by SK_ID_CURR. Finally, aggregated bureau, called bureau_agg, will be merged with 'application_train' over (SK_ID_CURR).
Aggregation over time is one way to extract behaviour of client. All categorical data is first OneHotEncoded (OHE). What is unique about this OHE is that NaN values are treated as categories.
# 4.0 OneHotEncode 'object' types in bureau
bureau, bureau_cat = one_hot_encoder(bureau, nan_as_category)
# 4.1
bureau.head()
bureau.shape          # (1716428, 40); 17-->40
print(bureau_cat)     # List of added columns
SK_ID_CURR SK_ID_BUREAU DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT ... CREDIT_TYPE_Loan for business development CREDIT_TYPE_Loan for purchase of shares (margin lending) CREDIT_TYPE_Loan for the purchase of equipment CREDIT_TYPE_Loan for working capital replenishment CREDIT_TYPE_Microloan CREDIT_TYPE_Mobile operator loan CREDIT_TYPE_Mortgage CREDIT_TYPE_Real estate loan CREDIT_TYPE_Unknown type of loan CREDIT_TYPE_nan
0 215354 5714462 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 ... 0 0 0 0 0 0 0 0 0 0
1 215354 5714463 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 ... 0 0 0 0 0 0 0 0 0 0
2 215354 5714464 -203 0 528.0 NaN NaN 0 464323.5 NaN ... 0 0 0 0 0 0 0 0 0 0
3 215354 5714465 -203 0 NaN NaN NaN 0 90000.0 NaN ... 0 0 0 0 0 0 0 0 0 0
4 215354 5714466 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN ... 0 0 0 0 0 0 0 0 0 0

5 rows × 40 columns

(1716428, 40)
['CREDIT_ACTIVE_Active', 'CREDIT_ACTIVE_Bad debt', 'CREDIT_ACTIVE_Closed', 'CREDIT_ACTIVE_Sold', 'CREDIT_ACTIVE_nan', 'CREDIT_CURRENCY_currency 1', 'CREDIT_CURRENCY_currency 2', 'CREDIT_CURRENCY_currency 3', 'CREDIT_CURRENCY_currency 4', 'CREDIT_CURRENCY_nan', 'CREDIT_TYPE_Another type of loan', 'CREDIT_TYPE_Car loan', 'CREDIT_TYPE_Cash loan (non-earmarked)', 'CREDIT_TYPE_Consumer credit', 'CREDIT_TYPE_Credit card', 'CREDIT_TYPE_Interbank credit', 'CREDIT_TYPE_Loan for business development', 'CREDIT_TYPE_Loan for purchase of shares (margin lending)', 'CREDIT_TYPE_Loan for the purchase of equipment', 'CREDIT_TYPE_Loan for working capital replenishment', 'CREDIT_TYPE_Microloan', 'CREDIT_TYPE_Mobile operator loan', 'CREDIT_TYPE_Mortgage', 'CREDIT_TYPE_Real estate loan', 'CREDIT_TYPE_Unknown type of loan', 'CREDIT_TYPE_nan']

bureau_balance

It is monthly data about the remaining balance of each one of the previous credits of clients that exist in dataset bureau. Each previous credit is identified by a unique ID, SK_ID_BUREAU, in dataset bureau. Each row in bureau_balance is one month of credit-due (from previous credit), and a single previous credit can have multiple rows, one for each month of the credit length.
In my personal view, it should be in decreasing order. That is, for every person identified by SK_ID_BUREAU, credits should be decreasing each passing month.
# 5.0 Read over bureau_balance data
#     and reduce memory usage through
#     conversion of data-types:

bb = pd.read_csv('bureau_balance.csv.zip', nrows = None)
bb = reducing.Reducer().reduce(bb)
reduced df from 624.8458 MB to 338.4582 MB in 4.78 seconds
# 5.0.1 Display few rows 
bb.head(10)

# 5.0.2 & Compare
bb.shape      # (27299925, 3) 
bureau.shape  # (1716428, 17)
SK_ID_BUREAU MONTHS_BALANCE STATUS
0 5715448 0 C
1 5715448 -1 C
2 5715448 -2 C
3 5715448 -3 C
4 5715448 -4 C
5 5715448 -5 C
6 5715448 -6 C
7 5715448 -7 C
8 5715448 -8 C
9 5715448 -9 0
(27299925, 3)
(1716428, 40)
# 5.1 There is just one 'object' column
bb.dtypes.value_counts()
object    1
int8      1
uint32    1
dtype: int64
# 5.2 Is the data about all bureau cases?
#      No, it appears it is not for all cases

bb['SK_ID_BUREAU'].nunique()    # 817395 << 1716428
817395
# 5.3 Just which cases are present in 'bureau' but absent
#     in 'bb'
bb_id_set = set(bb['SK_ID_BUREAU'])             # Set of IDs in bb
bureau_id_set = set(bureau['SK_ID_BUREAU'])     # Set of IDs in bureau
# 5.4 And here is the difference list.
#      How many of them? 
list(bureau_id_set - bb_id_set)[:5]      # sample [6292791,6292792,6292793,6292795,6292796,6292797,6292798,6292799]
len(bureau_id_set - bb_id_set) # 942074
[6292791, 6292792, 6292793, 6292795, 6292796]
942074
# 5.5 OK. So let us OneHotEncode bb
bb, bb_cat = one_hot_encoder(bb, nan_as_category)
# 5.6 Examine the results
bb.head()
bb.shape   # (27299925, 11) ; 3-->11
           # 1 (ID) + 1 (numeric) + 9 (dummy)
bb_cat     # New columns added
SK_ID_BUREAU MONTHS_BALANCE STATUS_0 STATUS_1 STATUS_2 STATUS_3 STATUS_4 STATUS_5 STATUS_C STATUS_X STATUS_nan
0 5715448 0 0 0 0 0 0 0 1 0 0
1 5715448 -1 0 0 0 0 0 0 1 0 0
2 5715448 -2 0 0 0 0 0 0 1 0 0
3 5715448 -3 0 0 0 0 0 0 1 0 0
4 5715448 -4 0 0 0 0 0 0 1 0 0
(27299925, 11)
['STATUS_0',
 'STATUS_1',
 'STATUS_2',
 'STATUS_3',
 'STATUS_4',
 'STATUS_5',
 'STATUS_C',
 'STATUS_X',
 'STATUS_nan']

Performing aggregations in bb

There is one numeric feature: 'MONTHS_BALANCE'. On this feature we will perform ['min', 'max', 'size']. And on the rest of the features,dummy features, we will perform [mean]. Aggregation is by unique bureau ID, SK_ID_BUREAU. Resulting dataset is called bureau_agg.
# 6.0 Bureau balance: Perform aggregations and merge with bureau.csv
#     First prepare a dictionary listing operations to be performed
#     on various features:

bb_aggregations = {'MONTHS_BALANCE': ['min', 'max', 'size']}
for col in bb_cat:
    bb_aggregations[col] = ['mean']

# 6.0.1    
len(bb_aggregations)     # 10  
10
# 6.1 So what all aggregations to perform column-wise

bb_aggregations
{'MONTHS_BALANCE': ['min', 'max', 'size'],
 'STATUS_0': ['mean'],
 'STATUS_1': ['mean'],
 'STATUS_2': ['mean'],
 'STATUS_3': ['mean'],
 'STATUS_4': ['mean'],
 'STATUS_5': ['mean'],
 'STATUS_C': ['mean'],
 'STATUS_X': ['mean'],
 'STATUS_nan': ['mean']}
# 6.2 Perform aggregations now in bb:

grouped =  bb.groupby('SK_ID_BUREAU')
bb_agg = bb.groupby('SK_ID_BUREAU').agg(bb_aggregations)
# 6.3
bb_agg.shape      # (817395, 12)
bb_agg.columns

# 6.3.1 Note that 'SK_ID_BUREAU'
#       the grouping column is
#       now table-index

bb_agg.head()
(817395, 12)
MultiIndex([('MONTHS_BALANCE',  'min'),
            ('MONTHS_BALANCE',  'max'),
            ('MONTHS_BALANCE', 'size'),
            (      'STATUS_0', 'mean'),
            (      'STATUS_1', 'mean'),
            (      'STATUS_2', 'mean'),
            (      'STATUS_3', 'mean'),
            (      'STATUS_4', 'mean'),
            (      'STATUS_5', 'mean'),
            (      'STATUS_C', 'mean'),
            (      'STATUS_X', 'mean'),
            (    'STATUS_nan', 'mean')],
           )
MONTHS_BALANCE STATUS_0 STATUS_1 STATUS_2 STATUS_3 STATUS_4 STATUS_5 STATUS_C STATUS_X STATUS_nan
min max size mean mean mean mean mean mean mean mean mean
SK_ID_BUREAU
5001709 -96 0 97 0.000000 0.0 0.0 0.0 0.0 0.0 0.886598 0.113402 0
5001710 -82 0 83 0.060241 0.0 0.0 0.0 0.0 0.0 0.578313 0.361446 0
5001711 -3 0 4 0.750000 0.0 0.0 0.0 0.0 0.0 0.000000 0.250000 0
5001712 -18 0 19 0.526316 0.0 0.0 0.0 0.0 0.0 0.473684 0.000000 0
5001713 -21 0 22 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 1.000000 0
# 6.4 Rename bb_agg columns
bb_agg.columns = pd.Index([e[0] + "_" + e[1].upper() for e in bb_agg.columns.tolist()])
# 6.4.1
bb_agg.columns.tolist()
bb_agg.head()
['MONTHS_BALANCE_MIN',
 'MONTHS_BALANCE_MAX',
 'MONTHS_BALANCE_SIZE',
 'STATUS_0_MEAN',
 'STATUS_1_MEAN',
 'STATUS_2_MEAN',
 'STATUS_3_MEAN',
 'STATUS_4_MEAN',
 'STATUS_5_MEAN',
 'STATUS_C_MEAN',
 'STATUS_X_MEAN',
 'STATUS_nan_MEAN']
MONTHS_BALANCE_MIN MONTHS_BALANCE_MAX MONTHS_BALANCE_SIZE STATUS_0_MEAN STATUS_1_MEAN STATUS_2_MEAN STATUS_3_MEAN STATUS_4_MEAN STATUS_5_MEAN STATUS_C_MEAN STATUS_X_MEAN STATUS_nan_MEAN
SK_ID_BUREAU
5001709 -96 0 97 0.000000 0.0 0.0 0.0 0.0 0.0 0.886598 0.113402 0
5001710 -82 0 83 0.060241 0.0 0.0 0.0 0.0 0.0 0.578313 0.361446 0
5001711 -3 0 4 0.750000 0.0 0.0 0.0 0.0 0.0 0.000000 0.250000 0
5001712 -18 0 19 0.526316 0.0 0.0 0.0 0.0 0.0 0.473684 0.000000 0
5001713 -21 0 22 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 1.000000 0
# 6.5 Merge aggregated bb with bureau

bureau = bureau.join(
                     bb_agg,
                     how='left',
                     on='SK_ID_BUREAU'
                    )
# 6.5.1

bureau.head()
bureau.shape   # (1716428, 52)
bureau.dtypes  
SK_ID_CURR SK_ID_BUREAU DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT ... MONTHS_BALANCE_SIZE STATUS_0_MEAN STATUS_1_MEAN STATUS_2_MEAN STATUS_3_MEAN STATUS_4_MEAN STATUS_5_MEAN STATUS_C_MEAN STATUS_X_MEAN STATUS_nan_MEAN
0 215354 5714462 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 215354 5714463 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 215354 5714464 -203 0 528.0 NaN NaN 0 464323.5 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 215354 5714465 -203 0 NaN NaN NaN 0 90000.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 215354 5714466 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 52 columns

(1716428, 52)
SK_ID_CURR                                                  uint32 
SK_ID_BUREAU                                                uint32 
DAYS_CREDIT                                                 int16  
CREDIT_DAY_OVERDUE                                          uint16 
DAYS_CREDIT_ENDDATE                                         float64
DAYS_ENDDATE_FACT                                           float64
AMT_CREDIT_MAX_OVERDUE                                      float64
CNT_CREDIT_PROLONG                                          uint8  
AMT_CREDIT_SUM                                              float64
AMT_CREDIT_SUM_DEBT                                         float64
AMT_CREDIT_SUM_LIMIT                                        float64
AMT_CREDIT_SUM_OVERDUE                                      float32
DAYS_CREDIT_UPDATE                                          int32  
AMT_ANNUITY                                                 float64
CREDIT_ACTIVE_Active                                        uint8  
CREDIT_ACTIVE_Bad debt                                      uint8  
CREDIT_ACTIVE_Closed                                        uint8  
CREDIT_ACTIVE_Sold                                          uint8  
CREDIT_ACTIVE_nan                                           uint8  
CREDIT_CURRENCY_currency 1                                  uint8  
CREDIT_CURRENCY_currency 2                                  uint8  
CREDIT_CURRENCY_currency 3                                  uint8  
CREDIT_CURRENCY_currency 4                                  uint8  
CREDIT_CURRENCY_nan                                         uint8  
CREDIT_TYPE_Another type of loan                            uint8  
CREDIT_TYPE_Car loan                                        uint8  
CREDIT_TYPE_Cash loan (non-earmarked)                       uint8  
CREDIT_TYPE_Consumer credit                                 uint8  
CREDIT_TYPE_Credit card                                     uint8  
CREDIT_TYPE_Interbank credit                                uint8  
CREDIT_TYPE_Loan for business development                   uint8  
CREDIT_TYPE_Loan for purchase of shares (margin lending)    uint8  
CREDIT_TYPE_Loan for the purchase of equipment              uint8  
CREDIT_TYPE_Loan for working capital replenishment          uint8  
CREDIT_TYPE_Microloan                                       uint8  
CREDIT_TYPE_Mobile operator loan                            uint8  
CREDIT_TYPE_Mortgage                                        uint8  
CREDIT_TYPE_Real estate loan                                uint8  
CREDIT_TYPE_Unknown type of loan                            uint8  
CREDIT_TYPE_nan                                             uint8  
MONTHS_BALANCE_MIN                                          float64
MONTHS_BALANCE_MAX                                          float64
MONTHS_BALANCE_SIZE                                         float64
STATUS_0_MEAN                                               float64
STATUS_1_MEAN                                               float64
STATUS_2_MEAN                                               float64
STATUS_3_MEAN                                               float64
STATUS_4_MEAN                                               float64
STATUS_5_MEAN                                               float64
STATUS_C_MEAN                                               float64
STATUS_X_MEAN                                               float64
STATUS_nan_MEAN                                             float64
dtype: object
# 6.5.2 Just for curiosity, what happened
#       to those rows in 'bureau' where there
#       was no matching record in bb_agg. The list
#       of such IDs is:
#       [6292791,6292792,6292793,6292795,6292796,6292797,6292798,6292799]

bureau[bureau['SK_ID_BUREAU'] ==6292791]
SK_ID_CURR SK_ID_BUREAU DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT ... MONTHS_BALANCE_SIZE STATUS_0_MEAN STATUS_1_MEAN STATUS_2_MEAN STATUS_3_MEAN STATUS_4_MEAN STATUS_5_MEAN STATUS_C_MEAN STATUS_X_MEAN STATUS_nan_MEAN
941854 190757 6292791 -707 0 -300.0 -300.0 0.0 0 2250000.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

1 rows × 52 columns

# 6.6 Drop SK_ID_BUREAU as bb has finally merged.

bureau.drop(['SK_ID_BUREAU'],
            axis=1,
            inplace= True
           )
# We have three types of columns
# Categorical columns generated from bureau
# Categorical columns generated from bb
# Numerical columns

Performing aggregations in bureau

Aggregate 14 original numeric columns, as: ['min', 'max', 'mean', 'var']
Aggregate rest of the columns that is dummy columns as: [mean].
This constitutes one of the three aggretaions. Aggregation is by SK_ID_CURR. Resulting dataset is called bureau_agg
# 7.0 Have a look at bureau again.
#      SK_ID_CURR repeats for many cases.
#         So, there is a case for aggregation

bureau.shape     # (1716428, 51)
bureau.head()
(1716428, 51)
SK_ID_CURR DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT ... MONTHS_BALANCE_SIZE STATUS_0_MEAN STATUS_1_MEAN STATUS_2_MEAN STATUS_3_MEAN STATUS_4_MEAN STATUS_5_MEAN STATUS_C_MEAN STATUS_X_MEAN STATUS_nan_MEAN
0 215354 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 215354 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 215354 -203 0 528.0 NaN NaN 0 464323.5 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 215354 -203 0 NaN NaN NaN 0 90000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 215354 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 51 columns

## Aggregation strategy
# 7.1 Numeric features
#     Columns: Bureau + bureau_balance numeric features
#              Last three columns are from bureau_balance
#              Total: 11 + 3 = 14

num_aggregations = {
                     'DAYS_CREDIT':             ['min', 'max', 'mean', 'var'],
                     'DAYS_CREDIT_ENDDATE':     ['min', 'max', 'mean'],
                     'DAYS_CREDIT_UPDATE':      ['mean'],
                     'CREDIT_DAY_OVERDUE':      ['max', 'mean'],
                     'AMT_CREDIT_MAX_OVERDUE':  ['mean'],
                     'AMT_CREDIT_SUM':          ['max', 'mean', 'sum'],
                     'AMT_CREDIT_SUM_DEBT':     ['max', 'mean', 'sum'],
                     'AMT_CREDIT_SUM_OVERDUE':  ['mean'],
                     'AMT_CREDIT_SUM_LIMIT':    ['mean', 'sum'],
                     'AMT_ANNUITY':             ['max', 'mean'],
                     'CNT_CREDIT_PROLONG':      ['sum'],
                     'MONTHS_BALANCE_MIN':      ['min'],
                     'MONTHS_BALANCE_MAX':      ['max'],
                     'MONTHS_BALANCE_SIZE':     ['mean', 'sum']
                   }

len(num_aggregations)   # 14
14
# 7.2 Bureau categorical features. Derived from:
#       'CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE', 
#        Total: 

cat_aggregations = {}
bureau_cat      # bureau_cat are newly created dummy columns
                #  but all are numerical columns

# 7.2.1    
len(bureau_cat) # 26    
['CREDIT_ACTIVE_Active',
 'CREDIT_ACTIVE_Bad debt',
 'CREDIT_ACTIVE_Closed',
 'CREDIT_ACTIVE_Sold',
 'CREDIT_ACTIVE_nan',
 'CREDIT_CURRENCY_currency 1',
 'CREDIT_CURRENCY_currency 2',
 'CREDIT_CURRENCY_currency 3',
 'CREDIT_CURRENCY_currency 4',
 'CREDIT_CURRENCY_nan',
 'CREDIT_TYPE_Another type of loan',
 'CREDIT_TYPE_Car loan',
 'CREDIT_TYPE_Cash loan (non-earmarked)',
 'CREDIT_TYPE_Consumer credit',
 'CREDIT_TYPE_Credit card',
 'CREDIT_TYPE_Interbank credit',
 'CREDIT_TYPE_Loan for business development',
 'CREDIT_TYPE_Loan for purchase of shares (margin lending)',
 'CREDIT_TYPE_Loan for the purchase of equipment',
 'CREDIT_TYPE_Loan for working capital replenishment',
 'CREDIT_TYPE_Microloan',
 'CREDIT_TYPE_Mobile operator loan',
 'CREDIT_TYPE_Mortgage',
 'CREDIT_TYPE_Real estate loan',
 'CREDIT_TYPE_Unknown type of loan',
 'CREDIT_TYPE_nan']
26
# 7.2.2 For all these new dummy columns in bureau, we will
#       take mean
for cat in bureau_cat: cat_aggregations[cat] = ['mean']
cat_aggregations    

len(cat_aggregations)   # 26
{'CREDIT_ACTIVE_Active': ['mean'],
 'CREDIT_ACTIVE_Bad debt': ['mean'],
 'CREDIT_ACTIVE_Closed': ['mean'],
 'CREDIT_ACTIVE_Sold': ['mean'],
 'CREDIT_ACTIVE_nan': ['mean'],
 'CREDIT_CURRENCY_currency 1': ['mean'],
 'CREDIT_CURRENCY_currency 2': ['mean'],
 'CREDIT_CURRENCY_currency 3': ['mean'],
 'CREDIT_CURRENCY_currency 4': ['mean'],
 'CREDIT_CURRENCY_nan': ['mean'],
 'CREDIT_TYPE_Another type of loan': ['mean'],
 'CREDIT_TYPE_Car loan': ['mean'],
 'CREDIT_TYPE_Cash loan (non-earmarked)': ['mean'],
 'CREDIT_TYPE_Consumer credit': ['mean'],
 'CREDIT_TYPE_Credit card': ['mean'],
 'CREDIT_TYPE_Interbank credit': ['mean'],
 'CREDIT_TYPE_Loan for business development': ['mean'],
 'CREDIT_TYPE_Loan for purchase of shares (margin lending)': ['mean'],
 'CREDIT_TYPE_Loan for the purchase of equipment': ['mean'],
 'CREDIT_TYPE_Loan for working capital replenishment': ['mean'],
 'CREDIT_TYPE_Microloan': ['mean'],
 'CREDIT_TYPE_Mobile operator loan': ['mean'],
 'CREDIT_TYPE_Mortgage': ['mean'],
 'CREDIT_TYPE_Real estate loan': ['mean'],
 'CREDIT_TYPE_Unknown type of loan': ['mean'],
 'CREDIT_TYPE_nan': ['mean']}
26
# 7.3.1 In addition, we have in bureau. columns that merged
#        from 'bb' ie bb_cat
#         So here is our full list
bb_cat
len(bb_cat)             # 9

# 7.3.2
for cat in bb_cat: cat_aggregations[cat + "_MEAN"] = ['mean']
cat_aggregations 

len(cat_aggregations)   # 26 + 9 = 35
['STATUS_0',
 'STATUS_1',
 'STATUS_2',
 'STATUS_3',
 'STATUS_4',
 'STATUS_5',
 'STATUS_C',
 'STATUS_X',
 'STATUS_nan']
9
{'CREDIT_ACTIVE_Active': ['mean'],
 'CREDIT_ACTIVE_Bad debt': ['mean'],
 'CREDIT_ACTIVE_Closed': ['mean'],
 'CREDIT_ACTIVE_Sold': ['mean'],
 'CREDIT_ACTIVE_nan': ['mean'],
 'CREDIT_CURRENCY_currency 1': ['mean'],
 'CREDIT_CURRENCY_currency 2': ['mean'],
 'CREDIT_CURRENCY_currency 3': ['mean'],
 'CREDIT_CURRENCY_currency 4': ['mean'],
 'CREDIT_CURRENCY_nan': ['mean'],
 'CREDIT_TYPE_Another type of loan': ['mean'],
 'CREDIT_TYPE_Car loan': ['mean'],
 'CREDIT_TYPE_Cash loan (non-earmarked)': ['mean'],
 'CREDIT_TYPE_Consumer credit': ['mean'],
 'CREDIT_TYPE_Credit card': ['mean'],
 'CREDIT_TYPE_Interbank credit': ['mean'],
 'CREDIT_TYPE_Loan for business development': ['mean'],
 'CREDIT_TYPE_Loan for purchase of shares (margin lending)': ['mean'],
 'CREDIT_TYPE_Loan for the purchase of equipment': ['mean'],
 'CREDIT_TYPE_Loan for working capital replenishment': ['mean'],
 'CREDIT_TYPE_Microloan': ['mean'],
 'CREDIT_TYPE_Mobile operator loan': ['mean'],
 'CREDIT_TYPE_Mortgage': ['mean'],
 'CREDIT_TYPE_Real estate loan': ['mean'],
 'CREDIT_TYPE_Unknown type of loan': ['mean'],
 'CREDIT_TYPE_nan': ['mean'],
 'STATUS_0_MEAN': ['mean'],
 'STATUS_1_MEAN': ['mean'],
 'STATUS_2_MEAN': ['mean'],
 'STATUS_3_MEAN': ['mean'],
 'STATUS_4_MEAN': ['mean'],
 'STATUS_5_MEAN': ['mean'],
 'STATUS_C_MEAN': ['mean'],
 'STATUS_X_MEAN': ['mean'],
 'STATUS_nan_MEAN': ['mean']}
35
# 7.4 Have a look at bureau columns again
#      Just to compare above results with what
#       already exists

bureau.columns        # 51
len(bureau.columns)   # 35 (dummy) + 14 (num) + 1 (SK_ID_CURR) + 1 (DAYS_ENDDATE_FACT) = 51
Index(['SK_ID_CURR', 'DAYS_CREDIT', 'CREDIT_DAY_OVERDUE',
       'DAYS_CREDIT_ENDDATE', 'DAYS_ENDDATE_FACT', 'AMT_CREDIT_MAX_OVERDUE',
       'CNT_CREDIT_PROLONG', 'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT',
       'AMT_CREDIT_SUM_LIMIT', 'AMT_CREDIT_SUM_OVERDUE', 'DAYS_CREDIT_UPDATE',
       'AMT_ANNUITY', 'CREDIT_ACTIVE_Active', 'CREDIT_ACTIVE_Bad debt',
       'CREDIT_ACTIVE_Closed', 'CREDIT_ACTIVE_Sold', 'CREDIT_ACTIVE_nan',
       'CREDIT_CURRENCY_currency 1', 'CREDIT_CURRENCY_currency 2',
       'CREDIT_CURRENCY_currency 3', 'CREDIT_CURRENCY_currency 4',
       'CREDIT_CURRENCY_nan', 'CREDIT_TYPE_Another type of loan',
       'CREDIT_TYPE_Car loan', 'CREDIT_TYPE_Cash loan (non-earmarked)',
       'CREDIT_TYPE_Consumer credit', 'CREDIT_TYPE_Credit card',
       'CREDIT_TYPE_Interbank credit',
       'CREDIT_TYPE_Loan for business development',
       'CREDIT_TYPE_Loan for purchase of shares (margin lending)',
       'CREDIT_TYPE_Loan for the purchase of equipment',
       'CREDIT_TYPE_Loan for working capital replenishment',
       'CREDIT_TYPE_Microloan', 'CREDIT_TYPE_Mobile operator loan',
       'CREDIT_TYPE_Mortgage', 'CREDIT_TYPE_Real estate loan',
       'CREDIT_TYPE_Unknown type of loan', 'CREDIT_TYPE_nan',
       'MONTHS_BALANCE_MIN', 'MONTHS_BALANCE_MAX', 'MONTHS_BALANCE_SIZE',
       'STATUS_0_MEAN', 'STATUS_1_MEAN', 'STATUS_2_MEAN', 'STATUS_3_MEAN',
       'STATUS_4_MEAN', 'STATUS_5_MEAN', 'STATUS_C_MEAN', 'STATUS_X_MEAN',
       'STATUS_nan_MEAN'],
      dtype='object')
51
# 7.5 Now that we have decided 
#     our aggregation strategy for each column
#      (except 2), let us now aggregate:
#         Note that SK_ID_CURR now becomes an index to data

grouped = bureau.groupby('SK_ID_CURR')
bureau_agg = grouped.agg({**num_aggregations, **cat_aggregations})
# 7.6
bureau_agg.head()
bureau_agg.shape  # (305811, 62) (including newly created min, max etc columns)
DAYS_CREDIT DAYS_CREDIT_ENDDATE DAYS_CREDIT_UPDATE CREDIT_DAY_OVERDUE ... CREDIT_TYPE_nan STATUS_0_MEAN STATUS_1_MEAN STATUS_2_MEAN STATUS_3_MEAN STATUS_4_MEAN STATUS_5_MEAN STATUS_C_MEAN STATUS_X_MEAN STATUS_nan_MEAN
min max mean var min max mean mean max mean ... mean mean mean mean mean mean mean mean mean mean
SK_ID_CURR
100001 -1572 -49 -735.000000 240043.666667 -1329.0 1778.0 82.428571 -93.142857 0 0.0 ... 0 0.336651 0.007519 0.0 0.0 0.0 0.0 0.441240 0.214590 0.0
100002 -1437 -103 -874.000000 186150.000000 -1072.0 780.0 -349.000000 -499.875000 0 0.0 ... 0 0.406960 0.255682 0.0 0.0 0.0 0.0 0.175426 0.161932 0.0
100003 -2586 -606 -1400.750000 827783.583333 -2434.0 1216.0 -544.500000 -816.000000 0 0.0 ... 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
100004 -1326 -408 -867.000000 421362.000000 -595.0 -382.0 -488.500000 -532.000000 0 0.0 ... 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
100005 -373 -62 -190.666667 26340.333333 -128.0 1324.0 439.333333 -54.333333 0 0.0 ... 0 0.735043 0.000000 0.0 0.0 0.0 0.0 0.128205 0.136752 0.0

5 rows × 62 columns

(305811, 62)
# 7.7 Remove hierarchical index from bureau_agg
bureau_agg.columns       # 62 
bureau_agg.columns = pd.Index(['BURO_' + e[0] + "_" + e[1].upper() for e in bureau_agg.columns.tolist()])
MultiIndex([(                                             'DAYS_CREDIT', ...),
            (                                             'DAYS_CREDIT', ...),
            (                                             'DAYS_CREDIT', ...),
            (                                             'DAYS_CREDIT', ...),
            (                                     'DAYS_CREDIT_ENDDATE', ...),
            (                                     'DAYS_CREDIT_ENDDATE', ...),
            (                                     'DAYS_CREDIT_ENDDATE', ...),
            (                                      'DAYS_CREDIT_UPDATE', ...),
            (                                      'CREDIT_DAY_OVERDUE', ...),
            (                                      'CREDIT_DAY_OVERDUE', ...),
            (                                  'AMT_CREDIT_MAX_OVERDUE', ...),
            (                                          'AMT_CREDIT_SUM', ...),
            (                                          'AMT_CREDIT_SUM', ...),
            (                                          'AMT_CREDIT_SUM', ...),
            (                                     'AMT_CREDIT_SUM_DEBT', ...),
            (                                     'AMT_CREDIT_SUM_DEBT', ...),
            (                                     'AMT_CREDIT_SUM_DEBT', ...),
            (                                  'AMT_CREDIT_SUM_OVERDUE', ...),
            (                                    'AMT_CREDIT_SUM_LIMIT', ...),
            (                                    'AMT_CREDIT_SUM_LIMIT', ...),
            (                                             'AMT_ANNUITY', ...),
            (                                             'AMT_ANNUITY', ...),
            (                                      'CNT_CREDIT_PROLONG', ...),
            (                                      'MONTHS_BALANCE_MIN', ...),
            (                                      'MONTHS_BALANCE_MAX', ...),
            (                                     'MONTHS_BALANCE_SIZE', ...),
            (                                     'MONTHS_BALANCE_SIZE', ...),
            (                                    'CREDIT_ACTIVE_Active', ...),
            (                                  'CREDIT_ACTIVE_Bad debt', ...),
            (                                    'CREDIT_ACTIVE_Closed', ...),
            (                                      'CREDIT_ACTIVE_Sold', ...),
            (                                       'CREDIT_ACTIVE_nan', ...),
            (                              'CREDIT_CURRENCY_currency 1', ...),
            (                              'CREDIT_CURRENCY_currency 2', ...),
            (                              'CREDIT_CURRENCY_currency 3', ...),
            (                              'CREDIT_CURRENCY_currency 4', ...),
            (                                     'CREDIT_CURRENCY_nan', ...),
            (                        'CREDIT_TYPE_Another type of loan', ...),
            (                                    'CREDIT_TYPE_Car loan', ...),
            (                   'CREDIT_TYPE_Cash loan (non-earmarked)', ...),
            (                             'CREDIT_TYPE_Consumer credit', ...),
            (                                 'CREDIT_TYPE_Credit card', ...),
            (                            'CREDIT_TYPE_Interbank credit', ...),
            (               'CREDIT_TYPE_Loan for business development', ...),
            ('CREDIT_TYPE_Loan for purchase of shares (margin lending)', ...),
            (          'CREDIT_TYPE_Loan for the purchase of equipment', ...),
            (      'CREDIT_TYPE_Loan for working capital replenishment', ...),
            (                                   'CREDIT_TYPE_Microloan', ...),
            (                        'CREDIT_TYPE_Mobile operator loan', ...),
            (                                    'CREDIT_TYPE_Mortgage', ...),
            (                            'CREDIT_TYPE_Real estate loan', ...),
            (                        'CREDIT_TYPE_Unknown type of loan', ...),
            (                                         'CREDIT_TYPE_nan', ...),
            (                                           'STATUS_0_MEAN', ...),
            (                                           'STATUS_1_MEAN', ...),
            (                                           'STATUS_2_MEAN', ...),
            (                                           'STATUS_3_MEAN', ...),
            (                                           'STATUS_4_MEAN', ...),
            (                                           'STATUS_5_MEAN', ...),
            (                                           'STATUS_C_MEAN', ...),
            (                                           'STATUS_X_MEAN', ...),
            (                                         'STATUS_nan_MEAN', ...)],
           )
# 7.8
bureau_agg.head()
# 7.8.1 Note that SK_ID_CURR is now an index to table
bureau_agg.columns   # 62: Due to creation of min, max, var etc columns
BURO_DAYS_CREDIT_MIN BURO_DAYS_CREDIT_MAX BURO_DAYS_CREDIT_MEAN BURO_DAYS_CREDIT_VAR BURO_DAYS_CREDIT_ENDDATE_MIN BURO_DAYS_CREDIT_ENDDATE_MAX BURO_DAYS_CREDIT_ENDDATE_MEAN BURO_DAYS_CREDIT_UPDATE_MEAN BURO_CREDIT_DAY_OVERDUE_MAX BURO_CREDIT_DAY_OVERDUE_MEAN ... BURO_CREDIT_TYPE_nan_MEAN BURO_STATUS_0_MEAN_MEAN BURO_STATUS_1_MEAN_MEAN BURO_STATUS_2_MEAN_MEAN BURO_STATUS_3_MEAN_MEAN BURO_STATUS_4_MEAN_MEAN BURO_STATUS_5_MEAN_MEAN BURO_STATUS_C_MEAN_MEAN BURO_STATUS_X_MEAN_MEAN BURO_STATUS_nan_MEAN_MEAN
SK_ID_CURR
100001 -1572 -49 -735.000000 240043.666667 -1329.0 1778.0 82.428571 -93.142857 0 0.0 ... 0 0.336651 0.007519 0.0 0.0 0.0 0.0 0.441240 0.214590 0.0
100002 -1437 -103 -874.000000 186150.000000 -1072.0 780.0 -349.000000 -499.875000 0 0.0 ... 0 0.406960 0.255682 0.0 0.0 0.0 0.0 0.175426 0.161932 0.0
100003 -2586 -606 -1400.750000 827783.583333 -2434.0 1216.0 -544.500000 -816.000000 0 0.0 ... 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
100004 -1326 -408 -867.000000 421362.000000 -595.0 -382.0 -488.500000 -532.000000 0 0.0 ... 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
100005 -373 -62 -190.666667 26340.333333 -128.0 1324.0 439.333333 -54.333333 0 0.0 ... 0 0.735043 0.000000 0.0 0.0 0.0 0.0 0.128205 0.136752 0.0

5 rows × 62 columns

Index(['BURO_DAYS_CREDIT_MIN', 'BURO_DAYS_CREDIT_MAX', 'BURO_DAYS_CREDIT_MEAN',
       'BURO_DAYS_CREDIT_VAR', 'BURO_DAYS_CREDIT_ENDDATE_MIN',
       'BURO_DAYS_CREDIT_ENDDATE_MAX', 'BURO_DAYS_CREDIT_ENDDATE_MEAN',
       'BURO_DAYS_CREDIT_UPDATE_MEAN', 'BURO_CREDIT_DAY_OVERDUE_MAX',
       'BURO_CREDIT_DAY_OVERDUE_MEAN', 'BURO_AMT_CREDIT_MAX_OVERDUE_MEAN',
       'BURO_AMT_CREDIT_SUM_MAX', 'BURO_AMT_CREDIT_SUM_MEAN',
       'BURO_AMT_CREDIT_SUM_SUM', 'BURO_AMT_CREDIT_SUM_DEBT_MAX',
       'BURO_AMT_CREDIT_SUM_DEBT_MEAN', 'BURO_AMT_CREDIT_SUM_DEBT_SUM',
       'BURO_AMT_CREDIT_SUM_OVERDUE_MEAN', 'BURO_AMT_CREDIT_SUM_LIMIT_MEAN',
       'BURO_AMT_CREDIT_SUM_LIMIT_SUM', 'BURO_AMT_ANNUITY_MAX',
       'BURO_AMT_ANNUITY_MEAN', 'BURO_CNT_CREDIT_PROLONG_SUM',
       'BURO_MONTHS_BALANCE_MIN_MIN', 'BURO_MONTHS_BALANCE_MAX_MAX',
       'BURO_MONTHS_BALANCE_SIZE_MEAN', 'BURO_MONTHS_BALANCE_SIZE_SUM',
       'BURO_CREDIT_ACTIVE_Active_MEAN', 'BURO_CREDIT_ACTIVE_Bad debt_MEAN',
       'BURO_CREDIT_ACTIVE_Closed_MEAN', 'BURO_CREDIT_ACTIVE_Sold_MEAN',
       'BURO_CREDIT_ACTIVE_nan_MEAN', 'BURO_CREDIT_CURRENCY_currency 1_MEAN',
       'BURO_CREDIT_CURRENCY_currency 2_MEAN',
       'BURO_CREDIT_CURRENCY_currency 3_MEAN',
       'BURO_CREDIT_CURRENCY_currency 4_MEAN', 'BURO_CREDIT_CURRENCY_nan_MEAN',
       'BURO_CREDIT_TYPE_Another type of loan_MEAN',
       'BURO_CREDIT_TYPE_Car loan_MEAN',
       'BURO_CREDIT_TYPE_Cash loan (non-earmarked)_MEAN',
       'BURO_CREDIT_TYPE_Consumer credit_MEAN',
       'BURO_CREDIT_TYPE_Credit card_MEAN',
       'BURO_CREDIT_TYPE_Interbank credit_MEAN',
       'BURO_CREDIT_TYPE_Loan for business development_MEAN',
       'BURO_CREDIT_TYPE_Loan for purchase of shares (margin lending)_MEAN',
       'BURO_CREDIT_TYPE_Loan for the purchase of equipment_MEAN',
       'BURO_CREDIT_TYPE_Loan for working capital replenishment_MEAN',
       'BURO_CREDIT_TYPE_Microloan_MEAN',
       'BURO_CREDIT_TYPE_Mobile operator loan_MEAN',
       'BURO_CREDIT_TYPE_Mortgage_MEAN',
       'BURO_CREDIT_TYPE_Real estate loan_MEAN',
       'BURO_CREDIT_TYPE_Unknown type of loan_MEAN',
       'BURO_CREDIT_TYPE_nan_MEAN', 'BURO_STATUS_0_MEAN_MEAN',
       'BURO_STATUS_1_MEAN_MEAN', 'BURO_STATUS_2_MEAN_MEAN',
       'BURO_STATUS_3_MEAN_MEAN', 'BURO_STATUS_4_MEAN_MEAN',
       'BURO_STATUS_5_MEAN_MEAN', 'BURO_STATUS_C_MEAN_MEAN',
       'BURO_STATUS_X_MEAN_MEAN', 'BURO_STATUS_nan_MEAN_MEAN'],
      dtype='object')
# 7.9 No duplicate index
bureau_agg.index.nunique()   # 305811
len(set(bureau_agg.index))   # 305811
305811
305811

More Aggregation and merger

We now filter bureau on CREDIT_ACTIVE_Active feature. This will create two subsets of data. This feature has values of 1 and 0.
Filter data where CREDIT_ACTIVE_Active value is 1. Then aggregate(only) numeric features of this filtered data-subset by grouping on SK_ID_CURR. Next, filter, bureau, on CREDIT_ACTIVE_Closed = 1 . And again aggregate the subset on numeric features. Merge all these with bureau_agg (NOT bureau.)

It is as if we are trying to extract the behaviour of those whose credits are active and those whose credits are closed.
# 8.0 In which cases credit is active? Filter data
active = bureau[bureau['CREDIT_ACTIVE_Active'] == 1]
active.head()
active.shape   # (630607, 51)
SK_ID_CURR DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT ... MONTHS_BALANCE_SIZE STATUS_0_MEAN STATUS_1_MEAN STATUS_2_MEAN STATUS_3_MEAN STATUS_4_MEAN STATUS_5_MEAN STATUS_C_MEAN STATUS_X_MEAN STATUS_nan_MEAN
1 215354 -208 0 1075.0 NaN NaN 0 225000.0 171342.00 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 215354 -203 0 528.0 NaN NaN 0 464323.5 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 215354 -203 0 NaN NaN NaN 0 90000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 215354 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 215354 -273 0 27460.0 NaN 0.0 0 180000.0 71017.38 108982.62 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 51 columns

(630607, 51)
# 8.1 Aggregate numercial features of the filtered subset over SK_ID_CURR
active_agg = active.groupby('SK_ID_CURR').agg(num_aggregations)
# 8.1.1
active_agg.head()
active_agg.shape   # (251815, 27)
DAYS_CREDIT DAYS_CREDIT_ENDDATE DAYS_CREDIT_UPDATE CREDIT_DAY_OVERDUE ... AMT_CREDIT_SUM_OVERDUE AMT_CREDIT_SUM_LIMIT AMT_ANNUITY CNT_CREDIT_PROLONG MONTHS_BALANCE_MIN MONTHS_BALANCE_MAX MONTHS_BALANCE_SIZE
min max mean var min max mean mean max mean ... mean mean sum max mean sum min max mean sum
SK_ID_CURR
100001 -559 -49 -309.333333 65110.333333 411.0 1778.0 1030.333333 -10.666667 0 0.0 ... 0.0 0.0000 0.000 10822.5 8272.50 0 -18.0 0.0 10.666667 32.0
100002 -1042 -103 -572.500000 440860.500000 780.0 780.0 780.000000 -15.500000 0 0.0 ... 0.0 15994.2825 31988.565 0.0 0.00 0 -34.0 0.0 10.000000 20.0
100003 -606 -606 -606.000000 NaN 1216.0 1216.0 1216.000000 -43.000000 0 0.0 ... 0.0 810000.0000 810000.000 NaN NaN 0 NaN NaN NaN 0.0
100005 -137 -62 -99.500000 2812.500000 122.0 1324.0 723.000000 -21.000000 0 0.0 ... 0.0 0.0000 0.000 4261.5 2130.75 0 -4.0 0.0 4.000000 8.0
100008 -78 -78 -78.000000 NaN 471.0 471.0 471.000000 -16.000000 0 0.0 ... 0.0 0.0000 0.000 NaN NaN 0 NaN NaN NaN 0.0

5 rows × 27 columns

(251815, 27)
# 8.1.2 Rename multi-indexed columns
active_agg.columns = pd.Index(['ACTIVE_' + e[0] + "_" + e[1].upper() for e in active_agg.columns.tolist()])
active_agg.columns
Index(['ACTIVE_DAYS_CREDIT_MIN', 'ACTIVE_DAYS_CREDIT_MAX',
       'ACTIVE_DAYS_CREDIT_MEAN', 'ACTIVE_DAYS_CREDIT_VAR',
       'ACTIVE_DAYS_CREDIT_ENDDATE_MIN', 'ACTIVE_DAYS_CREDIT_ENDDATE_MAX',
       'ACTIVE_DAYS_CREDIT_ENDDATE_MEAN', 'ACTIVE_DAYS_CREDIT_UPDATE_MEAN',
       'ACTIVE_CREDIT_DAY_OVERDUE_MAX', 'ACTIVE_CREDIT_DAY_OVERDUE_MEAN',
       'ACTIVE_AMT_CREDIT_MAX_OVERDUE_MEAN', 'ACTIVE_AMT_CREDIT_SUM_MAX',
       'ACTIVE_AMT_CREDIT_SUM_MEAN', 'ACTIVE_AMT_CREDIT_SUM_SUM',
       'ACTIVE_AMT_CREDIT_SUM_DEBT_MAX', 'ACTIVE_AMT_CREDIT_SUM_DEBT_MEAN',
       'ACTIVE_AMT_CREDIT_SUM_DEBT_SUM', 'ACTIVE_AMT_CREDIT_SUM_OVERDUE_MEAN',
       'ACTIVE_AMT_CREDIT_SUM_LIMIT_MEAN', 'ACTIVE_AMT_CREDIT_SUM_LIMIT_SUM',
       'ACTIVE_AMT_ANNUITY_MAX', 'ACTIVE_AMT_ANNUITY_MEAN',
       'ACTIVE_CNT_CREDIT_PROLONG_SUM', 'ACTIVE_MONTHS_BALANCE_MIN_MIN',
       'ACTIVE_MONTHS_BALANCE_MAX_MAX', 'ACTIVE_MONTHS_BALANCE_SIZE_MEAN',
       'ACTIVE_MONTHS_BALANCE_SIZE_SUM'],
      dtype='object')
# 9.0 Difference between length of two datasets
active_agg_set = set(active_agg.index)
bureau_agg_set = set(bureau_agg.index)
len(bureau_agg_set)     # 305811
len(active_agg_set)     # 251815
list(bureau_agg_set - active_agg_set)[:4]   # Few examples: {131074, 393220, 262149, 262153]
305811
251815
[131074, 393220, 262149, 262153]
# 9.1 Merge bureau_agg with active_agg over 'SK_ID_CURR'
bureau_agg = bureau_agg.join(
                             active_agg,
                             how='left',
                             on='SK_ID_CURR'
                             )
# 9.2 
bureau_agg.shape    # (305811, 89)
(305811, 89)
# 9.3 Obviouly some rows will hold NaN values for merged columns
bureau_agg.loc[[131074,393220,262149, 262153]]
BURO_DAYS_CREDIT_MIN BURO_DAYS_CREDIT_MAX BURO_DAYS_CREDIT_MEAN BURO_DAYS_CREDIT_VAR BURO_DAYS_CREDIT_ENDDATE_MIN BURO_DAYS_CREDIT_ENDDATE_MAX BURO_DAYS_CREDIT_ENDDATE_MEAN BURO_DAYS_CREDIT_UPDATE_MEAN BURO_CREDIT_DAY_OVERDUE_MAX BURO_CREDIT_DAY_OVERDUE_MEAN ... ACTIVE_AMT_CREDIT_SUM_OVERDUE_MEAN ACTIVE_AMT_CREDIT_SUM_LIMIT_MEAN ACTIVE_AMT_CREDIT_SUM_LIMIT_SUM ACTIVE_AMT_ANNUITY_MAX ACTIVE_AMT_ANNUITY_MEAN ACTIVE_CNT_CREDIT_PROLONG_SUM ACTIVE_MONTHS_BALANCE_MIN_MIN ACTIVE_MONTHS_BALANCE_MAX_MAX ACTIVE_MONTHS_BALANCE_SIZE_MEAN ACTIVE_MONTHS_BALANCE_SIZE_SUM
SK_ID_CURR
131074 -2419 -2419 -2419.0 NaN -1301.0 -1301.0 -1301.0 -1707.0 0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
393220 -478 -478 -478.0 NaN NaN NaN NaN -288.0 0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
262149 -373 -373 -373.0 NaN -251.0 -251.0 -251.0 -243.0 0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
262153 -1677 -1677 -1677.0 NaN -1589.0 -1589.0 -1589.0 -1012.0 0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

4 rows × 89 columns

# 9.4 Release memory
del active, active_agg
gc.collect()
182
# 10.0 Same steps for the  CREDIT_ACTIVE_Closed =1 cases
#     Bureau: Closed credits - using only numerical aggregations
closed = bureau[bureau['CREDIT_ACTIVE_Closed'] == 1]
closed_agg = closed.groupby('SK_ID_CURR').agg(num_aggregations)
closed_agg.columns = pd.Index(['CLOSED_' + e[0] + "_" + e[1].upper() for e in closed_agg.columns.tolist()])
bureau_agg = bureau_agg.join(closed_agg, how='left', on='SK_ID_CURR')
# 10.1
bureau_agg.shape   # (305811, 116)
(305811, 116)
# 10.2
del closed, closed_agg, bureau
gc.collect()
68
# 10.3 SK_ID_CURR is index. Index is also saved by-default.
bureau_agg.to_csv("processed_bureau_agg.csv.zip", compression = "zip")
##################