Credit Card balance data

About the Dataset

credit_card_balance: It is monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.

Feature Explanations

SK_ID_PREV : ID of previous credit in Home credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit)
SK_ID_CURR: ID of loan in our sample
MONTHS_BALANCE: Month of balance relative to application date (-1 means the freshest balance date)
AMT_BALANCE: Balance during the month of previous credit
AMT_CREDIT_LIMIT_ACTUAL: Credit card limit during the month of the previous credit
AMT_DRAWINGS_ATM_CURRENT: Amount drawing at ATM during the month of the previous credit
AMT_DRAWINGS_CURRENT: Amount drawing during the month of the previous credit
AMT_DRAWINGS_OTHER_CURRENT: Amount of other drawings during the month of the previous credit
AMT_DRAWINGS_POS_CURRENT: Amount drawing or buying goods during the month of the previous credit
AMT_INST_MIN_REGULARITY: Minimal installment for this month of the previous credit
AMT_PAYMENT_CURRENT: How much did the client pay during the month on the previous credit
AMT_PAYMENT_TOTAL_CURRENT: How much did the client pay during the month in total on the previous credit
AMT_RECEIVABLE_PRINCIPAL: Amount receivable for principal on the previous credit
AMT_RECIVABLE: Amount receivable on the previous credit
AMT_TOTAL_RECEIVABLE: Total amount receivable on the previous credit
CNT_DRAWINGS_ATM_CURRENT: Number of drawings at ATM during this month on the previous credit
CNT_DRAWINGS_CURRENT: Number of drawings during this month on the previous credit
CNT_DRAWINGS_OTHER_CURRENT: Number of other drawings during this month on the previous credit
CNT_DRAWINGS_POS_CURRENT: Number of drawings for goods during this month on the previous credit
CNT_INSTALMENT_MATURE_CUM: Number of paid installments on the previous credit
NAME_CONTRACT_STATUS: Contract status (active signed,...) on the previous credit
SK_DPD: DPD (Days past due) during the month on the previous credit
SK_DPD_DEF: DPD (Days past due) during the month with tolerance (debts with low loan amounts are ignored) of the previous credit

# Last amended: 23rd October, 2020
# Myfolder: C:\Users\Administrator\OneDrive\Documents\home_credit_default_risk
# Objective: 
#           Solving Kaggle problem: Home Credit Default Risk
#           Processing credit_card_balance dataset
#
# Data Source: https://www.kaggle.com/c/home-credit-default-risk/data
# Ref: https://www.kaggle.com/jsaguiar/lightgbm-with-simple-features
# 1.0 Libraries
#     (Some of these may not be needed here.)
%reset -f
import numpy as np
import pandas as pd
import gc

# 1.1 Reduce read data size
#     There is a file reducing.py
#      in this folder. A class
#       in it is used to reduce
#        dataframe size
#     (Code modified by me to
#      exclude 'category' dtype)
import reducing

# 1.2 Misc
import warnings
import os
warnings.simplefilter(action='ignore', category=FutureWarning)
# 1.3
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
# 1.4 Display multiple commands outputs from a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# 2.0 Onehot encoding (OHE) function. Uses pd.get_dummies()
#     i) To transform 'object' columns to dummies. 
#    ii) Treat NaN as one of the categories
#   iii) Returns transformed-data and new-columns created

def one_hot_encoder(df, nan_as_category = True):
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df,
                        columns= categorical_columns,
                        dummy_na= nan_as_category       # Treat NaNs as category
                       )
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns
# 2.1
pathToFolder = "C:\\Users\\Administrator\\OneDrive\\Documents\\home_credit_default_risk"
os.chdir(pathToFolder)
# 2.2 Some constants
num_rows=None                # Implies read all rows
nan_as_category = True       # While transforming 
                             #   'object' columns to dummies

About the Dataset

credit_card_balance: It is monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.

# 2.3 Read the data
cc = pd.read_csv(
                'credit_card_balance.csv.zip',
                 nrows = num_rows
                )
# 2.4
cc.shape      # (rows = 38,40,312, columns = 23)
cc.head()
(3840312, 23)
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE AMT_BALANCE AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY AMT_PAYMENT_CURRENT AMT_PAYMENT_TOTAL_CURRENT AMT_RECEIVABLE_PRINCIPAL AMT_RECIVABLE AMT_TOTAL_RECEIVABLE CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT CNT_DRAWINGS_OTHER_CURRENT CNT_DRAWINGS_POS_CURRENT CNT_INSTALMENT_MATURE_CUM NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 2562384 378907 -6 56.970 135000 0.0 877.5 0.0 877.5 1700.325 1800.0 1800.0 0.000 0.000 0.000 0.0 1 0.0 1.0 35.0 Active 0 0
1 2582071 363914 -1 63975.555 45000 2250.0 2250.0 0.0 0.0 2250.000 2250.0 2250.0 60175.080 64875.555 64875.555 1.0 1 0.0 0.0 69.0 Active 0 0
2 1740877 371185 -7 31815.225 450000 0.0 0.0 0.0 0.0 2250.000 2250.0 2250.0 26926.425 31460.085 31460.085 0.0 0 0.0 0.0 30.0 Active 0 0
3 1389973 337855 -4 236572.110 225000 2250.0 2250.0 0.0 0.0 11795.760 11925.0 11925.0 224949.285 233048.970 233048.970 1.0 1 0.0 0.0 10.0 Active 0 0
4 1891521 126868 -1 453919.455 450000 0.0 11547.0 0.0 11547.0 22924.890 27000.0 27000.0 443044.395 453919.455 453919.455 0.0 1 0.0 1.0 101.0 Active 0 0

Feature explanations:

MONTHS_BALANCE: Month of balance relative to application date (-1 means the freshest balance date)
AMT_BALANCE: Balance during the month of previous credit
AMT_CREDIT_LIMIT_ACTUAL: Credit card limit during the month of the previous credit
AMT_DRAWINGS_ATM_CURRENT: Amount drawing at ATM during the month of the previous credit
AMT_DRAWINGS_CURRENT: Amount drawing during the month of the previous credit
AMT_DRAWINGS_OTHER_CURRENT: Amount of other drawings during the month of the previous credit
AMT_DRAWINGS_POS_CURRENT: Amount drawing or buying goods during the month of the previous credit
AMT_INST_MIN_REGULARITY: Minimal installment for this month of the previous credit
AMT_PAYMENT_CURRENT: How much did the client pay during the month on the previous credit
AMT_PAYMENT_TOTAL_CURRENT: How much did the client pay during the month in total on the previous credit
AMT_RECEIVABLE_PRINCIPAL: Amount receivable for principal on the previous credit
AMT_RECIVABLE: Amount receivable on the previous credit
AMT_TOTAL_RECEIVABLE: Total amount receivable on the previous credit
CNT_DRAWINGS_ATM_CURRENT: Number of drawings at ATM during this month on the previous credit
CNT_DRAWINGS_CURRENT: Number of drawings during this month on the previous credit
CNT_DRAWINGS_OTHER_CURRENT: Number of other drawings during this month on the previous credit
CNT_DRAWINGS_POS_CURRENT: Number of drawings for goods during this month on the previous credit
CNT_INSTALMENT_MATURE_CUM: Number of paid installments on the previous credit
NAME_CONTRACT_STATUS: Contract status (active signed,…) on the previous credit
SK_DPD: DPD (Days past due) during the month on the previous credit
SK_DPD_DEF: DPD (Days past due) during the month with tolerance (debts with low loan amounts are ignored) of the previous credit

# 2.5 There is one 'object' feature
cc.dtypes.value_counts()
float64    15
int64      7 
object     1 
dtype: int64
# 2.6 Transform the 'object' feature to OHE
cc, cat_cols = one_hot_encoder(cc, nan_as_category= True)
# 2.7
cc.shape   # (3840312, 30)
cat_cols   # Even NaN is a feature
(3840312, 30)
['NAME_CONTRACT_STATUS_Active',
 'NAME_CONTRACT_STATUS_Approved',
 'NAME_CONTRACT_STATUS_Completed',
 'NAME_CONTRACT_STATUS_Demand',
 'NAME_CONTRACT_STATUS_Refused',
 'NAME_CONTRACT_STATUS_Sent proposal',
 'NAME_CONTRACT_STATUS_Signed',
 'NAME_CONTRACT_STATUS_nan']
# 2.8 Drop this unique ID. We do not need it
cc.drop(['SK_ID_PREV'], axis= 1, inplace = True)
# 3.0 Aggregate all features over SK_ID_CURR.
cc_agg = cc.groupby('SK_ID_CURR').agg(['min', 'max', 'mean', 'sum', 'var'])
# 3.1
cc_agg.shape     # (103558, 140)
cc_agg.head()    # It has multi-index feature
(103558, 140)
MONTHS_BALANCE AMT_BALANCE AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY AMT_PAYMENT_CURRENT AMT_PAYMENT_TOTAL_CURRENT ... SK_DPD SK_DPD_DEF NAME_CONTRACT_STATUS_Active NAME_CONTRACT_STATUS_Approved NAME_CONTRACT_STATUS_Completed NAME_CONTRACT_STATUS_Demand NAME_CONTRACT_STATUS_Refused NAME_CONTRACT_STATUS_Sent proposal NAME_CONTRACT_STATUS_Signed NAME_CONTRACT_STATUS_nan
min max mean sum var min max mean sum var min max mean sum var min max mean sum var min max mean sum var min max mean sum var min max mean sum var min max mean sum var min max mean sum var min max mean sum var ... min max mean sum var min max mean sum var min max mean sum var min max mean sum var min max mean sum var min max mean sum var min max mean sum var min max mean sum var min max mean sum var min max mean sum var
SK_ID_CURR
100006 -6 -1 -3.5 -21 3.5 0.0 0.00 0.000000 0.000 0.000000e+00 270000 270000 270000.000000 1620000 0.000000e+00 NaN NaN NaN 0.0 NaN 0.0 0.0 0.000000 0.0 0.000000e+00 NaN NaN NaN 0.0 NaN NaN NaN NaN 0.0 NaN 0.0 0.0 0.000000 0.000 0.000000e+00 NaN NaN NaN 0.00 NaN 0.0 0.0 0.000000 0.000 0.000000e+00 ... 0 0 0.000000 0 0.000000 0 0 0.000000 0 0.000000 1 1 1.000000 6 0.000000 0 0 0.0 0 0.0 0 0 0.000000 0 0.000000 0 0 0.0 0 0.0 0 0 0.0 0 0.0 0 0 0.0 0 0.0 0 0 0.0 0 0.0 0 0 0 0 0.0
100011 -75 -2 -38.5 -2849 462.5 0.0 189000.00 54482.111149 4031676.225 4.641321e+09 90000 180000 164189.189189 12150000 1.189060e+09 0.0 180000.0 2432.432432 180000.0 4.378378e+08 0.0 180000.0 2432.432432 180000.0 4.378378e+08 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 9000.0 3956.221849 288804.195 2.013991e+07 0.0 55485.0 4843.064189 358386.75 5.299260e+07 0.0 55485.0 4520.067568 334485.000 5.585877e+07 ... 0 0 0.000000 0 0.000000 0 0 0.000000 0 0.000000 1 1 1.000000 74 0.000000 0 0 0.0 0 0.0 0 0 0.000000 0 0.000000 0 0 0.0 0 0.0 0 0 0.0 0 0.0 0 0 0.0 0 0.0 0 0 0.0 0 0.0 0 0 0 0 0.0
100013 -96 -1 -48.5 -4656 776.0 0.0 161420.22 18159.919219 1743352.245 1.869473e+09 45000 157500 131718.750000 12645000 2.259252e+09 0.0 157500.0 6350.000000 571500.0 8.249688e+08 0.0 157500.0 5953.125000 571500.0 7.752530e+08 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 7875.0 1454.539551 129454.020 9.171263e+06 0.0 153675.0 7168.346250 688161.24 4.676901e+08 0.0 153675.0 6817.172344 654448.545 4.722214e+08 ... 0 1 0.010417 1 0.010417 0 1 0.010417 1 0.010417 1 1 1.000000 96 0.000000 0 0 0.0 0 0.0 0 0 0.000000 0 0.000000 0 0 0.0 0 0.0 0 0 0.0 0 0.0 0 0 0.0 0 0.0 0 0 0.0 0 0.0 0 0 0 0 0.0
100021 -18 -2 -10.0 -170 25.5 0.0 0.00 0.000000 0.000 0.000000e+00 675000 675000 675000.000000 11475000 0.000000e+00 NaN NaN NaN 0.0 NaN 0.0 0.0 0.000000 0.0 0.000000e+00 NaN NaN NaN 0.0 NaN NaN NaN NaN 0.0 NaN 0.0 0.0 0.000000 0.000 0.000000e+00 NaN NaN NaN 0.00 NaN 0.0 0.0 0.000000 0.000 0.000000e+00 ... 0 0 0.000000 0 0.000000 0 0 0.000000 0 0.000000 0 1 0.411765 7 0.257353 0 0 0.0 0 0.0 0 1 0.588235 10 0.257353 0 0 0.0 0 0.0 0 0 0.0 0 0.0 0 0 0.0 0 0.0 0 0 0.0 0 0.0 0 0 0 0 0.0
100023 -11 -4 -7.5 -60 6.0 0.0 0.00 0.000000 0.000 0.000000e+00 45000 225000 135000.000000 1080000 9.257143e+09 NaN NaN NaN 0.0 NaN 0.0 0.0 0.000000 0.0 0.000000e+00 NaN NaN NaN 0.0 NaN NaN NaN NaN 0.0 NaN 0.0 0.0 0.000000 0.000 0.000000e+00 NaN NaN NaN 0.00 NaN 0.0 0.0 0.000000 0.000 0.000000e+00 ... 0 0 0.000000 0 0.000000 0 0 0.000000 0 0.000000 1 1 1.000000 8 0.000000 0 0 0.0 0 0.0 0 0 0.000000 0 0.000000 0 0 0.0 0 0.0 0 0 0.0 0 0.0 0 0 0.0 0 0.0 0 0 0.0 0 0.0 0 0 0 0 0.0

5 rows × 140 columns

# 3.2 Change column names
cc_agg.columns = pd.Index(['CC_' + e[0] + "_" + e[1].upper() for e in cc_agg.columns.tolist()])
# 3.3 Create another feature
#     For each client, how many observations
#     exist in this dataset

cc_agg['CC_COUNT'] = cc.groupby('SK_ID_CURR').size()
# 3.3.1
cc_agg['CC_COUNT'].head()
SK_ID_CURR
100006    6 
100011    74
100013    96
100021    17
100023    8 
Name: CC_COUNT, dtype: int64
# 4.0 Save the results for subsequent use:
cc_agg.to_csv("processed_creditCard_agg.csv.zip", compression = "zip")
################