Credit Card balance data¶
About the Dataset¶
credit_card_balance: It is monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
Feature Explanations¶
SK_ID_PREV : ID of previous credit in Home credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit)
SK_ID_CURR: ID of loan in our sample
MONTHS_BALANCE: Month of balance relative to application date (-1 means the freshest balance date)
AMT_BALANCE: Balance during the month of previous credit
AMT_CREDIT_LIMIT_ACTUAL: Credit card limit during the month of the previous credit
AMT_DRAWINGS_ATM_CURRENT: Amount drawing at ATM during the month of the previous credit
AMT_DRAWINGS_CURRENT: Amount drawing during the month of the previous credit
AMT_DRAWINGS_OTHER_CURRENT: Amount of other drawings during the month of the previous credit
AMT_DRAWINGS_POS_CURRENT: Amount drawing or buying goods during the month of the previous credit
AMT_INST_MIN_REGULARITY: Minimal installment for this month of the previous credit
AMT_PAYMENT_CURRENT: How much did the client pay during the month on the previous credit
AMT_PAYMENT_TOTAL_CURRENT: How much did the client pay during the month in total on the previous credit
AMT_RECEIVABLE_PRINCIPAL: Amount receivable for principal on the previous credit
AMT_RECIVABLE: Amount receivable on the previous credit
AMT_TOTAL_RECEIVABLE: Total amount receivable on the previous credit
CNT_DRAWINGS_ATM_CURRENT: Number of drawings at ATM during this month on the previous credit
CNT_DRAWINGS_CURRENT: Number of drawings during this month on the previous credit
CNT_DRAWINGS_OTHER_CURRENT: Number of other drawings during this month on the previous credit
CNT_DRAWINGS_POS_CURRENT: Number of drawings for goods during this month on the previous credit
CNT_INSTALMENT_MATURE_CUM: Number of paid installments on the previous credit
NAME_CONTRACT_STATUS: Contract status (active signed,...) on the previous credit
SK_DPD: DPD (Days past due) during the month on the previous credit
SK_DPD_DEF: DPD (Days past due) during the month with tolerance (debts with low loan amounts are ignored) of the previous credit
# Last amended: 23rd October, 2020
# Myfolder: C:\Users\Administrator\OneDrive\Documents\home_credit_default_risk
# Objective:
# Solving Kaggle problem: Home Credit Default Risk
# Processing credit_card_balance dataset
#
# Data Source: https://www.kaggle.com/c/home-credit-default-risk/data
# Ref: https://www.kaggle.com/jsaguiar/lightgbm-with-simple-features
# 1.0 Libraries
# (Some of these may not be needed here.)
%reset -f
import numpy as np
import pandas as pd
import gc
# 1.1 Reduce read data size
# There is a file reducing.py
# in this folder. A class
# in it is used to reduce
# dataframe size
# (Code modified by me to
# exclude 'category' dtype)
import reducing
# 1.2 Misc
import warnings
import os
warnings.simplefilter(action='ignore', category=FutureWarning)
# 1.3
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
# 1.4 Display multiple commands outputs from a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# 2.0 Onehot encoding (OHE) function. Uses pd.get_dummies()
# i) To transform 'object' columns to dummies.
# ii) Treat NaN as one of the categories
# iii) Returns transformed-data and new-columns created
def one_hot_encoder(df, nan_as_category = True):
original_columns = list(df.columns)
categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
df = pd.get_dummies(df,
columns= categorical_columns,
dummy_na= nan_as_category # Treat NaNs as category
)
new_columns = [c for c in df.columns if c not in original_columns]
return df, new_columns
# 2.1
pathToFolder = "C:\\Users\\Administrator\\OneDrive\\Documents\\home_credit_default_risk"
os.chdir(pathToFolder)
# 2.2 Some constants
num_rows=None # Implies read all rows
nan_as_category = True # While transforming
# 'object' columns to dummies
About the Dataset¶
credit_card_balance: It is monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
# 2.3 Read the data
cc = pd.read_csv(
'credit_card_balance.csv.zip',
nrows = num_rows
)
# 2.4
cc.shape # (rows = 38,40,312, columns = 23)
cc.head()
(3840312, 23)
SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | AMT_PAYMENT_CURRENT | AMT_PAYMENT_TOTAL_CURRENT | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2562384 | 378907 | -6 | 56.970 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.325 | 1800.0 | 1800.0 | 0.000 | 0.000 | 0.000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 |
1 | 2582071 | 363914 | -1 | 63975.555 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000 | 2250.0 | 2250.0 | 60175.080 | 64875.555 | 64875.555 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 |
2 | 1740877 | 371185 | -7 | 31815.225 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000 | 2250.0 | 2250.0 | 26926.425 | 31460.085 | 31460.085 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 |
3 | 1389973 | 337855 | -4 | 236572.110 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.760 | 11925.0 | 11925.0 | 224949.285 | 233048.970 | 233048.970 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 |
4 | 1891521 | 126868 | -1 | 453919.455 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890 | 27000.0 | 27000.0 | 443044.395 | 453919.455 | 453919.455 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 |
Feature explanations:¶
MONTHS_BALANCE: Month of balance relative to application date (-1 means the freshest balance date)
AMT_BALANCE: Balance during the month of previous credit
AMT_CREDIT_LIMIT_ACTUAL: Credit card limit during the month of the previous credit
AMT_DRAWINGS_ATM_CURRENT: Amount drawing at ATM during the month of the previous credit
AMT_DRAWINGS_CURRENT: Amount drawing during the month of the previous credit
AMT_DRAWINGS_OTHER_CURRENT: Amount of other drawings during the month of the previous credit
AMT_DRAWINGS_POS_CURRENT: Amount drawing or buying goods during the month of the previous credit
AMT_INST_MIN_REGULARITY: Minimal installment for this month of the previous credit
AMT_PAYMENT_CURRENT: How much did the client pay during the month on the previous credit
AMT_PAYMENT_TOTAL_CURRENT: How much did the client pay during the month in total on the previous credit
AMT_RECEIVABLE_PRINCIPAL: Amount receivable for principal on the previous credit
AMT_RECIVABLE: Amount receivable on the previous credit
AMT_TOTAL_RECEIVABLE: Total amount receivable on the previous credit
CNT_DRAWINGS_ATM_CURRENT: Number of drawings at ATM during this month on the previous credit
CNT_DRAWINGS_CURRENT: Number of drawings during this month on the previous credit
CNT_DRAWINGS_OTHER_CURRENT: Number of other drawings during this month on the previous credit
CNT_DRAWINGS_POS_CURRENT: Number of drawings for goods during this month on the previous credit
CNT_INSTALMENT_MATURE_CUM: Number of paid installments on the previous credit
NAME_CONTRACT_STATUS: Contract status (active signed,…) on the previous credit
SK_DPD: DPD (Days past due) during the month on the previous credit
SK_DPD_DEF: DPD (Days past due) during the month with tolerance (debts with low loan amounts are ignored) of the previous credit
# 2.5 There is one 'object' feature
cc.dtypes.value_counts()
float64 15
int64 7
object 1
dtype: int64
# 2.6 Transform the 'object' feature to OHE
cc, cat_cols = one_hot_encoder(cc, nan_as_category= True)
# 2.7
cc.shape # (3840312, 30)
cat_cols # Even NaN is a feature
(3840312, 30)
['NAME_CONTRACT_STATUS_Active',
'NAME_CONTRACT_STATUS_Approved',
'NAME_CONTRACT_STATUS_Completed',
'NAME_CONTRACT_STATUS_Demand',
'NAME_CONTRACT_STATUS_Refused',
'NAME_CONTRACT_STATUS_Sent proposal',
'NAME_CONTRACT_STATUS_Signed',
'NAME_CONTRACT_STATUS_nan']
# 2.8 Drop this unique ID. We do not need it
cc.drop(['SK_ID_PREV'], axis= 1, inplace = True)
# 3.0 Aggregate all features over SK_ID_CURR.
cc_agg = cc.groupby('SK_ID_CURR').agg(['min', 'max', 'mean', 'sum', 'var'])
# 3.1
cc_agg.shape # (103558, 140)
cc_agg.head() # It has multi-index feature
(103558, 140)
MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | AMT_PAYMENT_CURRENT | AMT_PAYMENT_TOTAL_CURRENT | ... | SK_DPD | SK_DPD_DEF | NAME_CONTRACT_STATUS_Active | NAME_CONTRACT_STATUS_Approved | NAME_CONTRACT_STATUS_Completed | NAME_CONTRACT_STATUS_Demand | NAME_CONTRACT_STATUS_Refused | NAME_CONTRACT_STATUS_Sent proposal | NAME_CONTRACT_STATUS_Signed | NAME_CONTRACT_STATUS_nan | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
min | max | mean | sum | var | min | max | mean | sum | var | min | max | mean | sum | var | min | max | mean | sum | var | min | max | mean | sum | var | min | max | mean | sum | var | min | max | mean | sum | var | min | max | mean | sum | var | min | max | mean | sum | var | min | max | mean | sum | var | ... | min | max | mean | sum | var | min | max | mean | sum | var | min | max | mean | sum | var | min | max | mean | sum | var | min | max | mean | sum | var | min | max | mean | sum | var | min | max | mean | sum | var | min | max | mean | sum | var | min | max | mean | sum | var | min | max | mean | sum | var | |
SK_ID_CURR | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
100006 | -6 | -1 | -3.5 | -21 | 3.5 | 0.0 | 0.00 | 0.000000 | 0.000 | 0.000000e+00 | 270000 | 270000 | 270000.000000 | 1620000 | 0.000000e+00 | NaN | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000e+00 | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.000000 | 0.000 | 0.000000e+00 | NaN | NaN | NaN | 0.00 | NaN | 0.0 | 0.0 | 0.000000 | 0.000 | 0.000000e+00 | ... | 0 | 0 | 0.000000 | 0 | 0.000000 | 0 | 0 | 0.000000 | 0 | 0.000000 | 1 | 1 | 1.000000 | 6 | 0.000000 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.000000 | 0 | 0.000000 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0.0 |
100011 | -75 | -2 | -38.5 | -2849 | 462.5 | 0.0 | 189000.00 | 54482.111149 | 4031676.225 | 4.641321e+09 | 90000 | 180000 | 164189.189189 | 12150000 | 1.189060e+09 | 0.0 | 180000.0 | 2432.432432 | 180000.0 | 4.378378e+08 | 0.0 | 180000.0 | 2432.432432 | 180000.0 | 4.378378e+08 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9000.0 | 3956.221849 | 288804.195 | 2.013991e+07 | 0.0 | 55485.0 | 4843.064189 | 358386.75 | 5.299260e+07 | 0.0 | 55485.0 | 4520.067568 | 334485.000 | 5.585877e+07 | ... | 0 | 0 | 0.000000 | 0 | 0.000000 | 0 | 0 | 0.000000 | 0 | 0.000000 | 1 | 1 | 1.000000 | 74 | 0.000000 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.000000 | 0 | 0.000000 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0.0 |
100013 | -96 | -1 | -48.5 | -4656 | 776.0 | 0.0 | 161420.22 | 18159.919219 | 1743352.245 | 1.869473e+09 | 45000 | 157500 | 131718.750000 | 12645000 | 2.259252e+09 | 0.0 | 157500.0 | 6350.000000 | 571500.0 | 8.249688e+08 | 0.0 | 157500.0 | 5953.125000 | 571500.0 | 7.752530e+08 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7875.0 | 1454.539551 | 129454.020 | 9.171263e+06 | 0.0 | 153675.0 | 7168.346250 | 688161.24 | 4.676901e+08 | 0.0 | 153675.0 | 6817.172344 | 654448.545 | 4.722214e+08 | ... | 0 | 1 | 0.010417 | 1 | 0.010417 | 0 | 1 | 0.010417 | 1 | 0.010417 | 1 | 1 | 1.000000 | 96 | 0.000000 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.000000 | 0 | 0.000000 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0.0 |
100021 | -18 | -2 | -10.0 | -170 | 25.5 | 0.0 | 0.00 | 0.000000 | 0.000 | 0.000000e+00 | 675000 | 675000 | 675000.000000 | 11475000 | 0.000000e+00 | NaN | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000e+00 | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.000000 | 0.000 | 0.000000e+00 | NaN | NaN | NaN | 0.00 | NaN | 0.0 | 0.0 | 0.000000 | 0.000 | 0.000000e+00 | ... | 0 | 0 | 0.000000 | 0 | 0.000000 | 0 | 0 | 0.000000 | 0 | 0.000000 | 0 | 1 | 0.411765 | 7 | 0.257353 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 1 | 0.588235 | 10 | 0.257353 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0.0 |
100023 | -11 | -4 | -7.5 | -60 | 6.0 | 0.0 | 0.00 | 0.000000 | 0.000 | 0.000000e+00 | 45000 | 225000 | 135000.000000 | 1080000 | 9.257143e+09 | NaN | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000e+00 | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.000000 | 0.000 | 0.000000e+00 | NaN | NaN | NaN | 0.00 | NaN | 0.0 | 0.0 | 0.000000 | 0.000 | 0.000000e+00 | ... | 0 | 0 | 0.000000 | 0 | 0.000000 | 0 | 0 | 0.000000 | 0 | 0.000000 | 1 | 1 | 1.000000 | 8 | 0.000000 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.000000 | 0 | 0.000000 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0.0 |
5 rows × 140 columns
# 3.2 Change column names
cc_agg.columns = pd.Index(['CC_' + e[0] + "_" + e[1].upper() for e in cc_agg.columns.tolist()])
# 3.3 Create another feature
# For each client, how many observations
# exist in this dataset
cc_agg['CC_COUNT'] = cc.groupby('SK_ID_CURR').size()
# 3.3.1
cc_agg['CC_COUNT'].head()
SK_ID_CURR
100006 6
100011 74
100013 96
100021 17
100023 8
Name: CC_COUNT, dtype: int64
# 4.0 Save the results for subsequent use:
cc_agg.to_csv("processed_creditCard_agg.csv.zip", compression = "zip")
################