Credit Card balance data¶

About the Dataset¶

credit_card_balance: It is monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.

Feature Explanations¶

SK_ID_PREV : ID of previous credit in Home credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit)
SK_ID_CURR: ID of loan in our sample
MONTHS_BALANCE: Month of balance relative to application date (-1 means the freshest balance date)
AMT_BALANCE: Balance during the month of previous credit
AMT_CREDIT_LIMIT_ACTUAL: Credit card limit during the month of the previous credit
AMT_DRAWINGS_ATM_CURRENT: Amount drawing at ATM during the month of the previous credit
AMT_DRAWINGS_CURRENT: Amount drawing during the month of the previous credit
AMT_DRAWINGS_OTHER_CURRENT: Amount of other drawings during the month of the previous credit
AMT_DRAWINGS_POS_CURRENT: Amount drawing or buying goods during the month of the previous credit
AMT_INST_MIN_REGULARITY: Minimal installment for this month of the previous credit
AMT_PAYMENT_CURRENT: How much did the client pay during the month on the previous credit
AMT_PAYMENT_TOTAL_CURRENT: How much did the client pay during the month in total on the previous credit
AMT_RECEIVABLE_PRINCIPAL: Amount receivable for principal on the previous credit
AMT_RECIVABLE: Amount receivable on the previous credit
AMT_TOTAL_RECEIVABLE: Total amount receivable on the previous credit
CNT_DRAWINGS_ATM_CURRENT: Number of drawings at ATM during this month on the previous credit
CNT_DRAWINGS_CURRENT: Number of drawings during this month on the previous credit
CNT_DRAWINGS_OTHER_CURRENT: Number of other drawings during this month on the previous credit
CNT_DRAWINGS_POS_CURRENT: Number of drawings for goods during this month on the previous credit
CNT_INSTALMENT_MATURE_CUM: Number of paid installments on the previous credit
NAME_CONTRACT_STATUS: Contract status (active signed,...) on the previous credit
SK_DPD: DPD (Days past due) during the month on the previous credit
SK_DPD_DEF: DPD (Days past due) during the month with tolerance (debts with low loan amounts are ignored) of the previous credit

# Last amended: 23rd October, 2020
# Myfolder: C:\Users\Administrator\OneDrive\Documents\home_credit_default_risk
# Objective: 
#           Solving Kaggle problem: Home Credit Default Risk
#           Processing credit_card_balance dataset
#
# Data Source: https://www.kaggle.com/c/home-credit-default-risk/data
# Ref: https://www.kaggle.com/jsaguiar/lightgbm-with-simple-features

# 1.0 Libraries
#     (Some of these may not be needed here.)
%reset -f
import numpy as np
import pandas as pd
import gc

# 1.1 Reduce read data size
#     There is a file reducing.py
#      in this folder. A class
#       in it is used to reduce
#        dataframe size
#     (Code modified by me to
#      exclude 'category' dtype)
import reducing

# 1.2 Misc
import warnings
import os
warnings.simplefilter(action='ignore', category=FutureWarning)

# 1.3
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

# 1.4 Display multiple commands outputs from a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# 2.0 Onehot encoding (OHE) function. Uses pd.get_dummies()
#     i) To transform 'object' columns to dummies. 
#    ii) Treat NaN as one of the categories
#   iii) Returns transformed-data and new-columns created

def one_hot_encoder(df, nan_as_category = True):
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df,
                        columns= categorical_columns,
                        dummy_na= nan_as_category       # Treat NaNs as category
                       )
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns

# 2.1
pathToFolder = "C:\\Users\\Administrator\\OneDrive\\Documents\\home_credit_default_risk"
os.chdir(pathToFolder)

# 2.2 Some constants
num_rows=None                # Implies read all rows
nan_as_category = True       # While transforming 
                             #   'object' columns to dummies

About the Dataset¶

credit_card_balance: It is monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.

# 2.3 Read the data
cc = pd.read_csv(
                'credit_card_balance.csv.zip',
                 nrows = num_rows
                )

# 2.4
cc.shape      # (rows = 38,40,312, columns = 23)
cc.head()

(3840312, 23)

	SK_ID_PREV	SK_ID_CURR	MONTHS_BALANCE	AMT_BALANCE	AMT_CREDIT_LIMIT_ACTUAL	AMT_DRAWINGS_ATM_CURRENT	AMT_DRAWINGS_CURRENT	AMT_DRAWINGS_POS_CURRENT	AMT_INST_MIN_REGULARITY	AMT_PAYMENT_CURRENT	AMT_PAYMENT_TOTAL_CURRENT	AMT_RECEIVABLE_PRINCIPAL	AMT_RECIVABLE	AMT_TOTAL_RECEIVABLE	CNT_DRAWINGS_ATM_CURRENT	CNT_DRAWINGS_CURRENT	CNT_DRAWINGS_POS_CURRENT	CNT_INSTALMENT_MATURE_CUM	NAME_CONTRACT_STATUS
0	2562384	378907	-6	56.970	135000	0.0	877.5	877.5	1700.325	1800.0	1800.0	0.000	0.000	0.000	0.0	1	1.0	35.0	Active
1	2582071	363914	-1	63975.555	45000	2250.0	2250.0	0.0	2250.000	2250.0	2250.0	60175.080	64875.555	64875.555	1.0	1	0.0	69.0	Active
2	1740877	371185	-7	31815.225	450000	0.0	0.0	0.0	2250.000	2250.0	2250.0	26926.425	31460.085	31460.085	0.0	0	0.0	30.0	Active
3	1389973	337855	-4	236572.110	225000	2250.0	2250.0	0.0	11795.760	11925.0	11925.0	224949.285	233048.970	233048.970	1.0	1	0.0	10.0	Active
4	1891521	126868	-1	453919.455	450000	0.0	11547.0	11547.0	22924.890	27000.0	27000.0	443044.395	453919.455	453919.455	0.0	1	1.0	101.0	Active

Feature explanations:¶

MONTHS_BALANCE: Month of balance relative to application date (-1 means the freshest balance date)
AMT_BALANCE: Balance during the month of previous credit
AMT_CREDIT_LIMIT_ACTUAL: Credit card limit during the month of the previous credit
AMT_DRAWINGS_ATM_CURRENT: Amount drawing at ATM during the month of the previous credit
AMT_DRAWINGS_CURRENT: Amount drawing during the month of the previous credit
AMT_DRAWINGS_OTHER_CURRENT: Amount of other drawings during the month of the previous credit
AMT_DRAWINGS_POS_CURRENT: Amount drawing or buying goods during the month of the previous credit
AMT_INST_MIN_REGULARITY: Minimal installment for this month of the previous credit
AMT_PAYMENT_CURRENT: How much did the client pay during the month on the previous credit
AMT_PAYMENT_TOTAL_CURRENT: How much did the client pay during the month in total on the previous credit
AMT_RECEIVABLE_PRINCIPAL: Amount receivable for principal on the previous credit
AMT_RECIVABLE: Amount receivable on the previous credit
AMT_TOTAL_RECEIVABLE: Total amount receivable on the previous credit
CNT_DRAWINGS_ATM_CURRENT: Number of drawings at ATM during this month on the previous credit
CNT_DRAWINGS_CURRENT: Number of drawings during this month on the previous credit
CNT_DRAWINGS_OTHER_CURRENT: Number of other drawings during this month on the previous credit
CNT_DRAWINGS_POS_CURRENT: Number of drawings for goods during this month on the previous credit
CNT_INSTALMENT_MATURE_CUM: Number of paid installments on the previous credit
NAME_CONTRACT_STATUS: Contract status (active signed,…) on the previous credit
SK_DPD: DPD (Days past due) during the month on the previous credit
SK_DPD_DEF: DPD (Days past due) during the month with tolerance (debts with low loan amounts are ignored) of the previous credit

# 2.5 There is one 'object' feature
cc.dtypes.value_counts()

float64    15
int64      7 
object     1 
dtype: int64

# 2.6 Transform the 'object' feature to OHE
cc, cat_cols = one_hot_encoder(cc, nan_as_category= True)

# 2.7
cc.shape   # (3840312, 30)
cat_cols   # Even NaN is a feature

(3840312, 30)

['NAME_CONTRACT_STATUS_Active',
 'NAME_CONTRACT_STATUS_Approved',
 'NAME_CONTRACT_STATUS_Completed',
 'NAME_CONTRACT_STATUS_Demand',
 'NAME_CONTRACT_STATUS_Refused',
 'NAME_CONTRACT_STATUS_Sent proposal',
 'NAME_CONTRACT_STATUS_Signed',
 'NAME_CONTRACT_STATUS_nan']

# 2.8 Drop this unique ID. We do not need it
cc.drop(['SK_ID_PREV'], axis= 1, inplace = True)

# 3.0 Aggregate all features over SK_ID_CURR.
cc_agg = cc.groupby('SK_ID_CURR').agg(['min', 'max', 'mean', 'sum', 'var'])

# 3.1
cc_agg.shape     # (103558, 140)
cc_agg.head()    # It has multi-index feature

(103558, 140)

	MONTHS_BALANCE					AMT_BALANCE					AMT_CREDIT_LIMIT_ACTUAL					AMT_DRAWINGS_ATM_CURRENT					AMT_DRAWINGS_CURRENT					AMT_DRAWINGS_OTHER_CURRENT					AMT_DRAWINGS_POS_CURRENT					AMT_INST_MIN_REGULARITY					AMT_PAYMENT_CURRENT					AMT_PAYMENT_TOTAL_CURRENT					...	SK_DPD					SK_DPD_DEF					NAME_CONTRACT_STATUS_Active					NAME_CONTRACT_STATUS_Approved					NAME_CONTRACT_STATUS_Completed					NAME_CONTRACT_STATUS_Demand					NAME_CONTRACT_STATUS_Refused					NAME_CONTRACT_STATUS_Sent proposal					NAME_CONTRACT_STATUS_Signed					NAME_CONTRACT_STATUS_nan
	min	max	mean	sum	var	min	max	mean	sum	var	min	max	mean	sum	var	min	max	mean	sum	var	min	max	mean	sum	var	min	max	mean	sum	var	min	max	mean	sum	var	min	max	mean	sum	var	min	max	mean	sum	var	min	max	mean	sum	var	...	min	max	mean	sum	var	min	max	mean	sum	var	min	max	mean	sum	var	min	max	mean	sum	var	min	max	mean	sum	var	min	max	mean	sum	var	min	max	mean	sum	var	min	max	mean	sum	var	min	max	mean	sum	var	min	max	mean	sum	var
SK_ID_CURR
100006	-6	-1	-3.5	-21	3.5	0.0	0.00	0.000000	0.000	0.000000e+00	270000	270000	270000.000000	1620000	0.000000e+00	NaN	NaN	NaN	0.0	NaN	0.0	0.0	0.000000	0.0	0.000000e+00	NaN	NaN	NaN	0.0	NaN	NaN	NaN	NaN	0.0	NaN	0.0	0.0	0.000000	0.000	0.000000e+00	NaN	NaN	NaN	0.00	NaN	0.0	0.0	0.000000	0.000	0.000000e+00	...	0	0	0.000000	0	0.000000	0	0	0.000000	0	0.000000	1	1	1.000000	6	0.000000	0	0	0.0	0	0.0	0	0	0.000000	0	0.000000	0	0	0.0	0	0.0	0	0	0.0	0	0.0	0	0	0.0	0	0.0	0	0	0.0	0	0.0	0	0	0	0	0.0
100011	-75	-2	-38.5	-2849	462.5	0.0	189000.00	54482.111149	4031676.225	4.641321e+09	90000	180000	164189.189189	12150000	1.189060e+09	0.0	180000.0	2432.432432	180000.0	4.378378e+08	0.0	180000.0	2432.432432	180000.0	4.378378e+08	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	9000.0	3956.221849	288804.195	2.013991e+07	0.0	55485.0	4843.064189	358386.75	5.299260e+07	0.0	55485.0	4520.067568	334485.000	5.585877e+07	...	0	0	0.000000	0	0.000000	0	0	0.000000	0	0.000000	1	1	1.000000	74	0.000000	0	0	0.0	0	0.0	0	0	0.000000	0	0.000000	0	0	0.0	0	0.0	0	0	0.0	0	0.0	0	0	0.0	0	0.0	0	0	0.0	0	0.0	0	0	0	0	0.0
100013	-96	-1	-48.5	-4656	776.0	0.0	161420.22	18159.919219	1743352.245	1.869473e+09	45000	157500	131718.750000	12645000	2.259252e+09	0.0	157500.0	6350.000000	571500.0	8.249688e+08	0.0	157500.0	5953.125000	571500.0	7.752530e+08	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	7875.0	1454.539551	129454.020	9.171263e+06	0.0	153675.0	7168.346250	688161.24	4.676901e+08	0.0	153675.0	6817.172344	654448.545	4.722214e+08	...	0	1	0.010417	1	0.010417	0	1	0.010417	1	0.010417	1	1	1.000000	96	0.000000	0	0	0.0	0	0.0	0	0	0.000000	0	0.000000	0	0	0.0	0	0.0	0	0	0.0	0	0.0	0	0	0.0	0	0.0	0	0	0.0	0	0.0	0	0	0	0	0.0
100021	-18	-2	-10.0	-170	25.5	0.0	0.00	0.000000	0.000	0.000000e+00	675000	675000	675000.000000	11475000	0.000000e+00	NaN	NaN	NaN	0.0	NaN	0.0	0.0	0.000000	0.0	0.000000e+00	NaN	NaN	NaN	0.0	NaN	NaN	NaN	NaN	0.0	NaN	0.0	0.0	0.000000	0.000	0.000000e+00	NaN	NaN	NaN	0.00	NaN	0.0	0.0	0.000000	0.000	0.000000e+00	...	0	0	0.000000	0	0.000000	0	0	0.000000	0	0.000000	0	1	0.411765	7	0.257353	0	0	0.0	0	0.0	0	1	0.588235	10	0.257353	0	0	0.0	0	0.0	0	0	0.0	0	0.0	0	0	0.0	0	0.0	0	0	0.0	0	0.0	0	0	0	0	0.0
100023	-11	-4	-7.5	-60	6.0	0.0	0.00	0.000000	0.000	0.000000e+00	45000	225000	135000.000000	1080000	9.257143e+09	NaN	NaN	NaN	0.0	NaN	0.0	0.0	0.000000	0.0	0.000000e+00	NaN	NaN	NaN	0.0	NaN	NaN	NaN	NaN	0.0	NaN	0.0	0.0	0.000000	0.000	0.000000e+00	NaN	NaN	NaN	0.00	NaN	0.0	0.0	0.000000	0.000	0.000000e+00	...	0	0	0.000000	0	0.000000	0	0	0.000000	0	0.000000	1	1	1.000000	8	0.000000	0	0	0.0	0	0.0	0	0	0.000000	0	0.000000	0	0	0.0	0	0.0	0	0	0.0	0	0.0	0	0	0.0	0	0.0	0	0	0.0	0	0.0	0	0	0	0	0.0

5 rows × 140 columns

# 3.2 Change column names
cc_agg.columns = pd.Index(['CC_' + e[0] + "_" + e[1].upper() for e in cc_agg.columns.tolist()])

# 3.3 Create another feature
#     For each client, how many observations
#     exist in this dataset

cc_agg['CC_COUNT'] = cc.groupby('SK_ID_CURR').size()

# 3.3.1
cc_agg['CC_COUNT'].head()

SK_ID_CURR
100006    6 
100011    74
100013    96
100021    17
100023    8 
Name: CC_COUNT, dtype: int64

# 4.0 Save the results for subsequent use:
cc_agg.to_csv("processed_creditCard_agg.csv.zip", compression = "zip")

################

Home Credit Default Risk Assessment

Credit Card balance data¶

About the Dataset¶

Feature Explanations¶

About the Dataset¶

Feature explanations:¶