POS Cash balance¶
About the data¶
POS_CASH_BALANCE: Monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows. This dataset contrasts with bureau_balance dataset where monthly installments were of loans with bureau.
Feature explanations¶
SK_ID_PREV : ID of previous credit in Home Credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit) SK_ID_CURR: ID of loan in our sample MONTHS_BALANCE: Month of balance relative to application date (-1 means the information to the freshest monthly snapshot, 0 means the information at application - often it will be the same as -1 as many banks are not updating the information to Credit Bureau regularly ) CNT_INSTALMENT: Term of previous credit (can change over time) CNT_INSTALMENT_FUTURE: Installments left to pay on the previous credit NAME_CONTRACT_STATUS: Contract status during the month SK_DPD: DPD (days past due) during the month of previous credit SK_DPD_DEF: DPD during the month with tolerance (debts with low loan amounts are ignored) of the previous credit
# Last amended: 24th October, 2020
# Myfolder: C:\Users\Administrator\OneDrive\Documents\home_credit_default_risk
# Objective:
# Solving Kaggle problem: Home Credit Default Risk
# Processing POS_CASH_balance dataset
#
# Data Source: https://www.kaggle.com/c/home-credit-default-risk/data
# Ref: https://www.kaggle.com/jsaguiar/lightgbm-with-simple-features
# 1.0 Libraries
# (Some of these may not be needed here.)
%reset -f
import numpy as np
import pandas as pd
import gc
# 1.1 Reduce read data size
# There is a file reducing.py
# in this folder. A class
# in it is used to reduce
# dataframe size
# (Code modified by me to
# exclude 'category' dtype)
import reducing
# 1.2 Misc
import warnings
import os
warnings.simplefilter(action='ignore', category=FutureWarning)
# 1.3
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
# 1.4 Display multiple commands outputs from a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# 2.0 Onehot encoding (OHE) function. Uses pd.get_dummies()
# i) To transform 'object' columns to dummies.
# ii) Treat NaN as one of the categories
# iii) Returns transformed-data and new-columns created
def one_hot_encoder(df, nan_as_category = True):
original_columns = list(df.columns)
categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
df = pd.get_dummies(df,
columns= categorical_columns,
dummy_na= nan_as_category # Treat NaNs as category
)
new_columns = [c for c in df.columns if c not in original_columns]
return df, new_columns
# 3.0 Prepare to read data
pathToData = "C:\\Users\\Administrator\\OneDrive\\Documents\\home_credit_default_risk"
os.chdir(pathToData)
# 3.1 Some constants
num_rows=None # Implies read all rows
nan_as_category = True # While transforming
# 'object' columns to dummies
About the data¶
POS_CASH_BALANCE: Monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows. This dataset contrasts with bureau_balance dataset where monthly installments were of loans with bureau.
# 3.2 Read previous application data first
pos = pd.read_csv(
'POS_CASH_balance.csv.zip',
nrows = num_rows
)
# 3.0.1 Reduce memory usage by appropriately
# changing data-types per feature:
pos = reducing.Reducer().reduce(pos)
reduced df from 610.4346 MB to 352.9075 MB in 3.50 seconds
# 3.3
pos.shape # (rows: 1,00,01358, cols: 8)
pos.head()
(10001358, 8)
SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
---|---|---|---|---|---|---|---|---|
0 | 1803195 | 182943 | -31 | 48.0 | 45.0 | Active | 0 | 0 |
1 | 1715348 | 367990 | -33 | 36.0 | 35.0 | Active | 0 | 0 |
2 | 1784872 | 397406 | -32 | 12.0 | 9.0 | Active | 0 | 0 |
3 | 1903291 | 269225 | -35 | 48.0 | 42.0 | Active | 0 | 0 |
4 | 2341044 | 334279 | -35 | 36.0 | 35.0 | Active | 0 | 0 |
Feature explanations¶
SK_ID_PREV : ID of previous credit in Home Credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit) SK_ID_CURR: ID of loan in our sample MONTHS_BALANCE: Month of balance relative to application date (-1 means the information to the freshest monthly snapshot, 0 means the information at application - often it will be the same as -1 as many banks are not updating the information to Credit Bureau regularly ) CNT_INSTALMENT: Term of previous credit (can change over time) CNT_INSTALMENT_FUTURE: Installments left to pay on the previous credit NAME_CONTRACT_STATUS: Contract status during the month SK_DPD: DPD (days past due) during the month of previous credit SK_DPD_DEF: DPD during the month with tolerance (debts with low loan amounts are ignored) of the previous credit
# 3.3.1 There is one object type
pos.dtypes.value_counts()
float64 2
uint32 2
uint16 2
object 1
int8 1
dtype: int64
# 4.0 Transform object type columns to OHE
pos, cat_cols = one_hot_encoder(pos, nan_as_category= True)
# 4.1
pos.shape # (10001358, 17)
pos.head()
(10001358, 17)
SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | SK_DPD | SK_DPD_DEF | NAME_CONTRACT_STATUS_Active | NAME_CONTRACT_STATUS_Amortized debt | NAME_CONTRACT_STATUS_Approved | NAME_CONTRACT_STATUS_Canceled | NAME_CONTRACT_STATUS_Completed | NAME_CONTRACT_STATUS_Demand | NAME_CONTRACT_STATUS_Returned to the store | NAME_CONTRACT_STATUS_Signed | NAME_CONTRACT_STATUS_XNA | NAME_CONTRACT_STATUS_nan | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1803195 | 182943 | -31 | 48.0 | 45.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1715348 | 367990 | -33 | 36.0 | 35.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 1784872 | 397406 | -32 | 12.0 | 9.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 1903291 | 269225 | -35 | 48.0 | 42.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 2341044 | 334279 | -35 | 36.0 | 35.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# 4.2 New columns are:
cat_cols
['NAME_CONTRACT_STATUS_Active',
'NAME_CONTRACT_STATUS_Amortized debt',
'NAME_CONTRACT_STATUS_Approved',
'NAME_CONTRACT_STATUS_Canceled',
'NAME_CONTRACT_STATUS_Completed',
'NAME_CONTRACT_STATUS_Demand',
'NAME_CONTRACT_STATUS_Returned to the store',
'NAME_CONTRACT_STATUS_Signed',
'NAME_CONTRACT_STATUS_XNA',
'NAME_CONTRACT_STATUS_nan']
# 4.3 How to aggregate features:
# Note CNT_INSTALMENT and CNT_INSTALMENT_FUTURE
# do not find place:
aggregations = {
'MONTHS_BALANCE': ['max', 'mean', 'size'],
'SK_DPD': ['max', 'mean'],
'SK_DPD_DEF': ['max', 'mean']
}
# 4.3.1
for cat in cat_cols:
aggregations[cat] = ['mean']
# 4.3.2 Full dictionary
aggregations
{'MONTHS_BALANCE': ['max', 'mean', 'size'],
'SK_DPD': ['max', 'mean'],
'SK_DPD_DEF': ['max', 'mean'],
'NAME_CONTRACT_STATUS_Active': ['mean'],
'NAME_CONTRACT_STATUS_Amortized debt': ['mean'],
'NAME_CONTRACT_STATUS_Approved': ['mean'],
'NAME_CONTRACT_STATUS_Canceled': ['mean'],
'NAME_CONTRACT_STATUS_Completed': ['mean'],
'NAME_CONTRACT_STATUS_Demand': ['mean'],
'NAME_CONTRACT_STATUS_Returned to the store': ['mean'],
'NAME_CONTRACT_STATUS_Signed': ['mean'],
'NAME_CONTRACT_STATUS_XNA': ['mean'],
'NAME_CONTRACT_STATUS_nan': ['mean']}
# 5.0 Aggregate now
grouped = pos.groupby('SK_ID_CURR')
pos_agg = grouped.agg(aggregations)
# 5.1
pos_agg.shape # (337252, 17)
pos_agg.head()
pos_agg.columns
(337252, 17)
MONTHS_BALANCE | SK_DPD | SK_DPD_DEF | NAME_CONTRACT_STATUS_Active | NAME_CONTRACT_STATUS_Amortized debt | NAME_CONTRACT_STATUS_Approved | NAME_CONTRACT_STATUS_Canceled | NAME_CONTRACT_STATUS_Completed | NAME_CONTRACT_STATUS_Demand | NAME_CONTRACT_STATUS_Returned to the store | NAME_CONTRACT_STATUS_Signed | NAME_CONTRACT_STATUS_XNA | NAME_CONTRACT_STATUS_nan | |||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
max | mean | size | max | mean | max | mean | mean | mean | mean | mean | mean | mean | mean | mean | mean | mean | |
SK_ID_CURR | |||||||||||||||||
100001 | -53 | -72.555556 | 9 | 7 | 0.777778 | 7 | 0.777778 | 0.777778 | 0.0 | 0.0 | 0.0 | 0.222222 | 0.0 | 0.0 | 0.000000 | 0.0 | 0 |
100002 | -1 | -10.000000 | 19 | 0 | 0.000000 | 0 | 0.000000 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0 |
100003 | -18 | -43.785714 | 28 | 0 | 0.000000 | 0 | 0.000000 | 0.928571 | 0.0 | 0.0 | 0.0 | 0.071429 | 0.0 | 0.0 | 0.000000 | 0.0 | 0 |
100004 | -24 | -25.500000 | 4 | 0 | 0.000000 | 0 | 0.000000 | 0.750000 | 0.0 | 0.0 | 0.0 | 0.250000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0 |
100005 | -15 | -20.000000 | 11 | 0 | 0.000000 | 0 | 0.000000 | 0.818182 | 0.0 | 0.0 | 0.0 | 0.090909 | 0.0 | 0.0 | 0.090909 | 0.0 | 0 |
MultiIndex([( 'MONTHS_BALANCE', 'max'),
( 'MONTHS_BALANCE', 'mean'),
( 'MONTHS_BALANCE', 'size'),
( 'SK_DPD', 'max'),
( 'SK_DPD', 'mean'),
( 'SK_DPD_DEF', 'max'),
( 'SK_DPD_DEF', 'mean'),
( 'NAME_CONTRACT_STATUS_Active', 'mean'),
( 'NAME_CONTRACT_STATUS_Amortized debt', 'mean'),
( 'NAME_CONTRACT_STATUS_Approved', 'mean'),
( 'NAME_CONTRACT_STATUS_Canceled', 'mean'),
( 'NAME_CONTRACT_STATUS_Completed', 'mean'),
( 'NAME_CONTRACT_STATUS_Demand', 'mean'),
('NAME_CONTRACT_STATUS_Returned to the store', 'mean'),
( 'NAME_CONTRACT_STATUS_Signed', 'mean'),
( 'NAME_CONTRACT_STATUS_XNA', 'mean'),
( 'NAME_CONTRACT_STATUS_nan', 'mean')],
)
# 5.2 Rename multiindex columns
pos_agg.columns = pd.Index(['POS_' + e[0] + "_" + e[1].upper() for e in pos_agg.columns.tolist()])
# 5.3
pos_agg.columns
pos_agg.head()
Index(['POS_MONTHS_BALANCE_MAX', 'POS_MONTHS_BALANCE_MEAN',
'POS_MONTHS_BALANCE_SIZE', 'POS_SK_DPD_MAX', 'POS_SK_DPD_MEAN',
'POS_SK_DPD_DEF_MAX', 'POS_SK_DPD_DEF_MEAN',
'POS_NAME_CONTRACT_STATUS_Active_MEAN',
'POS_NAME_CONTRACT_STATUS_Amortized debt_MEAN',
'POS_NAME_CONTRACT_STATUS_Approved_MEAN',
'POS_NAME_CONTRACT_STATUS_Canceled_MEAN',
'POS_NAME_CONTRACT_STATUS_Completed_MEAN',
'POS_NAME_CONTRACT_STATUS_Demand_MEAN',
'POS_NAME_CONTRACT_STATUS_Returned to the store_MEAN',
'POS_NAME_CONTRACT_STATUS_Signed_MEAN',
'POS_NAME_CONTRACT_STATUS_XNA_MEAN',
'POS_NAME_CONTRACT_STATUS_nan_MEAN'],
dtype='object')
POS_MONTHS_BALANCE_MAX | POS_MONTHS_BALANCE_MEAN | POS_MONTHS_BALANCE_SIZE | POS_SK_DPD_MAX | POS_SK_DPD_MEAN | POS_SK_DPD_DEF_MAX | POS_SK_DPD_DEF_MEAN | POS_NAME_CONTRACT_STATUS_Active_MEAN | POS_NAME_CONTRACT_STATUS_Amortized debt_MEAN | POS_NAME_CONTRACT_STATUS_Approved_MEAN | POS_NAME_CONTRACT_STATUS_Canceled_MEAN | POS_NAME_CONTRACT_STATUS_Completed_MEAN | POS_NAME_CONTRACT_STATUS_Demand_MEAN | POS_NAME_CONTRACT_STATUS_Returned to the store_MEAN | POS_NAME_CONTRACT_STATUS_Signed_MEAN | POS_NAME_CONTRACT_STATUS_XNA_MEAN | POS_NAME_CONTRACT_STATUS_nan_MEAN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SK_ID_CURR | |||||||||||||||||
100001 | -53 | -72.555556 | 9 | 7 | 0.777778 | 7 | 0.777778 | 0.777778 | 0.0 | 0.0 | 0.0 | 0.222222 | 0.0 | 0.0 | 0.000000 | 0.0 | 0 |
100002 | -1 | -10.000000 | 19 | 0 | 0.000000 | 0 | 0.000000 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0 |
100003 | -18 | -43.785714 | 28 | 0 | 0.000000 | 0 | 0.000000 | 0.928571 | 0.0 | 0.0 | 0.0 | 0.071429 | 0.0 | 0.0 | 0.000000 | 0.0 | 0 |
100004 | -24 | -25.500000 | 4 | 0 | 0.000000 | 0 | 0.000000 | 0.750000 | 0.0 | 0.0 | 0.0 | 0.250000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0 |
100005 | -15 | -20.000000 | 11 | 0 | 0.000000 | 0 | 0.000000 | 0.818182 | 0.0 | 0.0 | 0.0 | 0.090909 | 0.0 | 0.0 | 0.090909 | 0.0 | 0 |
# 5.4 Count pos cash accounts
# Per client how many entries/rows exist
pos_agg['POS_COUNT'] = pos.groupby('SK_ID_CURR').size()
pos_agg.head()
POS_MONTHS_BALANCE_MAX | POS_MONTHS_BALANCE_MEAN | POS_MONTHS_BALANCE_SIZE | POS_SK_DPD_MAX | POS_SK_DPD_MEAN | POS_SK_DPD_DEF_MAX | POS_SK_DPD_DEF_MEAN | POS_NAME_CONTRACT_STATUS_Active_MEAN | POS_NAME_CONTRACT_STATUS_Amortized debt_MEAN | POS_NAME_CONTRACT_STATUS_Approved_MEAN | POS_NAME_CONTRACT_STATUS_Canceled_MEAN | POS_NAME_CONTRACT_STATUS_Completed_MEAN | POS_NAME_CONTRACT_STATUS_Demand_MEAN | POS_NAME_CONTRACT_STATUS_Returned to the store_MEAN | POS_NAME_CONTRACT_STATUS_Signed_MEAN | POS_NAME_CONTRACT_STATUS_XNA_MEAN | POS_NAME_CONTRACT_STATUS_nan_MEAN | POS_COUNT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SK_ID_CURR | ||||||||||||||||||
100001 | -53 | -72.555556 | 9 | 7 | 0.777778 | 7 | 0.777778 | 0.777778 | 0.0 | 0.0 | 0.0 | 0.222222 | 0.0 | 0.0 | 0.000000 | 0.0 | 0 | 9 |
100002 | -1 | -10.000000 | 19 | 0 | 0.000000 | 0 | 0.000000 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0 | 19 |
100003 | -18 | -43.785714 | 28 | 0 | 0.000000 | 0 | 0.000000 | 0.928571 | 0.0 | 0.0 | 0.0 | 0.071429 | 0.0 | 0.0 | 0.000000 | 0.0 | 0 | 28 |
100004 | -24 | -25.500000 | 4 | 0 | 0.000000 | 0 | 0.000000 | 0.750000 | 0.0 | 0.0 | 0.0 | 0.250000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0 | 4 |
100005 | -15 | -20.000000 | 11 | 0 | 0.000000 | 0 | 0.000000 | 0.818182 | 0.0 | 0.0 | 0.0 | 0.090909 | 0.0 | 0.0 | 0.090909 | 0.0 | 0 | 11 |
# 6.0 Save the results for subsequent use:
pos_agg.to_csv("processed_pos_agg.csv.zip", compression = "zip")
######################