POS Cash balance¶

About the data¶

POS_CASH_BALANCE: Monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows. This dataset contrasts with bureau_balance dataset where monthly installments were of loans with bureau.

Feature explanations¶

SK_ID_PREV : ID of previous credit in Home Credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit) SK_ID_CURR: ID of loan in our sample MONTHS_BALANCE: Month of balance relative to application date (-1 means the information to the freshest monthly snapshot, 0 means the information at application - often it will be the same as -1 as many banks are not updating the information to Credit Bureau regularly ) CNT_INSTALMENT: Term of previous credit (can change over time) CNT_INSTALMENT_FUTURE: Installments left to pay on the previous credit NAME_CONTRACT_STATUS: Contract status during the month SK_DPD: DPD (days past due) during the month of previous credit SK_DPD_DEF: DPD during the month with tolerance (debts with low loan amounts are ignored) of the previous credit

# Last amended: 24th October, 2020
# Myfolder: C:\Users\Administrator\OneDrive\Documents\home_credit_default_risk
# Objective: 
#           Solving Kaggle problem: Home Credit Default Risk
#           Processing POS_CASH_balance dataset
#
# Data Source: https://www.kaggle.com/c/home-credit-default-risk/data
# Ref: https://www.kaggle.com/jsaguiar/lightgbm-with-simple-features

# 1.0 Libraries
#     (Some of these may not be needed here.)
%reset -f
import numpy as np
import pandas as pd
import gc

# 1.1 Reduce read data size
#     There is a file reducing.py
#      in this folder. A class
#       in it is used to reduce
#        dataframe size
#     (Code modified by me to
#      exclude 'category' dtype)
import reducing

# 1.2 Misc
import warnings
import os
warnings.simplefilter(action='ignore', category=FutureWarning)

# 1.3
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

# 1.4 Display multiple commands outputs from a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# 2.0 Onehot encoding (OHE) function. Uses pd.get_dummies()
#     i) To transform 'object' columns to dummies. 
#    ii) Treat NaN as one of the categories
#   iii) Returns transformed-data and new-columns created

def one_hot_encoder(df, nan_as_category = True):
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df,
                        columns= categorical_columns,
                        dummy_na= nan_as_category       # Treat NaNs as category
                       )
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns

# 3.0 Prepare to read data
pathToData = "C:\\Users\\Administrator\\OneDrive\\Documents\\home_credit_default_risk"
os.chdir(pathToData)

# 3.1 Some constants
num_rows=None                # Implies read all rows
nan_as_category = True       # While transforming 
                             #   'object' columns to dummies

About the data¶

POS_CASH_BALANCE: Monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows. This dataset contrasts with bureau_balance dataset where monthly installments were of loans with bureau.

# 3.2 Read previous application data first
pos = pd.read_csv(
                   'POS_CASH_balance.csv.zip',
                   nrows = num_rows
                   )

# 3.0.1 Reduce memory usage by appropriately
#       changing data-types per feature:

pos = reducing.Reducer().reduce(pos)

reduced df from 610.4346 MB to 352.9075 MB in 3.50 seconds

# 3.3
pos.shape    # (rows: 1,00,01358, cols: 8)
pos.head()

(10001358, 8)

	SK_ID_PREV	SK_ID_CURR	MONTHS_BALANCE	CNT_INSTALMENT	CNT_INSTALMENT_FUTURE	NAME_CONTRACT_STATUS
0	1803195	182943	-31	48.0	45.0	Active
1	1715348	367990	-33	36.0	35.0	Active
2	1784872	397406	-32	12.0	9.0	Active
3	1903291	269225	-35	48.0	42.0	Active
4	2341044	334279	-35	36.0	35.0	Active

Feature explanations¶

SK_ID_PREV : ID of previous credit in Home Credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit) SK_ID_CURR: ID of loan in our sample MONTHS_BALANCE: Month of balance relative to application date (-1 means the information to the freshest monthly snapshot, 0 means the information at application - often it will be the same as -1 as many banks are not updating the information to Credit Bureau regularly ) CNT_INSTALMENT: Term of previous credit (can change over time) CNT_INSTALMENT_FUTURE: Installments left to pay on the previous credit NAME_CONTRACT_STATUS: Contract status during the month SK_DPD: DPD (days past due) during the month of previous credit SK_DPD_DEF: DPD during the month with tolerance (debts with low loan amounts are ignored) of the previous credit

# 3.3.1 There is one object type
pos.dtypes.value_counts()

float64    2
uint32     2
uint16     2
object     1
int8       1
dtype: int64

# 4.0 Transform object type columns to OHE
pos, cat_cols = one_hot_encoder(pos, nan_as_category= True)

# 4.1
pos.shape    # (10001358, 17)
pos.head()

(10001358, 17)

	SK_ID_PREV	SK_ID_CURR	MONTHS_BALANCE	CNT_INSTALMENT	CNT_INSTALMENT_FUTURE	NAME_CONTRACT_STATUS_Active
0	1803195	182943	-31	48.0	45.0	1
1	1715348	367990	-33	36.0	35.0	1
2	1784872	397406	-32	12.0	9.0	1
3	1903291	269225	-35	48.0	42.0	1
4	2341044	334279	-35	36.0	35.0	1

# 4.2 New columns are:
cat_cols

['NAME_CONTRACT_STATUS_Active',
 'NAME_CONTRACT_STATUS_Amortized debt',
 'NAME_CONTRACT_STATUS_Approved',
 'NAME_CONTRACT_STATUS_Canceled',
 'NAME_CONTRACT_STATUS_Completed',
 'NAME_CONTRACT_STATUS_Demand',
 'NAME_CONTRACT_STATUS_Returned to the store',
 'NAME_CONTRACT_STATUS_Signed',
 'NAME_CONTRACT_STATUS_XNA',
 'NAME_CONTRACT_STATUS_nan']

# 4.3 How to aggregate features:
#     Note CNT_INSTALMENT and CNT_INSTALMENT_FUTURE
#       do not find place:

aggregations = {
                'MONTHS_BALANCE': ['max', 'mean', 'size'],
                'SK_DPD':         ['max', 'mean'],
                'SK_DPD_DEF':     ['max', 'mean']
               }

# 4.3.1
for cat in cat_cols:
    aggregations[cat] = ['mean']
    

# 4.3.2 Full dictionary
aggregations

{'MONTHS_BALANCE': ['max', 'mean', 'size'],
 'SK_DPD': ['max', 'mean'],
 'SK_DPD_DEF': ['max', 'mean'],
 'NAME_CONTRACT_STATUS_Active': ['mean'],
 'NAME_CONTRACT_STATUS_Amortized debt': ['mean'],
 'NAME_CONTRACT_STATUS_Approved': ['mean'],
 'NAME_CONTRACT_STATUS_Canceled': ['mean'],
 'NAME_CONTRACT_STATUS_Completed': ['mean'],
 'NAME_CONTRACT_STATUS_Demand': ['mean'],
 'NAME_CONTRACT_STATUS_Returned to the store': ['mean'],
 'NAME_CONTRACT_STATUS_Signed': ['mean'],
 'NAME_CONTRACT_STATUS_XNA': ['mean'],
 'NAME_CONTRACT_STATUS_nan': ['mean']}

# 5.0 Aggregate now
grouped = pos.groupby('SK_ID_CURR')
pos_agg = grouped.agg(aggregations)

# 5.1
pos_agg.shape     # (337252, 17)
pos_agg.head()
pos_agg.columns

(337252, 17)

	MONTHS_BALANCE			SK_DPD		SK_DPD_DEF		NAME_CONTRACT_STATUS_Active	NAME_CONTRACT_STATUS_Amortized debt	NAME_CONTRACT_STATUS_Approved	NAME_CONTRACT_STATUS_Canceled	NAME_CONTRACT_STATUS_Completed	NAME_CONTRACT_STATUS_Demand	NAME_CONTRACT_STATUS_Returned to the store	NAME_CONTRACT_STATUS_Signed	NAME_CONTRACT_STATUS_XNA	NAME_CONTRACT_STATUS_nan
	max	mean	size	max	mean	max	mean	mean	mean	mean	mean	mean	mean	mean	mean	mean	mean
SK_ID_CURR
100001	-53	-72.555556	9	7	0.777778	7	0.777778	0.777778	0.0	0.0	0.0	0.222222	0.0	0.0	0.000000	0.0	0
100002	-1	-10.000000	19	0	0.000000	0	0.000000	1.000000	0.0	0.0	0.0	0.000000	0.0	0.0	0.000000	0.0	0
100003	-18	-43.785714	28	0	0.000000	0	0.000000	0.928571	0.0	0.0	0.0	0.071429	0.0	0.0	0.000000	0.0	0
100004	-24	-25.500000	4	0	0.000000	0	0.000000	0.750000	0.0	0.0	0.0	0.250000	0.0	0.0	0.000000	0.0	0
100005	-15	-20.000000	11	0	0.000000	0	0.000000	0.818182	0.0	0.0	0.0	0.090909	0.0	0.0	0.090909	0.0	0

MultiIndex([(                            'MONTHS_BALANCE',  'max'),
            (                            'MONTHS_BALANCE', 'mean'),
            (                            'MONTHS_BALANCE', 'size'),
            (                                    'SK_DPD',  'max'),
            (                                    'SK_DPD', 'mean'),
            (                                'SK_DPD_DEF',  'max'),
            (                                'SK_DPD_DEF', 'mean'),
            (               'NAME_CONTRACT_STATUS_Active', 'mean'),
            (       'NAME_CONTRACT_STATUS_Amortized debt', 'mean'),
            (             'NAME_CONTRACT_STATUS_Approved', 'mean'),
            (             'NAME_CONTRACT_STATUS_Canceled', 'mean'),
            (            'NAME_CONTRACT_STATUS_Completed', 'mean'),
            (               'NAME_CONTRACT_STATUS_Demand', 'mean'),
            ('NAME_CONTRACT_STATUS_Returned to the store', 'mean'),
            (               'NAME_CONTRACT_STATUS_Signed', 'mean'),
            (                  'NAME_CONTRACT_STATUS_XNA', 'mean'),
            (                  'NAME_CONTRACT_STATUS_nan', 'mean')],
           )

# 5.2 Rename multiindex columns
pos_agg.columns = pd.Index(['POS_' + e[0] + "_" + e[1].upper() for e in pos_agg.columns.tolist()])

# 5.3
pos_agg.columns
pos_agg.head()

Index(['POS_MONTHS_BALANCE_MAX', 'POS_MONTHS_BALANCE_MEAN',
       'POS_MONTHS_BALANCE_SIZE', 'POS_SK_DPD_MAX', 'POS_SK_DPD_MEAN',
       'POS_SK_DPD_DEF_MAX', 'POS_SK_DPD_DEF_MEAN',
       'POS_NAME_CONTRACT_STATUS_Active_MEAN',
       'POS_NAME_CONTRACT_STATUS_Amortized debt_MEAN',
       'POS_NAME_CONTRACT_STATUS_Approved_MEAN',
       'POS_NAME_CONTRACT_STATUS_Canceled_MEAN',
       'POS_NAME_CONTRACT_STATUS_Completed_MEAN',
       'POS_NAME_CONTRACT_STATUS_Demand_MEAN',
       'POS_NAME_CONTRACT_STATUS_Returned to the store_MEAN',
       'POS_NAME_CONTRACT_STATUS_Signed_MEAN',
       'POS_NAME_CONTRACT_STATUS_XNA_MEAN',
       'POS_NAME_CONTRACT_STATUS_nan_MEAN'],
      dtype='object')

	POS_MONTHS_BALANCE_MAX	POS_MONTHS_BALANCE_MEAN	POS_MONTHS_BALANCE_SIZE	POS_SK_DPD_MAX	POS_SK_DPD_MEAN	POS_SK_DPD_DEF_MAX	POS_SK_DPD_DEF_MEAN	POS_NAME_CONTRACT_STATUS_Active_MEAN	POS_NAME_CONTRACT_STATUS_Amortized debt_MEAN	POS_NAME_CONTRACT_STATUS_Approved_MEAN	POS_NAME_CONTRACT_STATUS_Canceled_MEAN	POS_NAME_CONTRACT_STATUS_Completed_MEAN	POS_NAME_CONTRACT_STATUS_Demand_MEAN	POS_NAME_CONTRACT_STATUS_Returned to the store_MEAN	POS_NAME_CONTRACT_STATUS_Signed_MEAN	POS_NAME_CONTRACT_STATUS_XNA_MEAN	POS_NAME_CONTRACT_STATUS_nan_MEAN
SK_ID_CURR
100001	-53	-72.555556	9	7	0.777778	7	0.777778	0.777778	0.0	0.0	0.0	0.222222	0.0	0.0	0.000000	0.0	0
100002	-1	-10.000000	19	0	0.000000	0	0.000000	1.000000	0.0	0.0	0.0	0.000000	0.0	0.0	0.000000	0.0	0
100003	-18	-43.785714	28	0	0.000000	0	0.000000	0.928571	0.0	0.0	0.0	0.071429	0.0	0.0	0.000000	0.0	0
100004	-24	-25.500000	4	0	0.000000	0	0.000000	0.750000	0.0	0.0	0.0	0.250000	0.0	0.0	0.000000	0.0	0
100005	-15	-20.000000	11	0	0.000000	0	0.000000	0.818182	0.0	0.0	0.0	0.090909	0.0	0.0	0.090909	0.0	0

# 5.4 Count pos cash accounts
#     Per client how many entries/rows exist
pos_agg['POS_COUNT'] = pos.groupby('SK_ID_CURR').size()

pos_agg.head()

	POS_MONTHS_BALANCE_MAX	POS_MONTHS_BALANCE_MEAN	POS_MONTHS_BALANCE_SIZE	POS_SK_DPD_MAX	POS_SK_DPD_MEAN	POS_SK_DPD_DEF_MAX	POS_SK_DPD_DEF_MEAN	POS_NAME_CONTRACT_STATUS_Active_MEAN	POS_NAME_CONTRACT_STATUS_Amortized debt_MEAN	POS_NAME_CONTRACT_STATUS_Approved_MEAN	POS_NAME_CONTRACT_STATUS_Canceled_MEAN	POS_NAME_CONTRACT_STATUS_Completed_MEAN	POS_NAME_CONTRACT_STATUS_Demand_MEAN	POS_NAME_CONTRACT_STATUS_Returned to the store_MEAN	POS_NAME_CONTRACT_STATUS_Signed_MEAN	POS_NAME_CONTRACT_STATUS_XNA_MEAN	POS_NAME_CONTRACT_STATUS_nan_MEAN	POS_COUNT
SK_ID_CURR
100001	-53	-72.555556	9	7	0.777778	7	0.777778	0.777778	0.0	0.0	0.0	0.222222	0.0	0.0	0.000000	0.0	0	9
100002	-1	-10.000000	19	0	0.000000	0	0.000000	1.000000	0.0	0.0	0.0	0.000000	0.0	0.0	0.000000	0.0	0	19
100003	-18	-43.785714	28	0	0.000000	0	0.000000	0.928571	0.0	0.0	0.0	0.071429	0.0	0.0	0.000000	0.0	0	28
100004	-24	-25.500000	4	0	0.000000	0	0.000000	0.750000	0.0	0.0	0.0	0.250000	0.0	0.0	0.000000	0.0	0	4
100005	-15	-20.000000	11	0	0.000000	0	0.000000	0.818182	0.0	0.0	0.0	0.090909	0.0	0.0	0.090909	0.0	0	11

# 6.0 Save the results for subsequent use:
pos_agg.to_csv("processed_pos_agg.csv.zip", compression = "zip")   
    

######################

Home Credit Default Risk Assessment

POS Cash balance¶

About the data¶

Feature explanations¶

About the data¶

Feature explanations¶