POS Cash balance

About the data

POS_CASH_BALANCE: Monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows. This dataset contrasts with bureau_balance dataset where monthly installments were of loans with bureau.

Feature explanations

SK_ID_PREV : ID of previous credit in Home Credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit) SK_ID_CURR: ID of loan in our sample MONTHS_BALANCE: Month of balance relative to application date (-1 means the information to the freshest monthly snapshot, 0 means the information at application - often it will be the same as -1 as many banks are not updating the information to Credit Bureau regularly ) CNT_INSTALMENT: Term of previous credit (can change over time) CNT_INSTALMENT_FUTURE: Installments left to pay on the previous credit NAME_CONTRACT_STATUS: Contract status during the month SK_DPD: DPD (days past due) during the month of previous credit SK_DPD_DEF: DPD during the month with tolerance (debts with low loan amounts are ignored) of the previous credit

# Last amended: 24th October, 2020
# Myfolder: C:\Users\Administrator\OneDrive\Documents\home_credit_default_risk
# Objective: 
#           Solving Kaggle problem: Home Credit Default Risk
#           Processing POS_CASH_balance dataset
#
# Data Source: https://www.kaggle.com/c/home-credit-default-risk/data
# Ref: https://www.kaggle.com/jsaguiar/lightgbm-with-simple-features
# 1.0 Libraries
#     (Some of these may not be needed here.)
%reset -f
import numpy as np
import pandas as pd
import gc

# 1.1 Reduce read data size
#     There is a file reducing.py
#      in this folder. A class
#       in it is used to reduce
#        dataframe size
#     (Code modified by me to
#      exclude 'category' dtype)
import reducing

# 1.2 Misc
import warnings
import os
warnings.simplefilter(action='ignore', category=FutureWarning)
# 1.3
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
# 1.4 Display multiple commands outputs from a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# 2.0 Onehot encoding (OHE) function. Uses pd.get_dummies()
#     i) To transform 'object' columns to dummies. 
#    ii) Treat NaN as one of the categories
#   iii) Returns transformed-data and new-columns created

def one_hot_encoder(df, nan_as_category = True):
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df,
                        columns= categorical_columns,
                        dummy_na= nan_as_category       # Treat NaNs as category
                       )
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns
# 3.0 Prepare to read data
pathToData = "C:\\Users\\Administrator\\OneDrive\\Documents\\home_credit_default_risk"
os.chdir(pathToData)
# 3.1 Some constants
num_rows=None                # Implies read all rows
nan_as_category = True       # While transforming 
                             #   'object' columns to dummies

About the data

POS_CASH_BALANCE: Monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows. This dataset contrasts with bureau_balance dataset where monthly installments were of loans with bureau.
# 3.2 Read previous application data first
pos = pd.read_csv(
                   'POS_CASH_balance.csv.zip',
                   nrows = num_rows
                   )

# 3.0.1 Reduce memory usage by appropriately
#       changing data-types per feature:

pos = reducing.Reducer().reduce(pos)
reduced df from 610.4346 MB to 352.9075 MB in 3.50 seconds
# 3.3
pos.shape    # (rows: 1,00,01358, cols: 8)
pos.head()
(10001358, 8)
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 1803195 182943 -31 48.0 45.0 Active 0 0
1 1715348 367990 -33 36.0 35.0 Active 0 0
2 1784872 397406 -32 12.0 9.0 Active 0 0
3 1903291 269225 -35 48.0 42.0 Active 0 0
4 2341044 334279 -35 36.0 35.0 Active 0 0

Feature explanations

SK_ID_PREV : ID of previous credit in Home Credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit) SK_ID_CURR: ID of loan in our sample MONTHS_BALANCE: Month of balance relative to application date (-1 means the information to the freshest monthly snapshot, 0 means the information at application - often it will be the same as -1 as many banks are not updating the information to Credit Bureau regularly ) CNT_INSTALMENT: Term of previous credit (can change over time) CNT_INSTALMENT_FUTURE: Installments left to pay on the previous credit NAME_CONTRACT_STATUS: Contract status during the month SK_DPD: DPD (days past due) during the month of previous credit SK_DPD_DEF: DPD during the month with tolerance (debts with low loan amounts are ignored) of the previous credit

# 3.3.1 There is one object type
pos.dtypes.value_counts()
float64    2
uint32     2
uint16     2
object     1
int8       1
dtype: int64
# 4.0 Transform object type columns to OHE
pos, cat_cols = one_hot_encoder(pos, nan_as_category= True)
# 4.1
pos.shape    # (10001358, 17)
pos.head()
(10001358, 17)
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE SK_DPD SK_DPD_DEF NAME_CONTRACT_STATUS_Active NAME_CONTRACT_STATUS_Amortized debt NAME_CONTRACT_STATUS_Approved NAME_CONTRACT_STATUS_Canceled NAME_CONTRACT_STATUS_Completed NAME_CONTRACT_STATUS_Demand NAME_CONTRACT_STATUS_Returned to the store NAME_CONTRACT_STATUS_Signed NAME_CONTRACT_STATUS_XNA NAME_CONTRACT_STATUS_nan
0 1803195 182943 -31 48.0 45.0 0 0 1 0 0 0 0 0 0 0 0 0
1 1715348 367990 -33 36.0 35.0 0 0 1 0 0 0 0 0 0 0 0 0
2 1784872 397406 -32 12.0 9.0 0 0 1 0 0 0 0 0 0 0 0 0
3 1903291 269225 -35 48.0 42.0 0 0 1 0 0 0 0 0 0 0 0 0
4 2341044 334279 -35 36.0 35.0 0 0 1 0 0 0 0 0 0 0 0 0
# 4.2 New columns are:
cat_cols
['NAME_CONTRACT_STATUS_Active',
 'NAME_CONTRACT_STATUS_Amortized debt',
 'NAME_CONTRACT_STATUS_Approved',
 'NAME_CONTRACT_STATUS_Canceled',
 'NAME_CONTRACT_STATUS_Completed',
 'NAME_CONTRACT_STATUS_Demand',
 'NAME_CONTRACT_STATUS_Returned to the store',
 'NAME_CONTRACT_STATUS_Signed',
 'NAME_CONTRACT_STATUS_XNA',
 'NAME_CONTRACT_STATUS_nan']
# 4.3 How to aggregate features:
#     Note CNT_INSTALMENT and CNT_INSTALMENT_FUTURE
#       do not find place:

aggregations = {
                'MONTHS_BALANCE': ['max', 'mean', 'size'],
                'SK_DPD':         ['max', 'mean'],
                'SK_DPD_DEF':     ['max', 'mean']
               }

# 4.3.1
for cat in cat_cols:
    aggregations[cat] = ['mean']
    
# 4.3.2 Full dictionary
aggregations
{'MONTHS_BALANCE': ['max', 'mean', 'size'],
 'SK_DPD': ['max', 'mean'],
 'SK_DPD_DEF': ['max', 'mean'],
 'NAME_CONTRACT_STATUS_Active': ['mean'],
 'NAME_CONTRACT_STATUS_Amortized debt': ['mean'],
 'NAME_CONTRACT_STATUS_Approved': ['mean'],
 'NAME_CONTRACT_STATUS_Canceled': ['mean'],
 'NAME_CONTRACT_STATUS_Completed': ['mean'],
 'NAME_CONTRACT_STATUS_Demand': ['mean'],
 'NAME_CONTRACT_STATUS_Returned to the store': ['mean'],
 'NAME_CONTRACT_STATUS_Signed': ['mean'],
 'NAME_CONTRACT_STATUS_XNA': ['mean'],
 'NAME_CONTRACT_STATUS_nan': ['mean']}
# 5.0 Aggregate now
grouped = pos.groupby('SK_ID_CURR')
pos_agg = grouped.agg(aggregations)
# 5.1
pos_agg.shape     # (337252, 17)
pos_agg.head()
pos_agg.columns
(337252, 17)
MONTHS_BALANCE SK_DPD SK_DPD_DEF NAME_CONTRACT_STATUS_Active NAME_CONTRACT_STATUS_Amortized debt NAME_CONTRACT_STATUS_Approved NAME_CONTRACT_STATUS_Canceled NAME_CONTRACT_STATUS_Completed NAME_CONTRACT_STATUS_Demand NAME_CONTRACT_STATUS_Returned to the store NAME_CONTRACT_STATUS_Signed NAME_CONTRACT_STATUS_XNA NAME_CONTRACT_STATUS_nan
max mean size max mean max mean mean mean mean mean mean mean mean mean mean mean
SK_ID_CURR
100001 -53 -72.555556 9 7 0.777778 7 0.777778 0.777778 0.0 0.0 0.0 0.222222 0.0 0.0 0.000000 0.0 0
100002 -1 -10.000000 19 0 0.000000 0 0.000000 1.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0
100003 -18 -43.785714 28 0 0.000000 0 0.000000 0.928571 0.0 0.0 0.0 0.071429 0.0 0.0 0.000000 0.0 0
100004 -24 -25.500000 4 0 0.000000 0 0.000000 0.750000 0.0 0.0 0.0 0.250000 0.0 0.0 0.000000 0.0 0
100005 -15 -20.000000 11 0 0.000000 0 0.000000 0.818182 0.0 0.0 0.0 0.090909 0.0 0.0 0.090909 0.0 0
MultiIndex([(                            'MONTHS_BALANCE',  'max'),
            (                            'MONTHS_BALANCE', 'mean'),
            (                            'MONTHS_BALANCE', 'size'),
            (                                    'SK_DPD',  'max'),
            (                                    'SK_DPD', 'mean'),
            (                                'SK_DPD_DEF',  'max'),
            (                                'SK_DPD_DEF', 'mean'),
            (               'NAME_CONTRACT_STATUS_Active', 'mean'),
            (       'NAME_CONTRACT_STATUS_Amortized debt', 'mean'),
            (             'NAME_CONTRACT_STATUS_Approved', 'mean'),
            (             'NAME_CONTRACT_STATUS_Canceled', 'mean'),
            (            'NAME_CONTRACT_STATUS_Completed', 'mean'),
            (               'NAME_CONTRACT_STATUS_Demand', 'mean'),
            ('NAME_CONTRACT_STATUS_Returned to the store', 'mean'),
            (               'NAME_CONTRACT_STATUS_Signed', 'mean'),
            (                  'NAME_CONTRACT_STATUS_XNA', 'mean'),
            (                  'NAME_CONTRACT_STATUS_nan', 'mean')],
           )
# 5.2 Rename multiindex columns
pos_agg.columns = pd.Index(['POS_' + e[0] + "_" + e[1].upper() for e in pos_agg.columns.tolist()])
# 5.3
pos_agg.columns
pos_agg.head()
Index(['POS_MONTHS_BALANCE_MAX', 'POS_MONTHS_BALANCE_MEAN',
       'POS_MONTHS_BALANCE_SIZE', 'POS_SK_DPD_MAX', 'POS_SK_DPD_MEAN',
       'POS_SK_DPD_DEF_MAX', 'POS_SK_DPD_DEF_MEAN',
       'POS_NAME_CONTRACT_STATUS_Active_MEAN',
       'POS_NAME_CONTRACT_STATUS_Amortized debt_MEAN',
       'POS_NAME_CONTRACT_STATUS_Approved_MEAN',
       'POS_NAME_CONTRACT_STATUS_Canceled_MEAN',
       'POS_NAME_CONTRACT_STATUS_Completed_MEAN',
       'POS_NAME_CONTRACT_STATUS_Demand_MEAN',
       'POS_NAME_CONTRACT_STATUS_Returned to the store_MEAN',
       'POS_NAME_CONTRACT_STATUS_Signed_MEAN',
       'POS_NAME_CONTRACT_STATUS_XNA_MEAN',
       'POS_NAME_CONTRACT_STATUS_nan_MEAN'],
      dtype='object')
POS_MONTHS_BALANCE_MAX POS_MONTHS_BALANCE_MEAN POS_MONTHS_BALANCE_SIZE POS_SK_DPD_MAX POS_SK_DPD_MEAN POS_SK_DPD_DEF_MAX POS_SK_DPD_DEF_MEAN POS_NAME_CONTRACT_STATUS_Active_MEAN POS_NAME_CONTRACT_STATUS_Amortized debt_MEAN POS_NAME_CONTRACT_STATUS_Approved_MEAN POS_NAME_CONTRACT_STATUS_Canceled_MEAN POS_NAME_CONTRACT_STATUS_Completed_MEAN POS_NAME_CONTRACT_STATUS_Demand_MEAN POS_NAME_CONTRACT_STATUS_Returned to the store_MEAN POS_NAME_CONTRACT_STATUS_Signed_MEAN POS_NAME_CONTRACT_STATUS_XNA_MEAN POS_NAME_CONTRACT_STATUS_nan_MEAN
SK_ID_CURR
100001 -53 -72.555556 9 7 0.777778 7 0.777778 0.777778 0.0 0.0 0.0 0.222222 0.0 0.0 0.000000 0.0 0
100002 -1 -10.000000 19 0 0.000000 0 0.000000 1.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0
100003 -18 -43.785714 28 0 0.000000 0 0.000000 0.928571 0.0 0.0 0.0 0.071429 0.0 0.0 0.000000 0.0 0
100004 -24 -25.500000 4 0 0.000000 0 0.000000 0.750000 0.0 0.0 0.0 0.250000 0.0 0.0 0.000000 0.0 0
100005 -15 -20.000000 11 0 0.000000 0 0.000000 0.818182 0.0 0.0 0.0 0.090909 0.0 0.0 0.090909 0.0 0
# 5.4 Count pos cash accounts
#     Per client how many entries/rows exist
pos_agg['POS_COUNT'] = pos.groupby('SK_ID_CURR').size()
pos_agg.head()
POS_MONTHS_BALANCE_MAX POS_MONTHS_BALANCE_MEAN POS_MONTHS_BALANCE_SIZE POS_SK_DPD_MAX POS_SK_DPD_MEAN POS_SK_DPD_DEF_MAX POS_SK_DPD_DEF_MEAN POS_NAME_CONTRACT_STATUS_Active_MEAN POS_NAME_CONTRACT_STATUS_Amortized debt_MEAN POS_NAME_CONTRACT_STATUS_Approved_MEAN POS_NAME_CONTRACT_STATUS_Canceled_MEAN POS_NAME_CONTRACT_STATUS_Completed_MEAN POS_NAME_CONTRACT_STATUS_Demand_MEAN POS_NAME_CONTRACT_STATUS_Returned to the store_MEAN POS_NAME_CONTRACT_STATUS_Signed_MEAN POS_NAME_CONTRACT_STATUS_XNA_MEAN POS_NAME_CONTRACT_STATUS_nan_MEAN POS_COUNT
SK_ID_CURR
100001 -53 -72.555556 9 7 0.777778 7 0.777778 0.777778 0.0 0.0 0.0 0.222222 0.0 0.0 0.000000 0.0 0 9
100002 -1 -10.000000 19 0 0.000000 0 0.000000 1.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0 19
100003 -18 -43.785714 28 0 0.000000 0 0.000000 0.928571 0.0 0.0 0.0 0.071429 0.0 0.0 0.000000 0.0 0 28
100004 -24 -25.500000 4 0 0.000000 0 0.000000 0.750000 0.0 0.0 0.0 0.250000 0.0 0.0 0.000000 0.0 0 4
100005 -15 -20.000000 11 0 0.000000 0 0.000000 0.818182 0.0 0.0 0.0 0.090909 0.0 0.0 0.090909 0.0 0 11
# 6.0 Save the results for subsequent use:
pos_agg.to_csv("processed_pos_agg.csv.zip", compression = "zip")   
    
######################