Install payments

About data

It is payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.

Feature explanations

SK_ID_PREV : ID of previous credit in Home credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit)
SK_ID_CURR: ID of loan in our sample
NUM_INSTALMENT_VERSION: Version of installment calendar (0 is for credit card) of previous credit. Change of installment version from month to month signifies that some parameter of payment calendar has changed
NUM_INSTALMENT_NUMBER: On which installment we observe payment
DAYS_INSTALMENT: When the installment of previous credit was supposed to be paid (relative to application date of current loan)
DAYS_ENTRY_PAYMENT: When was the installments of previous credit paid actually (relative to application date of current loan)
AMT_INSTALMENT: What was the prescribed installment amount of previous credit on this installment
AMT_PAYMENT: What the client actually paid on previous credit on this installment

# Last amended: 24th October, 2020
# Myfolder: C:\Users\Administrator\OneDrive\Documents\home_credit_default_risk
# Objective: 
#           Solving Kaggle problem: Home Credit Default Risk
#           Processing installment_payments dataset
#
# Data Source: https://www.kaggle.com/c/home-credit-default-risk/data
# Ref: https://www.kaggle.com/jsaguiar/lightgbm-with-simple-features
# 1.0 Libraries
#     (Some of these may not be needed here.)
%reset -f
import numpy as np
import pandas as pd
import gc

# 1.1 Reduce read data size
#     There is a file reducing.py
#      in this folder. A class
#       in it is used to reduce
#        dataframe size
#     (Code modified by me to
#      exclude 'category' dtype)
import reducing

# 1.2 Misc
import warnings
import os
warnings.simplefilter(action='ignore', category=FutureWarning)
# 1.3
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
# 1.4 Display multiple commands outputs from a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# 2.0 Onehot encoding (OHE) function. Uses pd.get_dummies()
#     i) To transform 'object' columns to dummies. 
#    ii) Treat NaN as one of the categories
#   iii) Returns transformed-data and new-columns created

def one_hot_encoder(df, nan_as_category = True):
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df,
                        columns= categorical_columns,
                        dummy_na= nan_as_category       # Treat NaNs as category
                       )
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns
# 3.0 Prepare to read data
pathToData = "C:\\Users\\Administrator\\OneDrive\\Documents\\home_credit_default_risk"
os.chdir(pathToData)
# 2.2 Some constants
num_rows=None                # Implies read all rows
nan_as_category = True       # While transforming 
                             #   'object' columns to dummies
# 3.0 Read previous application data first
ins = pd.read_csv(
                   'installments_payments.csv.zip',
                   nrows = num_rows
                   )

# 3.0.1 Reduce memory usage by appropriately
#       changing data-types per feature:

ins = reducing.Reducer().reduce(ins)
reduced df from 830.4079 MB to 493.0547 MB in 6.13 seconds
# 3.1
ins.shape   # (13605401, 8)
ins.head()
(13605401, 8)
SK_ID_PREV SK_ID_CURR NUM_INSTALMENT_VERSION NUM_INSTALMENT_NUMBER DAYS_INSTALMENT DAYS_ENTRY_PAYMENT AMT_INSTALMENT AMT_PAYMENT
0 1054186 161674 1.0 6 -1180.0 -1187.0 6948.359863 6948.360
1 1330831 151639 0.0 34 -2156.0 -2156.0 1716.525024 1716.525
2 2085231 193053 2.0 1 -63.0 -63.0 25425.000000 25425.000
3 2452527 199697 1.0 3 -2418.0 -2426.0 24350.130859 24350.130
4 2714724 167756 1.0 2 -1383.0 -1366.0 2165.040039 2160.585
# 3.2 No object type column
ins.dtypes.value_counts()
float32    3
float64    2
uint32     2
uint16     1
dtype: int64
# 3.3 OHE any object column
ins, cat_cols = one_hot_encoder(ins, nan_as_category= True)
# 3.3.1 This dataset does not have any object feature
cat_cols
[]
# 3.4
ins.shape   # 13605401, 8)
ins.head()
(13605401, 8)
SK_ID_PREV SK_ID_CURR NUM_INSTALMENT_VERSION NUM_INSTALMENT_NUMBER DAYS_INSTALMENT DAYS_ENTRY_PAYMENT AMT_INSTALMENT AMT_PAYMENT
0 1054186 161674 1.0 6 -1180.0 -1187.0 6948.359863 6948.360
1 1330831 151639 0.0 34 -2156.0 -2156.0 1716.525024 1716.525
2 2085231 193053 2.0 1 -63.0 -63.0 25425.000000 25425.000
3 2452527 199697 1.0 3 -2418.0 -2426.0 24350.130859 24350.130
4 2714724 167756 1.0 2 -1383.0 -1366.0 2165.040039 2160.585
# 4.0 Percentage and difference paid in each installment (amount paid and installment value)
ins['PAYMENT_PERC'] = ins['AMT_PAYMENT'] / ins['AMT_INSTALMENT']
ins['PAYMENT_DIFF'] = ins['AMT_INSTALMENT'] - ins['AMT_PAYMENT']
# 4.1 Days past due and days before due (no negative values)
ins['DPD'] = ins['DAYS_ENTRY_PAYMENT'] - ins['DAYS_INSTALMENT']
ins['DBD'] = ins['DAYS_INSTALMENT'] - ins['DAYS_ENTRY_PAYMENT']
ins['DPD'] = ins['DPD'].apply(lambda x: x if x > 0 else 0)
ins['DBD'] = ins['DBD'].apply(lambda x: x if x > 0 else 0)
# 4.2 How to perform aggregations?
#     For numeric columns
aggregations = {
                 'NUM_INSTALMENT_VERSION': ['nunique'],
                 'DPD': ['max', 'mean', 'sum'],
                 'DBD': ['max', 'mean', 'sum'],
                 'PAYMENT_PERC': ['max', 'mean', 'sum', 'var'],
                 'PAYMENT_DIFF': ['max', 'mean', 'sum', 'var'],
                 'AMT_INSTALMENT': ['max', 'mean', 'sum'],
                 'AMT_PAYMENT': ['min', 'max', 'mean', 'sum'],
                 'DAYS_ENTRY_PAYMENT': ['max', 'mean', 'sum']
               }

# 4.2.1 For categorical columns
for cat in cat_cols:
    aggregations[cat] = ['mean']  
# 4.2.2
aggregations
{'NUM_INSTALMENT_VERSION': ['nunique'],
 'DPD': ['max', 'mean', 'sum'],
 'DBD': ['max', 'mean', 'sum'],
 'PAYMENT_PERC': ['max', 'mean', 'sum', 'var'],
 'PAYMENT_DIFF': ['max', 'mean', 'sum', 'var'],
 'AMT_INSTALMENT': ['max', 'mean', 'sum'],
 'AMT_PAYMENT': ['min', 'max', 'mean', 'sum'],
 'DAYS_ENTRY_PAYMENT': ['max', 'mean', 'sum']}
# 4.3 Perform aggregation now
grouped = ins.groupby('SK_ID_CURR')
ins_agg= grouped.agg(aggregations)
# 4.4
ins_agg.shape
ins_agg.head()
(339587, 25)
NUM_INSTALMENT_VERSION DPD DBD PAYMENT_PERC PAYMENT_DIFF AMT_INSTALMENT AMT_PAYMENT DAYS_ENTRY_PAYMENT
nunique max mean sum max mean sum max mean sum var max mean sum var max mean sum min max mean sum max mean sum
SK_ID_CURR
100001 2 11.0 1.571429 11.0 36.0 8.857143 62.0 1.0 1.0 7.000000 7.564931e-17 0.000391 0.000077 0.000537 1.975468e-08 17397.900391 5885.132324 4.119593e+04 3951.000 17397.900 5885.132143 41195.925 -1628.0 -2195.000000 -15365.0
100002 2 0.0 0.000000 0.0 31.0 20.421053 388.0 1.0 1.0 18.999999 2.460418e-17 0.001094 0.000428 0.008125 2.602025e-08 53093.746094 11559.247070 2.196257e+05 9251.775 53093.745 11559.247105 219625.695 -49.0 -315.421053 -5993.0
100003 2 0.0 0.000000 0.0 14.0 7.160000 179.0 1.0 1.0 25.000000 3.633997e-16 0.015000 -0.000297 -0.007422 1.157042e-05 560835.375000 64754.585938 1.618865e+06 6662.970 560835.360 64754.586000 1618864.650 -544.0 -1385.320000 -34633.0
100004 2 0.0 0.000000 0.0 11.0 7.666667 23.0 1.0 1.0 3.000000 7.278520e-17 0.000000 -0.000052 -0.000156 8.138021e-09 10573.964844 7096.154785 2.128846e+04 5357.250 10573.965 7096.155000 21288.465 -727.0 -761.666667 -2285.0
100005 2 1.0 0.111111 1.0 37.0 23.666667 213.0 1.0 1.0 9.000000 8.850842e-16 0.000195 0.000078 0.000703 1.235962e-07 17656.244141 6240.205078 5.616184e+04 4813.200 17656.245 6240.205000 56161.845 -470.0 -609.555556 -5486.0
# 4.5 Rename columns
ins_agg.columns = pd.Index(['INSTAL_' + e[0] + "_" + e[1].upper() for e in ins_agg.columns.tolist()])
# 4.6
ins_agg.shape
ins_agg.head()
(339587, 25)
INSTAL_NUM_INSTALMENT_VERSION_NUNIQUE INSTAL_DPD_MAX INSTAL_DPD_MEAN INSTAL_DPD_SUM INSTAL_DBD_MAX INSTAL_DBD_MEAN INSTAL_DBD_SUM INSTAL_PAYMENT_PERC_MAX INSTAL_PAYMENT_PERC_MEAN INSTAL_PAYMENT_PERC_SUM INSTAL_PAYMENT_PERC_VAR INSTAL_PAYMENT_DIFF_MAX INSTAL_PAYMENT_DIFF_MEAN INSTAL_PAYMENT_DIFF_SUM INSTAL_PAYMENT_DIFF_VAR INSTAL_AMT_INSTALMENT_MAX INSTAL_AMT_INSTALMENT_MEAN INSTAL_AMT_INSTALMENT_SUM INSTAL_AMT_PAYMENT_MIN INSTAL_AMT_PAYMENT_MAX INSTAL_AMT_PAYMENT_MEAN INSTAL_AMT_PAYMENT_SUM INSTAL_DAYS_ENTRY_PAYMENT_MAX INSTAL_DAYS_ENTRY_PAYMENT_MEAN INSTAL_DAYS_ENTRY_PAYMENT_SUM
SK_ID_CURR
100001 2 11.0 1.571429 11.0 36.0 8.857143 62.0 1.0 1.0 7.000000 7.564931e-17 0.000391 0.000077 0.000537 1.975468e-08 17397.900391 5885.132324 4.119593e+04 3951.000 17397.900 5885.132143 41195.925 -1628.0 -2195.000000 -15365.0
100002 2 0.0 0.000000 0.0 31.0 20.421053 388.0 1.0 1.0 18.999999 2.460418e-17 0.001094 0.000428 0.008125 2.602025e-08 53093.746094 11559.247070 2.196257e+05 9251.775 53093.745 11559.247105 219625.695 -49.0 -315.421053 -5993.0
100003 2 0.0 0.000000 0.0 14.0 7.160000 179.0 1.0 1.0 25.000000 3.633997e-16 0.015000 -0.000297 -0.007422 1.157042e-05 560835.375000 64754.585938 1.618865e+06 6662.970 560835.360 64754.586000 1618864.650 -544.0 -1385.320000 -34633.0
100004 2 0.0 0.000000 0.0 11.0 7.666667 23.0 1.0 1.0 3.000000 7.278520e-17 0.000000 -0.000052 -0.000156 8.138021e-09 10573.964844 7096.154785 2.128846e+04 5357.250 10573.965 7096.155000 21288.465 -727.0 -761.666667 -2285.0
100005 2 1.0 0.111111 1.0 37.0 23.666667 213.0 1.0 1.0 9.000000 8.850842e-16 0.000195 0.000078 0.000703 1.235962e-07 17656.244141 6240.205078 5.616184e+04 4813.200 17656.245 6240.205000 56161.845 -470.0 -609.555556 -5486.0
# 4.7 Create one more column. Per client how many installments accounts
ins_agg['INSTAL_COUNT'] = ins.groupby('SK_ID_CURR').size()
# 4.8
ins_agg.shape
ins_agg.head()
(339587, 26)
INSTAL_NUM_INSTALMENT_VERSION_NUNIQUE INSTAL_DPD_MAX INSTAL_DPD_MEAN INSTAL_DPD_SUM INSTAL_DBD_MAX INSTAL_DBD_MEAN INSTAL_DBD_SUM INSTAL_PAYMENT_PERC_MAX INSTAL_PAYMENT_PERC_MEAN INSTAL_PAYMENT_PERC_SUM INSTAL_PAYMENT_PERC_VAR INSTAL_PAYMENT_DIFF_MAX INSTAL_PAYMENT_DIFF_MEAN INSTAL_PAYMENT_DIFF_SUM INSTAL_PAYMENT_DIFF_VAR INSTAL_AMT_INSTALMENT_MAX INSTAL_AMT_INSTALMENT_MEAN INSTAL_AMT_INSTALMENT_SUM INSTAL_AMT_PAYMENT_MIN INSTAL_AMT_PAYMENT_MAX INSTAL_AMT_PAYMENT_MEAN INSTAL_AMT_PAYMENT_SUM INSTAL_DAYS_ENTRY_PAYMENT_MAX INSTAL_DAYS_ENTRY_PAYMENT_MEAN INSTAL_DAYS_ENTRY_PAYMENT_SUM INSTAL_COUNT
SK_ID_CURR
100001 2 11.0 1.571429 11.0 36.0 8.857143 62.0 1.0 1.0 7.000000 7.564931e-17 0.000391 0.000077 0.000537 1.975468e-08 17397.900391 5885.132324 4.119593e+04 3951.000 17397.900 5885.132143 41195.925 -1628.0 -2195.000000 -15365.0 7
100002 2 0.0 0.000000 0.0 31.0 20.421053 388.0 1.0 1.0 18.999999 2.460418e-17 0.001094 0.000428 0.008125 2.602025e-08 53093.746094 11559.247070 2.196257e+05 9251.775 53093.745 11559.247105 219625.695 -49.0 -315.421053 -5993.0 19
100003 2 0.0 0.000000 0.0 14.0 7.160000 179.0 1.0 1.0 25.000000 3.633997e-16 0.015000 -0.000297 -0.007422 1.157042e-05 560835.375000 64754.585938 1.618865e+06 6662.970 560835.360 64754.586000 1618864.650 -544.0 -1385.320000 -34633.0 25
100004 2 0.0 0.000000 0.0 11.0 7.666667 23.0 1.0 1.0 3.000000 7.278520e-17 0.000000 -0.000052 -0.000156 8.138021e-09 10573.964844 7096.154785 2.128846e+04 5357.250 10573.965 7096.155000 21288.465 -727.0 -761.666667 -2285.0 3
100005 2 1.0 0.111111 1.0 37.0 23.666667 213.0 1.0 1.0 9.000000 8.850842e-16 0.000195 0.000078 0.000703 1.235962e-07 17656.244141 6240.205078 5.616184e+04 4813.200 17656.245 6240.205000 56161.845 -470.0 -609.555556 -5486.0 9
# 5.0 Save the results for subsequent use:
ins_agg.to_csv("processed_ins_agg.csv.zip", compression = "zip")   
##############