Install payments¶

About data¶

It is payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.

Feature explanations¶

SK_ID_PREV : ID of previous credit in Home credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit)
SK_ID_CURR: ID of loan in our sample
NUM_INSTALMENT_VERSION: Version of installment calendar (0 is for credit card) of previous credit. Change of installment version from month to month signifies that some parameter of payment calendar has changed
NUM_INSTALMENT_NUMBER: On which installment we observe payment
DAYS_INSTALMENT: When the installment of previous credit was supposed to be paid (relative to application date of current loan)
DAYS_ENTRY_PAYMENT: When was the installments of previous credit paid actually (relative to application date of current loan)
AMT_INSTALMENT: What was the prescribed installment amount of previous credit on this installment
AMT_PAYMENT: What the client actually paid on previous credit on this installment

# Last amended: 24th October, 2020
# Myfolder: C:\Users\Administrator\OneDrive\Documents\home_credit_default_risk
# Objective: 
#           Solving Kaggle problem: Home Credit Default Risk
#           Processing installment_payments dataset
#
# Data Source: https://www.kaggle.com/c/home-credit-default-risk/data
# Ref: https://www.kaggle.com/jsaguiar/lightgbm-with-simple-features

# 1.0 Libraries
#     (Some of these may not be needed here.)
%reset -f
import numpy as np
import pandas as pd
import gc

# 1.1 Reduce read data size
#     There is a file reducing.py
#      in this folder. A class
#       in it is used to reduce
#        dataframe size
#     (Code modified by me to
#      exclude 'category' dtype)
import reducing

# 1.2 Misc
import warnings
import os
warnings.simplefilter(action='ignore', category=FutureWarning)

# 1.3
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

# 1.4 Display multiple commands outputs from a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# 2.0 Onehot encoding (OHE) function. Uses pd.get_dummies()
#     i) To transform 'object' columns to dummies. 
#    ii) Treat NaN as one of the categories
#   iii) Returns transformed-data and new-columns created

def one_hot_encoder(df, nan_as_category = True):
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df,
                        columns= categorical_columns,
                        dummy_na= nan_as_category       # Treat NaNs as category
                       )
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns

# 3.0 Prepare to read data
pathToData = "C:\\Users\\Administrator\\OneDrive\\Documents\\home_credit_default_risk"
os.chdir(pathToData)

# 2.2 Some constants
num_rows=None                # Implies read all rows
nan_as_category = True       # While transforming 
                             #   'object' columns to dummies

# 3.0 Read previous application data first
ins = pd.read_csv(
                   'installments_payments.csv.zip',
                   nrows = num_rows
                   )

# 3.0.1 Reduce memory usage by appropriately
#       changing data-types per feature:

ins = reducing.Reducer().reduce(ins)

reduced df from 830.4079 MB to 493.0547 MB in 6.13 seconds

# 3.1
ins.shape   # (13605401, 8)
ins.head()

(13605401, 8)

	SK_ID_PREV	SK_ID_CURR	NUM_INSTALMENT_VERSION	NUM_INSTALMENT_NUMBER	DAYS_INSTALMENT	DAYS_ENTRY_PAYMENT	AMT_INSTALMENT	AMT_PAYMENT
0	1054186	161674	1.0	6	-1180.0	-1187.0	6948.359863	6948.360
1	1330831	151639	0.0	34	-2156.0	-2156.0	1716.525024	1716.525
2	2085231	193053	2.0	1	-63.0	-63.0	25425.000000	25425.000
3	2452527	199697	1.0	3	-2418.0	-2426.0	24350.130859	24350.130
4	2714724	167756	1.0	2	-1383.0	-1366.0	2165.040039	2160.585

# 3.2 No object type column
ins.dtypes.value_counts()

float32    3
float64    2
uint32     2
uint16     1
dtype: int64

# 3.3 OHE any object column
ins, cat_cols = one_hot_encoder(ins, nan_as_category= True)

# 3.3.1 This dataset does not have any object feature
cat_cols

[]

# 3.4
ins.shape   # 13605401, 8)
ins.head()

(13605401, 8)

	SK_ID_PREV	SK_ID_CURR	NUM_INSTALMENT_VERSION	NUM_INSTALMENT_NUMBER	DAYS_INSTALMENT	DAYS_ENTRY_PAYMENT	AMT_INSTALMENT	AMT_PAYMENT
0	1054186	161674	1.0	6	-1180.0	-1187.0	6948.359863	6948.360
1	1330831	151639	0.0	34	-2156.0	-2156.0	1716.525024	1716.525
2	2085231	193053	2.0	1	-63.0	-63.0	25425.000000	25425.000
3	2452527	199697	1.0	3	-2418.0	-2426.0	24350.130859	24350.130
4	2714724	167756	1.0	2	-1383.0	-1366.0	2165.040039	2160.585

# 4.0 Percentage and difference paid in each installment (amount paid and installment value)
ins['PAYMENT_PERC'] = ins['AMT_PAYMENT'] / ins['AMT_INSTALMENT']
ins['PAYMENT_DIFF'] = ins['AMT_INSTALMENT'] - ins['AMT_PAYMENT']

# 4.1 Days past due and days before due (no negative values)
ins['DPD'] = ins['DAYS_ENTRY_PAYMENT'] - ins['DAYS_INSTALMENT']
ins['DBD'] = ins['DAYS_INSTALMENT'] - ins['DAYS_ENTRY_PAYMENT']
ins['DPD'] = ins['DPD'].apply(lambda x: x if x > 0 else 0)
ins['DBD'] = ins['DBD'].apply(lambda x: x if x > 0 else 0)

# 4.2 How to perform aggregations?
#     For numeric columns
aggregations = {
                 'NUM_INSTALMENT_VERSION': ['nunique'],
                 'DPD': ['max', 'mean', 'sum'],
                 'DBD': ['max', 'mean', 'sum'],
                 'PAYMENT_PERC': ['max', 'mean', 'sum', 'var'],
                 'PAYMENT_DIFF': ['max', 'mean', 'sum', 'var'],
                 'AMT_INSTALMENT': ['max', 'mean', 'sum'],
                 'AMT_PAYMENT': ['min', 'max', 'mean', 'sum'],
                 'DAYS_ENTRY_PAYMENT': ['max', 'mean', 'sum']
               }

# 4.2.1 For categorical columns
for cat in cat_cols:
    aggregations[cat] = ['mean']  

# 4.2.2
aggregations

{'NUM_INSTALMENT_VERSION': ['nunique'],
 'DPD': ['max', 'mean', 'sum'],
 'DBD': ['max', 'mean', 'sum'],
 'PAYMENT_PERC': ['max', 'mean', 'sum', 'var'],
 'PAYMENT_DIFF': ['max', 'mean', 'sum', 'var'],
 'AMT_INSTALMENT': ['max', 'mean', 'sum'],
 'AMT_PAYMENT': ['min', 'max', 'mean', 'sum'],
 'DAYS_ENTRY_PAYMENT': ['max', 'mean', 'sum']}

# 4.3 Perform aggregation now
grouped = ins.groupby('SK_ID_CURR')
ins_agg= grouped.agg(aggregations)

# 4.4
ins_agg.shape
ins_agg.head()

(339587, 25)

	NUM_INSTALMENT_VERSION	DPD			DBD			PAYMENT_PERC				PAYMENT_DIFF				AMT_INSTALMENT			AMT_PAYMENT				DAYS_ENTRY_PAYMENT
	nunique	max	mean	sum	max	mean	sum	max	mean	sum	var	max	mean	sum	var	max	mean	sum	min	max	mean	sum	max	mean	sum
SK_ID_CURR
100001	2	11.0	1.571429	11.0	36.0	8.857143	62.0	1.0	1.0	7.000000	7.564931e-17	0.000391	0.000077	0.000537	1.975468e-08	17397.900391	5885.132324	4.119593e+04	3951.000	17397.900	5885.132143	41195.925	-1628.0	-2195.000000	-15365.0
100002	2	0.0	0.000000	0.0	31.0	20.421053	388.0	1.0	1.0	18.999999	2.460418e-17	0.001094	0.000428	0.008125	2.602025e-08	53093.746094	11559.247070	2.196257e+05	9251.775	53093.745	11559.247105	219625.695	-49.0	-315.421053	-5993.0
100003	2	0.0	0.000000	0.0	14.0	7.160000	179.0	1.0	1.0	25.000000	3.633997e-16	0.015000	-0.000297	-0.007422	1.157042e-05	560835.375000	64754.585938	1.618865e+06	6662.970	560835.360	64754.586000	1618864.650	-544.0	-1385.320000	-34633.0
100004	2	0.0	0.000000	0.0	11.0	7.666667	23.0	1.0	1.0	3.000000	7.278520e-17	0.000000	-0.000052	-0.000156	8.138021e-09	10573.964844	7096.154785	2.128846e+04	5357.250	10573.965	7096.155000	21288.465	-727.0	-761.666667	-2285.0
100005	2	1.0	0.111111	1.0	37.0	23.666667	213.0	1.0	1.0	9.000000	8.850842e-16	0.000195	0.000078	0.000703	1.235962e-07	17656.244141	6240.205078	5.616184e+04	4813.200	17656.245	6240.205000	56161.845	-470.0	-609.555556	-5486.0

# 4.5 Rename columns
ins_agg.columns = pd.Index(['INSTAL_' + e[0] + "_" + e[1].upper() for e in ins_agg.columns.tolist()])

# 4.6
ins_agg.shape
ins_agg.head()

(339587, 25)

	INSTAL_NUM_INSTALMENT_VERSION_NUNIQUE	INSTAL_DPD_MAX	INSTAL_DPD_MEAN	INSTAL_DPD_SUM	INSTAL_DBD_MAX	INSTAL_DBD_MEAN	INSTAL_DBD_SUM	INSTAL_PAYMENT_PERC_MAX	INSTAL_PAYMENT_PERC_MEAN	INSTAL_PAYMENT_PERC_SUM	INSTAL_PAYMENT_PERC_VAR	INSTAL_PAYMENT_DIFF_MAX	INSTAL_PAYMENT_DIFF_MEAN	INSTAL_PAYMENT_DIFF_SUM	INSTAL_PAYMENT_DIFF_VAR	INSTAL_AMT_INSTALMENT_MAX	INSTAL_AMT_INSTALMENT_MEAN	INSTAL_AMT_INSTALMENT_SUM	INSTAL_AMT_PAYMENT_MIN	INSTAL_AMT_PAYMENT_MAX	INSTAL_AMT_PAYMENT_MEAN	INSTAL_AMT_PAYMENT_SUM	INSTAL_DAYS_ENTRY_PAYMENT_MAX	INSTAL_DAYS_ENTRY_PAYMENT_MEAN	INSTAL_DAYS_ENTRY_PAYMENT_SUM
SK_ID_CURR
100001	2	11.0	1.571429	11.0	36.0	8.857143	62.0	1.0	1.0	7.000000	7.564931e-17	0.000391	0.000077	0.000537	1.975468e-08	17397.900391	5885.132324	4.119593e+04	3951.000	17397.900	5885.132143	41195.925	-1628.0	-2195.000000	-15365.0
100002	2	0.0	0.000000	0.0	31.0	20.421053	388.0	1.0	1.0	18.999999	2.460418e-17	0.001094	0.000428	0.008125	2.602025e-08	53093.746094	11559.247070	2.196257e+05	9251.775	53093.745	11559.247105	219625.695	-49.0	-315.421053	-5993.0
100003	2	0.0	0.000000	0.0	14.0	7.160000	179.0	1.0	1.0	25.000000	3.633997e-16	0.015000	-0.000297	-0.007422	1.157042e-05	560835.375000	64754.585938	1.618865e+06	6662.970	560835.360	64754.586000	1618864.650	-544.0	-1385.320000	-34633.0
100004	2	0.0	0.000000	0.0	11.0	7.666667	23.0	1.0	1.0	3.000000	7.278520e-17	0.000000	-0.000052	-0.000156	8.138021e-09	10573.964844	7096.154785	2.128846e+04	5357.250	10573.965	7096.155000	21288.465	-727.0	-761.666667	-2285.0
100005	2	1.0	0.111111	1.0	37.0	23.666667	213.0	1.0	1.0	9.000000	8.850842e-16	0.000195	0.000078	0.000703	1.235962e-07	17656.244141	6240.205078	5.616184e+04	4813.200	17656.245	6240.205000	56161.845	-470.0	-609.555556	-5486.0

# 4.7 Create one more column. Per client how many installments accounts
ins_agg['INSTAL_COUNT'] = ins.groupby('SK_ID_CURR').size()

# 4.8
ins_agg.shape
ins_agg.head()

(339587, 26)

	INSTAL_NUM_INSTALMENT_VERSION_NUNIQUE	INSTAL_DPD_MAX	INSTAL_DPD_MEAN	INSTAL_DPD_SUM	INSTAL_DBD_MAX	INSTAL_DBD_MEAN	INSTAL_DBD_SUM	INSTAL_PAYMENT_PERC_MAX	INSTAL_PAYMENT_PERC_MEAN	INSTAL_PAYMENT_PERC_SUM	INSTAL_PAYMENT_PERC_VAR	INSTAL_PAYMENT_DIFF_MAX	INSTAL_PAYMENT_DIFF_MEAN	INSTAL_PAYMENT_DIFF_SUM	INSTAL_PAYMENT_DIFF_VAR	INSTAL_AMT_INSTALMENT_MAX	INSTAL_AMT_INSTALMENT_MEAN	INSTAL_AMT_INSTALMENT_SUM	INSTAL_AMT_PAYMENT_MIN	INSTAL_AMT_PAYMENT_MAX	INSTAL_AMT_PAYMENT_MEAN	INSTAL_AMT_PAYMENT_SUM	INSTAL_DAYS_ENTRY_PAYMENT_MAX	INSTAL_DAYS_ENTRY_PAYMENT_MEAN	INSTAL_DAYS_ENTRY_PAYMENT_SUM	INSTAL_COUNT
SK_ID_CURR
100001	2	11.0	1.571429	11.0	36.0	8.857143	62.0	1.0	1.0	7.000000	7.564931e-17	0.000391	0.000077	0.000537	1.975468e-08	17397.900391	5885.132324	4.119593e+04	3951.000	17397.900	5885.132143	41195.925	-1628.0	-2195.000000	-15365.0	7
100002	2	0.0	0.000000	0.0	31.0	20.421053	388.0	1.0	1.0	18.999999	2.460418e-17	0.001094	0.000428	0.008125	2.602025e-08	53093.746094	11559.247070	2.196257e+05	9251.775	53093.745	11559.247105	219625.695	-49.0	-315.421053	-5993.0	19
100003	2	0.0	0.000000	0.0	14.0	7.160000	179.0	1.0	1.0	25.000000	3.633997e-16	0.015000	-0.000297	-0.007422	1.157042e-05	560835.375000	64754.585938	1.618865e+06	6662.970	560835.360	64754.586000	1618864.650	-544.0	-1385.320000	-34633.0	25
100004	2	0.0	0.000000	0.0	11.0	7.666667	23.0	1.0	1.0	3.000000	7.278520e-17	0.000000	-0.000052	-0.000156	8.138021e-09	10573.964844	7096.154785	2.128846e+04	5357.250	10573.965	7096.155000	21288.465	-727.0	-761.666667	-2285.0	3
100005	2	1.0	0.111111	1.0	37.0	23.666667	213.0	1.0	1.0	9.000000	8.850842e-16	0.000195	0.000078	0.000703	1.235962e-07	17656.244141	6240.205078	5.616184e+04	4813.200	17656.245	6240.205000	56161.845	-470.0	-609.555556	-5486.0	9

# 5.0 Save the results for subsequent use:
ins_agg.to_csv("processed_ins_agg.csv.zip", compression = "zip")   

##############

Credit Card balance data Joining all processed data