Install payments¶
About data¶
It is payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.
Feature explanations¶
SK_ID_PREV : ID of previous credit in Home credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit)
SK_ID_CURR: ID of loan in our sample
NUM_INSTALMENT_VERSION: Version of installment calendar (0 is for credit card) of previous credit. Change of installment version from month to month signifies that some parameter of payment calendar has changed
NUM_INSTALMENT_NUMBER: On which installment we observe payment
DAYS_INSTALMENT: When the installment of previous credit was supposed to be paid (relative to application date of current loan)
DAYS_ENTRY_PAYMENT: When was the installments of previous credit paid actually (relative to application date of current loan)
AMT_INSTALMENT: What was the prescribed installment amount of previous credit on this installment
AMT_PAYMENT: What the client actually paid on previous credit on this installment
# Last amended: 24th October, 2020
# Myfolder: C:\Users\Administrator\OneDrive\Documents\home_credit_default_risk
# Objective:
# Solving Kaggle problem: Home Credit Default Risk
# Processing installment_payments dataset
#
# Data Source: https://www.kaggle.com/c/home-credit-default-risk/data
# Ref: https://www.kaggle.com/jsaguiar/lightgbm-with-simple-features
# 1.0 Libraries
# (Some of these may not be needed here.)
%reset -f
import numpy as np
import pandas as pd
import gc
# 1.1 Reduce read data size
# There is a file reducing.py
# in this folder. A class
# in it is used to reduce
# dataframe size
# (Code modified by me to
# exclude 'category' dtype)
import reducing
# 1.2 Misc
import warnings
import os
warnings.simplefilter(action='ignore', category=FutureWarning)
# 1.3
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
# 1.4 Display multiple commands outputs from a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# 2.0 Onehot encoding (OHE) function. Uses pd.get_dummies()
# i) To transform 'object' columns to dummies.
# ii) Treat NaN as one of the categories
# iii) Returns transformed-data and new-columns created
def one_hot_encoder(df, nan_as_category = True):
original_columns = list(df.columns)
categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
df = pd.get_dummies(df,
columns= categorical_columns,
dummy_na= nan_as_category # Treat NaNs as category
)
new_columns = [c for c in df.columns if c not in original_columns]
return df, new_columns
# 3.0 Prepare to read data
pathToData = "C:\\Users\\Administrator\\OneDrive\\Documents\\home_credit_default_risk"
os.chdir(pathToData)
# 2.2 Some constants
num_rows=None # Implies read all rows
nan_as_category = True # While transforming
# 'object' columns to dummies
# 3.0 Read previous application data first
ins = pd.read_csv(
'installments_payments.csv.zip',
nrows = num_rows
)
# 3.0.1 Reduce memory usage by appropriately
# changing data-types per feature:
ins = reducing.Reducer().reduce(ins)
reduced df from 830.4079 MB to 493.0547 MB in 6.13 seconds
# 3.1
ins.shape # (13605401, 8)
ins.head()
(13605401, 8)
SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT | |
---|---|---|---|---|---|---|---|---|
0 | 1054186 | 161674 | 1.0 | 6 | -1180.0 | -1187.0 | 6948.359863 | 6948.360 |
1 | 1330831 | 151639 | 0.0 | 34 | -2156.0 | -2156.0 | 1716.525024 | 1716.525 |
2 | 2085231 | 193053 | 2.0 | 1 | -63.0 | -63.0 | 25425.000000 | 25425.000 |
3 | 2452527 | 199697 | 1.0 | 3 | -2418.0 | -2426.0 | 24350.130859 | 24350.130 |
4 | 2714724 | 167756 | 1.0 | 2 | -1383.0 | -1366.0 | 2165.040039 | 2160.585 |
# 3.2 No object type column
ins.dtypes.value_counts()
float32 3
float64 2
uint32 2
uint16 1
dtype: int64
# 3.3 OHE any object column
ins, cat_cols = one_hot_encoder(ins, nan_as_category= True)
# 3.3.1 This dataset does not have any object feature
cat_cols
[]
# 3.4
ins.shape # 13605401, 8)
ins.head()
(13605401, 8)
SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT | |
---|---|---|---|---|---|---|---|---|
0 | 1054186 | 161674 | 1.0 | 6 | -1180.0 | -1187.0 | 6948.359863 | 6948.360 |
1 | 1330831 | 151639 | 0.0 | 34 | -2156.0 | -2156.0 | 1716.525024 | 1716.525 |
2 | 2085231 | 193053 | 2.0 | 1 | -63.0 | -63.0 | 25425.000000 | 25425.000 |
3 | 2452527 | 199697 | 1.0 | 3 | -2418.0 | -2426.0 | 24350.130859 | 24350.130 |
4 | 2714724 | 167756 | 1.0 | 2 | -1383.0 | -1366.0 | 2165.040039 | 2160.585 |
# 4.0 Percentage and difference paid in each installment (amount paid and installment value)
ins['PAYMENT_PERC'] = ins['AMT_PAYMENT'] / ins['AMT_INSTALMENT']
ins['PAYMENT_DIFF'] = ins['AMT_INSTALMENT'] - ins['AMT_PAYMENT']
# 4.1 Days past due and days before due (no negative values)
ins['DPD'] = ins['DAYS_ENTRY_PAYMENT'] - ins['DAYS_INSTALMENT']
ins['DBD'] = ins['DAYS_INSTALMENT'] - ins['DAYS_ENTRY_PAYMENT']
ins['DPD'] = ins['DPD'].apply(lambda x: x if x > 0 else 0)
ins['DBD'] = ins['DBD'].apply(lambda x: x if x > 0 else 0)
# 4.2 How to perform aggregations?
# For numeric columns
aggregations = {
'NUM_INSTALMENT_VERSION': ['nunique'],
'DPD': ['max', 'mean', 'sum'],
'DBD': ['max', 'mean', 'sum'],
'PAYMENT_PERC': ['max', 'mean', 'sum', 'var'],
'PAYMENT_DIFF': ['max', 'mean', 'sum', 'var'],
'AMT_INSTALMENT': ['max', 'mean', 'sum'],
'AMT_PAYMENT': ['min', 'max', 'mean', 'sum'],
'DAYS_ENTRY_PAYMENT': ['max', 'mean', 'sum']
}
# 4.2.1 For categorical columns
for cat in cat_cols:
aggregations[cat] = ['mean']
# 4.2.2
aggregations
{'NUM_INSTALMENT_VERSION': ['nunique'],
'DPD': ['max', 'mean', 'sum'],
'DBD': ['max', 'mean', 'sum'],
'PAYMENT_PERC': ['max', 'mean', 'sum', 'var'],
'PAYMENT_DIFF': ['max', 'mean', 'sum', 'var'],
'AMT_INSTALMENT': ['max', 'mean', 'sum'],
'AMT_PAYMENT': ['min', 'max', 'mean', 'sum'],
'DAYS_ENTRY_PAYMENT': ['max', 'mean', 'sum']}
# 4.3 Perform aggregation now
grouped = ins.groupby('SK_ID_CURR')
ins_agg= grouped.agg(aggregations)
# 4.4
ins_agg.shape
ins_agg.head()
(339587, 25)
NUM_INSTALMENT_VERSION | DPD | DBD | PAYMENT_PERC | PAYMENT_DIFF | AMT_INSTALMENT | AMT_PAYMENT | DAYS_ENTRY_PAYMENT | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
nunique | max | mean | sum | max | mean | sum | max | mean | sum | var | max | mean | sum | var | max | mean | sum | min | max | mean | sum | max | mean | sum | |
SK_ID_CURR | |||||||||||||||||||||||||
100001 | 2 | 11.0 | 1.571429 | 11.0 | 36.0 | 8.857143 | 62.0 | 1.0 | 1.0 | 7.000000 | 7.564931e-17 | 0.000391 | 0.000077 | 0.000537 | 1.975468e-08 | 17397.900391 | 5885.132324 | 4.119593e+04 | 3951.000 | 17397.900 | 5885.132143 | 41195.925 | -1628.0 | -2195.000000 | -15365.0 |
100002 | 2 | 0.0 | 0.000000 | 0.0 | 31.0 | 20.421053 | 388.0 | 1.0 | 1.0 | 18.999999 | 2.460418e-17 | 0.001094 | 0.000428 | 0.008125 | 2.602025e-08 | 53093.746094 | 11559.247070 | 2.196257e+05 | 9251.775 | 53093.745 | 11559.247105 | 219625.695 | -49.0 | -315.421053 | -5993.0 |
100003 | 2 | 0.0 | 0.000000 | 0.0 | 14.0 | 7.160000 | 179.0 | 1.0 | 1.0 | 25.000000 | 3.633997e-16 | 0.015000 | -0.000297 | -0.007422 | 1.157042e-05 | 560835.375000 | 64754.585938 | 1.618865e+06 | 6662.970 | 560835.360 | 64754.586000 | 1618864.650 | -544.0 | -1385.320000 | -34633.0 |
100004 | 2 | 0.0 | 0.000000 | 0.0 | 11.0 | 7.666667 | 23.0 | 1.0 | 1.0 | 3.000000 | 7.278520e-17 | 0.000000 | -0.000052 | -0.000156 | 8.138021e-09 | 10573.964844 | 7096.154785 | 2.128846e+04 | 5357.250 | 10573.965 | 7096.155000 | 21288.465 | -727.0 | -761.666667 | -2285.0 |
100005 | 2 | 1.0 | 0.111111 | 1.0 | 37.0 | 23.666667 | 213.0 | 1.0 | 1.0 | 9.000000 | 8.850842e-16 | 0.000195 | 0.000078 | 0.000703 | 1.235962e-07 | 17656.244141 | 6240.205078 | 5.616184e+04 | 4813.200 | 17656.245 | 6240.205000 | 56161.845 | -470.0 | -609.555556 | -5486.0 |
# 4.5 Rename columns
ins_agg.columns = pd.Index(['INSTAL_' + e[0] + "_" + e[1].upper() for e in ins_agg.columns.tolist()])
# 4.6
ins_agg.shape
ins_agg.head()
(339587, 25)
INSTAL_NUM_INSTALMENT_VERSION_NUNIQUE | INSTAL_DPD_MAX | INSTAL_DPD_MEAN | INSTAL_DPD_SUM | INSTAL_DBD_MAX | INSTAL_DBD_MEAN | INSTAL_DBD_SUM | INSTAL_PAYMENT_PERC_MAX | INSTAL_PAYMENT_PERC_MEAN | INSTAL_PAYMENT_PERC_SUM | INSTAL_PAYMENT_PERC_VAR | INSTAL_PAYMENT_DIFF_MAX | INSTAL_PAYMENT_DIFF_MEAN | INSTAL_PAYMENT_DIFF_SUM | INSTAL_PAYMENT_DIFF_VAR | INSTAL_AMT_INSTALMENT_MAX | INSTAL_AMT_INSTALMENT_MEAN | INSTAL_AMT_INSTALMENT_SUM | INSTAL_AMT_PAYMENT_MIN | INSTAL_AMT_PAYMENT_MAX | INSTAL_AMT_PAYMENT_MEAN | INSTAL_AMT_PAYMENT_SUM | INSTAL_DAYS_ENTRY_PAYMENT_MAX | INSTAL_DAYS_ENTRY_PAYMENT_MEAN | INSTAL_DAYS_ENTRY_PAYMENT_SUM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SK_ID_CURR | |||||||||||||||||||||||||
100001 | 2 | 11.0 | 1.571429 | 11.0 | 36.0 | 8.857143 | 62.0 | 1.0 | 1.0 | 7.000000 | 7.564931e-17 | 0.000391 | 0.000077 | 0.000537 | 1.975468e-08 | 17397.900391 | 5885.132324 | 4.119593e+04 | 3951.000 | 17397.900 | 5885.132143 | 41195.925 | -1628.0 | -2195.000000 | -15365.0 |
100002 | 2 | 0.0 | 0.000000 | 0.0 | 31.0 | 20.421053 | 388.0 | 1.0 | 1.0 | 18.999999 | 2.460418e-17 | 0.001094 | 0.000428 | 0.008125 | 2.602025e-08 | 53093.746094 | 11559.247070 | 2.196257e+05 | 9251.775 | 53093.745 | 11559.247105 | 219625.695 | -49.0 | -315.421053 | -5993.0 |
100003 | 2 | 0.0 | 0.000000 | 0.0 | 14.0 | 7.160000 | 179.0 | 1.0 | 1.0 | 25.000000 | 3.633997e-16 | 0.015000 | -0.000297 | -0.007422 | 1.157042e-05 | 560835.375000 | 64754.585938 | 1.618865e+06 | 6662.970 | 560835.360 | 64754.586000 | 1618864.650 | -544.0 | -1385.320000 | -34633.0 |
100004 | 2 | 0.0 | 0.000000 | 0.0 | 11.0 | 7.666667 | 23.0 | 1.0 | 1.0 | 3.000000 | 7.278520e-17 | 0.000000 | -0.000052 | -0.000156 | 8.138021e-09 | 10573.964844 | 7096.154785 | 2.128846e+04 | 5357.250 | 10573.965 | 7096.155000 | 21288.465 | -727.0 | -761.666667 | -2285.0 |
100005 | 2 | 1.0 | 0.111111 | 1.0 | 37.0 | 23.666667 | 213.0 | 1.0 | 1.0 | 9.000000 | 8.850842e-16 | 0.000195 | 0.000078 | 0.000703 | 1.235962e-07 | 17656.244141 | 6240.205078 | 5.616184e+04 | 4813.200 | 17656.245 | 6240.205000 | 56161.845 | -470.0 | -609.555556 | -5486.0 |
# 4.7 Create one more column. Per client how many installments accounts
ins_agg['INSTAL_COUNT'] = ins.groupby('SK_ID_CURR').size()
# 4.8
ins_agg.shape
ins_agg.head()
(339587, 26)
INSTAL_NUM_INSTALMENT_VERSION_NUNIQUE | INSTAL_DPD_MAX | INSTAL_DPD_MEAN | INSTAL_DPD_SUM | INSTAL_DBD_MAX | INSTAL_DBD_MEAN | INSTAL_DBD_SUM | INSTAL_PAYMENT_PERC_MAX | INSTAL_PAYMENT_PERC_MEAN | INSTAL_PAYMENT_PERC_SUM | INSTAL_PAYMENT_PERC_VAR | INSTAL_PAYMENT_DIFF_MAX | INSTAL_PAYMENT_DIFF_MEAN | INSTAL_PAYMENT_DIFF_SUM | INSTAL_PAYMENT_DIFF_VAR | INSTAL_AMT_INSTALMENT_MAX | INSTAL_AMT_INSTALMENT_MEAN | INSTAL_AMT_INSTALMENT_SUM | INSTAL_AMT_PAYMENT_MIN | INSTAL_AMT_PAYMENT_MAX | INSTAL_AMT_PAYMENT_MEAN | INSTAL_AMT_PAYMENT_SUM | INSTAL_DAYS_ENTRY_PAYMENT_MAX | INSTAL_DAYS_ENTRY_PAYMENT_MEAN | INSTAL_DAYS_ENTRY_PAYMENT_SUM | INSTAL_COUNT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SK_ID_CURR | ||||||||||||||||||||||||||
100001 | 2 | 11.0 | 1.571429 | 11.0 | 36.0 | 8.857143 | 62.0 | 1.0 | 1.0 | 7.000000 | 7.564931e-17 | 0.000391 | 0.000077 | 0.000537 | 1.975468e-08 | 17397.900391 | 5885.132324 | 4.119593e+04 | 3951.000 | 17397.900 | 5885.132143 | 41195.925 | -1628.0 | -2195.000000 | -15365.0 | 7 |
100002 | 2 | 0.0 | 0.000000 | 0.0 | 31.0 | 20.421053 | 388.0 | 1.0 | 1.0 | 18.999999 | 2.460418e-17 | 0.001094 | 0.000428 | 0.008125 | 2.602025e-08 | 53093.746094 | 11559.247070 | 2.196257e+05 | 9251.775 | 53093.745 | 11559.247105 | 219625.695 | -49.0 | -315.421053 | -5993.0 | 19 |
100003 | 2 | 0.0 | 0.000000 | 0.0 | 14.0 | 7.160000 | 179.0 | 1.0 | 1.0 | 25.000000 | 3.633997e-16 | 0.015000 | -0.000297 | -0.007422 | 1.157042e-05 | 560835.375000 | 64754.585938 | 1.618865e+06 | 6662.970 | 560835.360 | 64754.586000 | 1618864.650 | -544.0 | -1385.320000 | -34633.0 | 25 |
100004 | 2 | 0.0 | 0.000000 | 0.0 | 11.0 | 7.666667 | 23.0 | 1.0 | 1.0 | 3.000000 | 7.278520e-17 | 0.000000 | -0.000052 | -0.000156 | 8.138021e-09 | 10573.964844 | 7096.154785 | 2.128846e+04 | 5357.250 | 10573.965 | 7096.155000 | 21288.465 | -727.0 | -761.666667 | -2285.0 | 3 |
100005 | 2 | 1.0 | 0.111111 | 1.0 | 37.0 | 23.666667 | 213.0 | 1.0 | 1.0 | 9.000000 | 8.850842e-16 | 0.000195 | 0.000078 | 0.000703 | 1.235962e-07 | 17656.244141 | 6240.205078 | 5.616184e+04 | 4813.200 | 17656.245 | 6240.205000 | 56161.845 | -470.0 | -609.555556 | -5486.0 | 9 |
# 5.0 Save the results for subsequent use:
ins_agg.to_csv("processed_ins_agg.csv.zip", compression = "zip")
##############