AMEX - Default Prediction Kaggle Competition Summary
1 Overview
American Express (AMEX), a famous financial services company in the United States, hosted a data science competition on Kaggle, requiring participants to predict whether cardholders will default in the future based on credit card billing data. Each feature has been anonymized, and AMEX provided explanations for the feature prefixes:
D_* = Delinquency variables
S_* = Spend variables
P_* = Payment variables
B_* = Balance variables
R_* = Risk variables
The following table is a demonstration of the competition data (values are fictional, for reference only):
| customer_ID | S_2 | P_2 | ... | B_2 | D_41 | target |
|---|---|---|---|---|---|---|
| 000002399d6bd597023 | 2017-04-07 | 0.9366 | ... | 0.1243 | 0.2824 | 1 |
| 0000099d6bd597052ca | 2017-03-31 | 0.3466 | ... | 0.5155 | 0.0087 | 0 |
The features 'B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68' are categorical. The goal of the competition is to predict, for each customer_ID, the likelihood of a future default (target = 1; otherwise target = 0). The negative samples have been undersampled at a rate of 5%. The competition has already concluded, and this article summarizes selected publicly available code solutions and discussions to learn from the excellent ideas and implementations in the community.
2 Preparation
Since the data is large relative to the experimental environment Kaggle provides, a fair amount of work focused on optimizing memory usage. For example, AMEX data - integer dtypes - parquet format converts the floating-point columns to integer dtypes and stores the data in the parquet format, effectively reducing memory overhead. Similar data-compression solutions include AMEX-Feather-Dataset. For reference, the raw competition files are:
60M sample_submission.csv
32G test_data.csv
16G train_data.csv
30M train_labels.csv
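A rough sketch of this kind of dtype compression, illustrating the idea rather than reproducing the exact recipe of the notebooks above (which go further and map the floats to integer dtypes), might look like the following:

```python
import pandas as pd

def compress(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast wide numeric dtypes to shrink the memory footprint."""
    for col in df.columns:
        if col in ('customer_ID', 'S_2'):               # keep the ID and billing date as-is
            continue
        if df[col].dtype == 'float64':
            df[col] = df[col].astype('float32')         # halves the size of float columns
        elif df[col].dtype == 'int64':
            df[col] = pd.to_numeric(df[col], downcast='integer')
    return df

# Read the 16 GB training CSV in chunks, compress each chunk, and write the
# much smaller result to parquet for fast reloading later.
chunks = (compress(chunk) for chunk in pd.read_csv('train_data.csv', chunksize=500_000))
pd.concat(chunks).to_parquet('train_data.parquet')
```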
Meanwhile, the evaluation metric for this competition is a custom one: the average of the normalized Gini coefficient and the default rate captured in the top 4% of predictions, i.e. M = 0.5 * (G + D), with non-defaulting customers weighted by 20 to compensate for the 5% undersampling. Many solutions reference the code from Amex Competition Metric (Python) and Metric without DF to evaluate model performance.
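For reference, here is a condensed sketch of the metric in Python, following the structure of the public reference implementation linked above. It assumes y_true is a DataFrame with a target column and y_pred a DataFrame with a prediction column, aligned row by row:

```python
import pandas as pd

def amex_metric(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
    """M = 0.5 * (G + D): normalized weighted Gini plus default rate captured at 4%."""

    def top_four_percent_captured(y_true, y_pred):
        df = pd.concat([y_true, y_pred], axis=1).sort_values('prediction', ascending=False)
        df['weight'] = df['target'].apply(lambda x: 20 if x == 0 else 1)  # undo 5% undersampling
        cutoff = int(0.04 * df['weight'].sum())
        df_top = df.loc[df['weight'].cumsum() <= cutoff]
        return (df_top['target'] == 1).sum() / (df['target'] == 1).sum()

    def weighted_gini(y_true, y_pred):
        df = pd.concat([y_true, y_pred], axis=1).sort_values('prediction', ascending=False)
        df['weight'] = df['target'].apply(lambda x: 20 if x == 0 else 1)
        df['random'] = (df['weight'] / df['weight'].sum()).cumsum()
        total_pos = (df['target'] * df['weight']).sum()
        df['lorentz'] = (df['target'] * df['weight']).cumsum() / total_pos
        return ((df['lorentz'] - df['random']) * df['weight']).sum()

    def normalized_weighted_gini(y_true, y_pred):
        y_true_as_pred = y_true.rename(columns={'target': 'prediction'})
        return weighted_gini(y_true, y_pred) / weighted_gini(y_true, y_true_as_pred)

    g = normalized_weighted_gini(y_true, y_pred)
    d = top_four_percent_captured(y_true, y_pred)
    return 0.5 * (g + d)
```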
3 Exploratory Data Analysis
Before modeling the data, it is crucial to gain a thorough understanding of it. Exploratory Data Analysis (EDA) forms the foundation for many subsequent tasks in the AMEX competition. Kagglers' focus during this stage includes the following checks (a small sketch of a few of them appears after the list):
- Checking for missing values
- Identifying duplicate records
- Analyzing the distribution of labels
- Examining the number of credit card bills per customer and their billing dates
- Exploring the distribution of categorical and numerical variables and identifying potential outliers
- Investigating the correlation between features
- Identifying artificial noise
- Comparing feature distributions between the training and testing sets
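A minimal sketch of a few of these checks, assuming the parquet file produced in Section 2, could look like this:

```python
import pandas as pd

train = pd.read_parquet('train_data.parquet')     # file name assumes the conversion above
labels = pd.read_csv('train_labels.csv')

# Missing values and duplicate bills
print(train.isna().mean().sort_values(ascending=False).head(10))   # columns with most NaNs
print(train.duplicated(subset=['customer_ID', 'S_2']).sum())       # duplicated (customer, date) rows

# Label distribution (recall that negatives were undersampled to 5%)
print(labels['target'].value_counts(normalize=True))

# Number of bills per customer and the span of billing dates
bills = train.groupby('customer_ID')['S_2'].agg(['count', 'min', 'max'])
print(bills['count'].describe())
```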
Some highly-rated notebooks related to these analyses are:
- Time Series EDA
- AMEX EDA which makes sense
- American Express EDA
- Understanding NA values in AMEX competition
4 Feature Engineering & Modeling
4.1 Feature Engineering
Since the data consists of credit card bills, with multiple bills per customer, merging the records from different time periods into a single row per customer has been a focal point for many solutions (a sketch follows the list below). Some approaches include:
- For continuous variables, calculating statistics such as the mean, standard deviation, minimum, maximum, and the value of the most recent bill for each feature across all time periods on a per-customer basis, as well as the difference and ratio between the most recent and the initial bill.
- For categorical variables, counting the occurrences of each feature value per customer across all time periods and keeping the value of the most recent bill, then encoding these features according to the model type (or letting the model handle the encoding automatically).
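A minimal sketch of this kind of per-customer aggregation, assuming a DataFrame of bills with the columns described in Section 1, might look like this:

```python
import pandas as pd

# Categorical columns listed in the competition description
CAT_FEATURES = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126',
                'D_63', 'D_64', 'D_66', 'D_68']

def build_customer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse each customer's sequence of bills into a single feature row."""
    df = df.sort_values(['customer_ID', 'S_2'])     # so that 'last' really is the latest bill
    num_cols = [c for c in df.columns
                if c not in CAT_FEATURES + ['customer_ID', 'S_2', 'target']]

    # Per-customer statistics over all bills, plus the value of the most recent bill
    num_agg = df.groupby('customer_ID')[num_cols].agg(['mean', 'std', 'min', 'max', 'last'])
    num_agg.columns = ['_'.join(c) for c in num_agg.columns]

    # Difference between the most recent and the initial bill (a ratio could be added the same way)
    first = df.groupby('customer_ID')[num_cols].first()
    last = df.groupby('customer_ID')[num_cols].last()
    diff = (last - first).add_suffix('_last_minus_first')

    # Categorical features: occurrence counts, distinct values, and the most recent value
    cat_agg = df.groupby('customer_ID')[CAT_FEATURES].agg(['count', 'nunique', 'last'])
    cat_agg.columns = ['_'.join(c) for c in cat_agg.columns]

    return pd.concat([num_agg, diff, cat_agg], axis=1)
```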
Many of the high-scoring notebooks referenced below build their features with this kind of per-customer aggregation.
4.2 Model Design, Training, and Inference
The high-scoring solutions in the AMEX competition predominantly use models such as XGBoost, LightGBM, CatBoost, Transformer, TabNet, and ensembles of these models. Different Kaggle experts have employed various techniques to improve their scores. Let's discuss some high-scoring approaches.
Chris Deotte, a Kaggle Grandmaster working at Nvidia, has contributed several foundational solutions, such as XGBoost Starter, TensorFlow GRU Starter, and TensorFlow Transformer Starter. These serve as good starting points for the community. Chris's final solution, which ranked 15th out of 4875, involved using Transformer with LightGBM knowledge distillation. You can find more details in 15th Place Gold – NN Transformer using LGBM Knowledge Distillation.
- The approach first trained LightGBM on the original data, saving the out-of-fold (OOF) predictions from cross-validation as well as the predictions on the test set, which served as pseudo-labels. These OOF predictions and test-set pseudo-labels were then used as targets to train a Transformer model.
- Training a Transformer model from scratch can be challenging, but with the help of knowledge distillation, the Transformer learned from the insights of LightGBM. Additionally, the large amount of test data with pseudo-labels aided the Transformer in learning. Since Transformer and LightGBM have different architectures, the attention mechanism in the Transformer learned different information, leading to better results during model fusion.
- The Transformer model was then further trained on the original training set using techniques such as nested cross-validation and seed blending.
- Finally, the outputs of LightGBM and the Transformer were combined in a 50%/50% ensemble (see the sketch below).
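The author's training code is not reproduced here, but the data flow of the distillation and pseudo-labeling steps can be sketched roughly as follows; the array names and the blending weight alpha are assumptions for illustration, not the author's actual code:

```python
import numpy as np

def make_student_targets(y_train: np.ndarray,
                         lgbm_oof: np.ndarray,
                         lgbm_test: np.ndarray,
                         alpha: float = 0.5) -> np.ndarray:
    """Targets for the Transformer 'student':
    - on train: hard labels blended with the LightGBM teacher's OOF predictions,
    - on test:  the teacher's predictions used directly as pseudo-labels."""
    train_targets = alpha * y_train + (1.0 - alpha) * lgbm_oof
    return np.concatenate([train_targets, lgbm_test])

def blend(lgbm_pred: np.ndarray, transformer_pred: np.ndarray) -> np.ndarray:
    """Final submission: the 50%/50% ensemble of the two models."""
    return 0.5 * lgbm_pred + 0.5 * transformer_pred
```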
The second-place team also shared their solution in 2nd place solution - team JuneHomes. They mentioned that they would release the source code later. Along with the technical aspects, they also shared several best practices:
- The team used AWS compute resources for collaboration and version management of various processes (e.g., workflow version, feature engineering version, model version).
- They initially attempted manual feature engineering after removing some noise from the data, but the results were limited. Ultimately, they opted for using the statistical features mentioned earlier.
- For feature selection, they relied on LightGBM's feature importance and performed iterative feature selection using permutation importance (sketched after this list), validating the results with the AUC from model cross-validation. The team mentioned trying other feature selection methods but found this approach more stable.
- In terms of model selection, they experimented extensively and settled on LightGBM. They also built separate models for customers with fewer bills and performed ensemble modeling. The team found that stacking had minimal impact, and different seeds and feature engineering orders had little effect on the results.
- They shared their initial project plan, documenting important points and relevant resources throughout the workflow, which helped them execute their plan smoothly and achieve their well-deserved success.
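Their code had not been released at the time of writing, but the permutation-importance idea itself can be sketched as follows, assuming a fitted classifier with predict_proba and a held-out validation fold (the function and variable names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def permutation_importance_auc(model, X_valid: pd.DataFrame, y_valid,
                               n_repeats: int = 3, seed: int = 42) -> dict:
    """Drop in validation AUC when each feature is shuffled; a larger drop means more important."""
    rng = np.random.default_rng(seed)
    base_auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
    drops = {}
    for col in X_valid.columns:
        scores = []
        for _ in range(n_repeats):
            X_perm = X_valid.copy()
            X_perm[col] = rng.permutation(X_perm[col].to_numpy())
            scores.append(roc_auc_score(y_valid, model.predict_proba(X_perm)[:, 1]))
        drops[col] = base_auc - float(np.mean(scores))
    return drops

# Features with near-zero or negative drops are candidates for removal; the model is
# then re-fitted and the change is kept only if cross-validated AUC does not degrade.
```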
The first-place write-up, 1st solution, is relatively vague, as the author did not provide a detailed description, so we won't discuss it further here.
5 Conclusion
In this competition, a great deal of useful information and many practical techniques emerged from the Discussion section. Therefore, this article includes some essential excerpts:
- Speed Up XGB, CatBoost, and LGBM by 20x
- Which is the right feature importance?
- 11th Place Solution (LightGBM with meta features)
- 14th Place Gold Solution
Many Kagglers also shared valuable solutions. Here are some notebooks worth exploring:
- AMEX TabNetClassifier + Feature Eng [0.791]
- KerasTuner - Find the MLP for you!
- AmEx lgbm+optuna baseline
- RAPIDS cudf Feature Engineering + XGB
- Amex LGBM Dart CV 0.7977
- AMEX Rank Ensemble
I hope this write-up proves helpful to you. Feel free to leave any further questions or discussion points in the comments!