CatBoost is a gradient boosting library. This paper describes the key techniques behind it.
Sec 2
Symbols and key concepts are provided.
Dataset $D=\{(x_k, y_k)\}_{k=1,\dots,n}$ with $x_k = (x^1_k,\dots,x^m_k)$ a vector of $m$ features and $y_k\in\mathbb{R}$ a target value. The goal is to learn a model $F:\mathbb{R}^m\to\mathbb{R}$ that minimizes the expected loss $\mathbb{E}\,L(y, F(x))$
Gradient boosting: builds a sequence of approximations $F^t$ iteratively, in a greedy fashion
 additive: $F^t = F^{t-1} + \alpha h^t$ with step size $\alpha$

$h^t:\mathbb{R}^m\mapsto\mathbb{R}$ is the base predictor chosen from a family of functions $H$, such that
\[h^t = \arg\min_{h\in H}\mathbb{E}\,L(y,F^{t-1}(x)+h(x))\]  optimized in practice using Newton's method, in particular a least-squares approximation
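The boosting loop above can be sketched for squared loss, where the negative gradient is just the residual $y - F^{t-1}(x)$. This is a minimal illustrative toy (the stump fitter, `alpha`, and the dataset are my own assumptions, not from the paper):

```python
# Toy gradient boosting with squared loss L(y, s) = (y - s)^2 / 2,
# so the negative gradient at step t is the residual y - F^{t-1}(x).
# Base predictor family H: depth-1 regression stumps on one feature.

def fit_stump(x, residual):
    """Least-squares fit of a stump h(x) = left_mean if x < t else right_mean."""
    best = None
    for t in sorted(set(x))[1:]:          # candidate split thresholds
        left = [r for xi, r in zip(x, residual) if xi < t]
        right = [r for xi, r in zip(x, residual) if xi >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi < t else rm

def boost(x, y, rounds=50, alpha=0.3):
    """F^t = F^{t-1} + alpha * h^t, with h^t fit to the negative gradient."""
    F = [0.0] * len(x)
    for _ in range(rounds):
        residual = [yi - fi for yi, fi in zip(y, F)]  # -grad of squared loss
        h = fit_stump(x, residual)
        F = [fi + alpha * h(xi) for fi, xi in zip(F, x)]
    return F

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 1.0, 3.0, 3.0]
preds = boost(x, y)   # converges toward the targets on this separable toy set
```

With `alpha=0.3` each round shrinks the residual geometrically on this data, so 50 rounds are more than enough to fit the four points closely.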
Sec 3
A key problem for gradient boosted trees arises when we encounter categorical features.
Categorical features in a boosted tree are usually handled with one-hot encoding. But if a feature has high cardinality, e.g., user ID, this leads to an infeasibly large number of new features.
Other method: Group categories by the target statistics (TS)
 TS = expected value of the target within that group of categories
 LightGBM approach: convert categorical features to gradient statistics at each step of gradient boosting
 providing information for the tree
 high computation cost to calculate statistics for each categorical value at each step
 high memory cost to store the categories of each node for each split
 using TS as a new numerical feature is cost-efficient with minimal information loss
How to substitute the category $x^i_k$ of sample $k$ with a numeric TS feature $\hat{x}^i_k$?
 make $\hat{x}^i_k = \mathbb{E}[y\mid x^i = x^i_k]$, i.e., the expected target of that category over the entire population

Greedy approach to estimate $\mathbb{E}[y\mid x^i = x^i_k]$, with smoothing for low-frequency categories:
\[\hat{x}^i_k = \frac{\sum_{j=1}^n \mathbb{1}[x^i_j=x^i_k]y_j + ap}{\sum_{j=1}^n \mathbb{1}[x^i_j=x^i_k] + a}\] $p$ the prior estimate, usually the average target value of the dataset
 $a>0$ the smoothing parameter
 subject to target leakage, i.e., there is a conditional shift from the desired expectation to the estimated one, \(\mathbb{E}[\hat{x}^i\mid y=v] \ne \mathbb{E}[\hat{x}^i_k\mid y_k = v]\), because $\hat{x}^i_k$ is computed using $y_k$ itself
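The greedy smoothed TS above can be sketched directly from the formula; `greedy_ts` and the toy data are illustrative names, not from the paper:

```python
# Greedy target statistic with smoothing: replace category x_k with
# (sum of y over that category + a*p) / (count + a), where p is the
# prior (global mean target) and a > 0 the smoothing strength.
# Note: computed over the WHOLE dataset, so sample k's own target leaks in.
from collections import defaultdict

def greedy_ts(categories, targets, a=1.0):
    p = sum(targets) / len(targets)     # prior: average target of the dataset
    total = defaultdict(float)
    count = defaultdict(int)
    for c, y in zip(categories, targets):
        total[c] += y
        count[c] += 1
    return [(total[c] + a * p) / (count[c] + a) for c in categories]

cats = ["a", "a", "b", "c"]
ys = [1.0, 0.0, 1.0, 0.0]
ts = greedy_ts(cats, ys)   # -> [0.5, 0.5, 0.75, 0.25] with p = 0.5, a = 1
```

The singleton categories "b" and "c" are pulled toward the prior 0.5 instead of getting the extreme raw averages 1.0 and 0.0, which is the point of the smoothing term.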
Improvement to mitigate conditional shift: Compute target statistics from a special subset
\[\hat{x}^i_k = \frac{\sum_{j\in J} \mathbb{1}[x^i_j=x^i_k]y_j + ap}{\sum_{j\in J} \mathbb{1}[x^i_j=x^i_k] + a}\] for some index subset $J$ excluding $k$
 holdout TS: partition the training dataset into two parts, use one part to calculate the TS and the other for the actual training. But this shrinks the effective training set
 Leave-one-out TS: use $D_k = D\setminus\{x_k\}$ for the TS
 Ordered TS (CatBoost):
 inspired by online learning algorithms (getting samples sequentially in time)
 in the offline setting, randomly permute the samples and treat the permutation $\sigma$ as artificial “time”
 to compute the TS at $x_k$, consider the subset $D_k = \{x_j: \sigma(j)<\sigma(k)\}$ of samples seen before $x_k$ in the permuted sequence
 CatBoost uses different permutations at different gradient boosting steps (cf. ordered boosting) to reduce the variance
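The ordered TS can be sketched by walking the permutation once and updating running per-category sums; `ordered_ts` and the fixed prior `p` are illustrative assumptions, not CatBoost's actual implementation:

```python
# Ordered TS: fix a random permutation sigma and compute sample k's
# statistic only from samples appearing EARLIER in the permutation,
# mimicking an online stream. The sample's own target is never used.
import random

def ordered_ts(categories, targets, a=1.0, p=0.5, seed=0):
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)      # sigma: artificial "time"
    total, count = {}, {}
    ts = [0.0] * n
    for k in order:                         # visit samples in permuted order
        c = categories[k]
        s, m = total.get(c, 0.0), count.get(c, 0)
        ts[k] = (s + a * p) / (m + a)       # uses only "past" samples of c
        total[c] = s + targets[k]           # then reveal this sample's target
        count[c] = m + 1
    return ts
```

The first sample of each category (in permutation order) always gets exactly the prior $p$, since no past samples exist for it; later samples get progressively better-informed statistics. This is why averaging over several permutations reduces variance for the early samples.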
Sec 4
$h^t$ is usually approximated from the dataset, \(h^t = \arg\min_{h\in H}\frac{1}{n}\sum_{k=1}^n\left(-g^t(x_k,y_k)-h(x_k)\right)^2\), where $g^t(x,y) = \left.\frac{\partial L(y,s)}{\partial s}\right|_{s=F^{t-1}(x)}$
 base predictor $h^t$ is biased, since the conditional distribution of $g^t(x_k,y_k)\mid x_k$ on a training sample differs from that of $g^t(x,y)\mid x$ on a test sample
 causing target leakage: the gradients used at each step are estimated from the target values of the same data points the current model $F^{t-1}$ was built on
Prediction shift: Theorem 1 in the paper shows that an unbiased estimate can be achieved if an independent dataset is used at each gradient step. Otherwise (reusing the same dataset) the prediction is biased by $-\frac{1}{n-1}c_2\left(x^2-\frac{1}{2}\right)$, for $n$ the dataset size
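The fix motivated by this shift (ordered boosting) can be illustrated with a toy constant base learner: the residual for sample $k$ comes from a model that never saw sample $k$. The running-mean "model" and both function names below are my own illustrative choices, not the paper's tree-based algorithm:

```python
# Ordered boosting idea in miniature: one "model" per prefix of the
# permuted dataset, so sample k's residual uses only samples before k.

def ordered_boosting_residuals(y):
    """r_k = y_k - M_{k-1}(x_k), where M_{k-1} is the mean of y_1..y_{k-1}."""
    residuals, s = [], 0.0
    for k, yk in enumerate(y):
        prefix_mean = s / k if k > 0 else 0.0  # model trained WITHOUT sample k
        residuals.append(yk - prefix_mean)
        s += yk
    return residuals

def plain_residuals(y):
    """Plain boosting residual: the full-data mean leaks y_k into its own residual."""
    m = sum(y) / len(y)
    return [yk - m for yk in y]

# ordered_boosting_residuals([1.0, 2.0, 3.0]) -> [1.0, 1.0, 1.5]
# plain_residuals([1.0, 2.0, 3.0])            -> [-1.0, 0.0, 1.0]
```

In the plain version every residual is computed against a statistic that already contains the sample's own target, which is exactly the leakage Theorem 1 quantifies; the ordered version removes it at the cost of noisier estimates for early samples.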
Bibliographic data
@inproceedings{prokhorenkova2018catboost,
title = "CatBoost: Unbiased boosting with categorical features",
author = "Liudmila Prokhorenkova and Gleb Gusev and Aleksandr Vorobev and Anna Veronika Dorogush and Andrey Gulin",
booktitle = "Proc. NeurIPS",
year = "2018",
arXiv = "1706.09516",
}