CatBoost is a gradient boosting library. This paper describes the key techniques behind it.
Sec 2
Symbols and key concepts are provided.
Dataset $D=\{(x_k, y_k)\}_{k=1,\dots,n}$ with $x_k = (x^1_k,\dots,x^m_k)$ a vector of $m$ features and $y_k\in\mathbb{R}$ a target value. The goal is to learn a model $F:\mathbb{R}^m\to\mathbb{R}$ that minimizes the expected loss $\mathbb{E}\,L(y, F(x))$
Gradient boosting: builds a sequence of approximations $F^t$ iteratively, in a greedy fashion
 additive: $F^t = F^{t-1} + \alpha h^t$ with step size $\alpha$

$h^t:\mathbb{R}^m\mapsto\mathbb{R}$ is the base predictor chosen from a family of functions $H$, such that
\[h^t = \arg\min_{h\in H}\mathbb{E}\,L(y,F^{t-1}(x)+h(x))\]  optimized in practice using Newton's method, in particular a least-squares approximation
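The boosting loop above can be sketched for squared loss, where the negative gradient is just the residual $y - F^{t-1}(x)$. This is a minimal illustrative toy (the stump fitter, `alpha`, and the dataset are my own assumptions, not from the paper):

```python
# Toy gradient boosting with squared loss L(y, s) = (y - s)^2 / 2,
# so the negative gradient at step t is the residual y - F^{t-1}(x).
# Base predictor family H: depth-1 regression stumps on one feature.

def fit_stump(x, residual):
    """Least-squares fit of a stump h(x) = left_mean if x < t else right_mean."""
    best = None
    for t in sorted(set(x))[1:]:          # candidate split thresholds
        left = [r for xi, r in zip(x, residual) if xi < t]
        right = [r for xi, r in zip(x, residual) if xi >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi < t else rm

def boost(x, y, rounds=50, alpha=0.3):
    """F^t = F^{t-1} + alpha * h^t, with h^t fit to the negative gradient."""
    F = [0.0] * len(x)
    for _ in range(rounds):
        residual = [yi - fi for yi, fi in zip(y, F)]  # -grad of squared loss
        h = fit_stump(x, residual)
        F = [fi + alpha * h(xi) for fi, xi in zip(F, x)]
    return F

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 1.0, 3.0, 3.0]
preds = boost(x, y)   # converges toward the targets on this separable toy set
```

With `alpha=0.3` each round shrinks the residual geometrically on this data, so 50 rounds are more than enough to fit the four points closely.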
Sec 3
A key problem for gradient boosted trees arises when we encounter categorical features.
Categorical features in a boosted tree are usually handled with one-hot encoding. But if a feature has high cardinality, e.g., user ID, this leads to an infeasibly large number of new features.
Other method: Group categories by the target statistics (TS)
 TS = expected value of the target within that group of categories
 LightGBM approach: convert categorical features to gradient statistics at each step of gradient boosting
 providing information for the tree
 high computation cost to calculate statistics for each categorical value at each step
 high memory cost to store the categories of each node for each split
 using TS as a new numerical feature is cost-efficient with minimal information loss
How to substitute the category $x^i_k$ of sample $k$ with a numeric TS feature $\hat{x}^i_k$?
 make $\hat{x}^i_k = \mathbb{E}[y\mid x^i = x^i_k]$, i.e., the expected target of that category over the entire population

Greedy approach to estimate $\mathbb{E}[y\mid x^i = x^i_k]$, with smoothing for low-frequency categories:
\[\hat{x}^i_k = \frac{\sum_{j=1}^n \mathbb{1}[x^i_j=x^i_k]y_j + ap}{\sum_{j=1}^n \mathbb{1}[x^i_j=x^i_k] + a}\] $p$ the prior estimate, usually the average target value of the dataset
 $a>0$ the smoothing parameter
 subject to target leakage, i.e., there is a conditional shift from the desired expectation to the estimated one, \(\mathbb{E}[\hat{x}^i\mid y=v] \ne \mathbb{E}[\hat{x}^i_k\mid y_k = v]\), because $\hat{x}^i_k$ is computed using $y_k$ itself
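The greedy smoothed TS above can be sketched directly from the formula; `greedy_ts` and the toy data are illustrative names, not from the paper:

```python
# Greedy target statistic with smoothing: replace category x_k with
# (sum of y over that category + a*p) / (count + a), where p is the
# prior (global mean target) and a > 0 the smoothing strength.
# Note: computed over the WHOLE dataset, so sample k's own target leaks in.
from collections import defaultdict

def greedy_ts(categories, targets, a=1.0):
    p = sum(targets) / len(targets)     # prior: average target of the dataset
    total = defaultdict(float)
    count = defaultdict(int)
    for c, y in zip(categories, targets):
        total[c] += y
        count[c] += 1
    return [(total[c] + a * p) / (count[c] + a) for c in categories]

cats = ["a", "a", "b", "c"]
ys = [1.0, 0.0, 1.0, 0.0]
ts = greedy_ts(cats, ys)   # -> [0.5, 0.5, 0.75, 0.25] with p = 0.5, a = 1
```

The singleton categories "b" and "c" are pulled toward the prior 0.5 instead of getting the extreme raw averages 1.0 and 0.0, which is the point of the smoothing term.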
Improvement to mitigate conditional shift: Compute target statistics from a special subset
\[\hat{x}^i_k = \frac{\sum_{j\in J} \mathbb{1}[x^i_j=x^i_k]y_j + ap}{\sum_{j\in J} \mathbb{1}[x^i_j=x^i_k] + a}\] for some index subset $J$ excluding $k$
 holdout TS: partition the training dataset into two parts, use one part to calculate the TS and the other for the actual training. But this shrinks the effective training set
 Leave-one-out TS: use $D_k = D\setminus\{x_k\}$ for the TS
 Ordered TS (CatBoost):
 inspired by online learning algorithms (getting samples sequentially in time)
 in the offline setting, randomly permute the samples and treat the permutation $\sigma$ as artificial “time”
 to compute the TS at $x_k$, consider the subset $D_k = \{x_j: \sigma(j)<\sigma(k)\}$ of samples seen before $x_k$ in the permuted sequence
 CatBoost uses different permutations at different gradient boosting steps (cf. ordered boosting) to reduce the variance
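The ordered TS can be sketched by walking the permutation once and updating running per-category sums; `ordered_ts` and the fixed prior `p` are illustrative assumptions, not CatBoost's actual implementation:

```python
# Ordered TS: fix a random permutation sigma and compute sample k's
# statistic only from samples appearing EARLIER in the permutation,
# mimicking an online stream. The sample's own target is never used.
import random

def ordered_ts(categories, targets, a=1.0, p=0.5, seed=0):
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)      # sigma: artificial "time"
    total, count = {}, {}
    ts = [0.0] * n
    for k in order:                         # visit samples in permuted order
        c = categories[k]
        s, m = total.get(c, 0.0), count.get(c, 0)
        ts[k] = (s + a * p) / (m + a)       # uses only "past" samples of c
        total[c] = s + targets[k]           # then reveal this sample's target
        count[c] = m + 1
    return ts
```

The first sample of each category (in permutation order) always gets exactly the prior $p$, since no past samples exist for it; later samples get progressively better-informed statistics. This is why averaging over several permutations reduces variance for the early samples.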
Sec 4
$h^t$ is usually approximated from the dataset, \(h^t = \arg\min_{h\in H}\frac{1}{n}\sum_{k=1}^n\left(-g^t(x_k,y_k)-h(x_k)\right)^2\), where $g^t(x,y) = \left.\frac{\partial L(y,s)}{\partial s}\right|_{s=F^{t-1}(x)}$
 base predictor $h^t$ is biased, since the conditional distribution of $g^t(x_k,y_k)\mid x_k$ on a training sample differs from that of $g^t(x,y)\mid x$ on a test sample
 causing target leakage: the gradients used at each step are estimated from the target values of the same data points the current model $F^{t-1}$ was built on
Prediction shift: Theorem 1 in the paper shows that an unbiased estimate can be achieved if an independent dataset is used at each gradient step. Otherwise (reusing the same dataset) the prediction is biased by $-\frac{1}{n-1}c_2\left(x^2-\frac{1}{2}\right)$, for $n$ the dataset size
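The fix motivated by this shift (ordered boosting) can be illustrated with a toy constant base learner: the residual for sample $k$ comes from a model that never saw sample $k$. The running-mean "model" and both function names below are my own illustrative choices, not the paper's tree-based algorithm:

```python
# Ordered boosting idea in miniature: one "model" per prefix of the
# permuted dataset, so sample k's residual uses only samples before k.

def ordered_boosting_residuals(y):
    """r_k = y_k - M_{k-1}(x_k), where M_{k-1} is the mean of y_1..y_{k-1}."""
    residuals, s = [], 0.0
    for k, yk in enumerate(y):
        prefix_mean = s / k if k > 0 else 0.0  # model trained WITHOUT sample k
        residuals.append(yk - prefix_mean)
        s += yk
    return residuals

def plain_residuals(y):
    """Plain boosting residual: the full-data mean leaks y_k into its own residual."""
    m = sum(y) / len(y)
    return [yk - m for yk in y]

# ordered_boosting_residuals([1.0, 2.0, 3.0]) -> [1.0, 1.0, 1.5]
# plain_residuals([1.0, 2.0, 3.0])            -> [-1.0, 0.0, 1.0]
```

In the plain version every residual is computed against a statistic that already contains the sample's own target, which is exactly the leakage Theorem 1 quantifies; the ordered version removes it at the cost of noisier estimates for early samples.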
Bibliographic data
@inproceedings{prokhorenkova2018catboost,
title = "CatBoost: Unbiased boosting with categorical features",
author = "Liudmila Prokhorenkova and Gleb Gusev and Aleksandr Vorobev and Anna Veronika Dorogush and Andrey Gulin",
booktitle = "Proc. NeurIPS",
year = "2018",
arXiv = "1706.09516",
}