CatBoost is a library for random forest. This paper describes the key feature behind it.

## Sec 2

Symbols and key concepts are provided.

Dataset $D={(x_i, y_i)}_{i=1,\dots,n}$ with $x_i = (x^i_1,\dots,x^i_m)$ a vector of $m$ features and $y_i\in\mathbb{R}$ a target value. Random forest is a model $F:\mathbb{R}^m\mapsto\mathbb{R}$ to minimize expected loss $\mathbb{E}L(y, F(x))$

Gradient boosting: Builds iteratively a sequence of approximations $F^t$ in greedy fashion

• additive: $F^t = F^{t-1} + \alpha h^t$ with step size $\alpha$
• $h^t:\mathbb{R}^m\mapsto\mathbb{R}$ is the base predictor chosen from a family of functions $H$, such that

$h^t = \arg\min_{h\in H}\mathbb{E}L(y,F^{t-1}(x)+h(x))$
• optimization using Newton’s method, in particular, least-square approximation
\begin{aligned} h^t &= \arg\min_{h\in H}\mathbb{E}(-g^t(x,y)-h(x))^2 \text{with}\quad g^t(x,y) &= \left.\frac{\partial L(y,s)}{\partial s}\right\vert_{s=F^{t-1}(x)} \end{aligned}

## Sec 3

The problem of the gradient boosting tree is when we encountered categorical features.

Handling categorical features in a boosted tree is usually to use one-hot encoding. But if the feature has high cardinality, e.g., user ID, this would lead to an infeasibly large number of new features.

Other method: Group categories by the target statistics (TS)

• TS = expected value of the target within that group of categories
• LightGBM approach: categorical features become gradient statistics at each step of gradient boosting
• providing information for the tree
• high computation cost to calculate statistics for each categorical value at each step
• high memory cost to store the categories of each node for each split
• using TS as a new numerical feature is cost-efficient with minimal information loss

How to substitute category $x^i_k$ of sample $k$ with a numeric feature $\hat{x}^i_k$ of TS?

• make $\hat{x}^i_k = \mathbb{E}[y\mid x^i = x^i_k]$, i.e., the expectation of target of the same category over the entire population
• Greedy approach to estimate $\mathbb{E}[y\mid x^i = x^i_k]$ on low-frequency categories:

$\hat{x}^i_k = \frac{\sum_{j=1}^n \mathbb{1}[x^i_j=x^i_k]y_j + ap}{\sum_{j=1}^n \mathbb{1}[x^i_j=x^i_k] + a}$
• $p$ the prior estimate, usually the average target value of the dataset
• $a>0$ the smoothing parameter
• subject to target leaking, i.e., there is a conditional shift from the desired expectation to the estimated expectation $$\mathbb{E}[\hat{x}^i\mid y=v] \ne \mathbb{E}[\hat{x}^i_k\mid y_k = v]$$

Improvement to mitigate conditional shift: Compute target statistics from a special subset

$\hat{x}^i_k = \frac{\sum_{j\in J} \mathbb{1}[x^i_j=x^i_k]y_j + ap}{\sum_{j\in J} \mathbb{1}[x^i_j=x^i_k] + a}$
• holdout TS: Partition the training dataset into two parts, use one part to calculate the TS and another for actual training. But this made the size of the training set smaller
• Leave-one-out TS: Use $D_k = D\setminus x_k$ for TS
• Ordered TS (CatBoost):
• inspired by online learning algorithms (getting samples sequentially in time)
• in offline setting, random permute the samples $\sigma(j)$ as artificial “time”
• to compute TS at $x_k$, consider the subset $D_k = {x_j: \sigma(j)<\sigma(k)}$ of samples seen before in the permuted sequence
• CatBoost use different permutation in different gradient boosting steps (i.e., order boosting), to reduce the variance

## Sec 4

$h^t$ is usually approximated from the dataset, $$h^t = \arg\min_{h\in H}\frac{1}{n}\sum_{k=1}^n(-g^t(x_k,y_k)-h(x_k))^2$$

• base predictor $h^t$ is biased since the distribution of $g^t(x_k,y_k)\mid x_k\ne$ that of $g^t(x,y)\mid x$
• causing target leakage: gradients used at each step are estimated using the target values of the same data points the current model $F^{t-1}$ was built on

Prediction shift: Theorem 1 in the paper said, unbiased estimate can be achieved if independent datasets are used at each gradient step. Otherwise there is a bias of $-\frac{1}{n=1}c_2(x^2-\frac12)$ for $n$ the dataset size

## Bibliographic data

@inproceedings{
title = "CatBoost: Unbiased boosting with categorical features",
author = "Liudmila Prokhorenkova and Gleb Gusev and Aleksandr Vorobev and Anna Veronika Dorogush and Andrey Gulin",
booktitle = "Proc. NeurIPS",
year = "2018",
arXiv = "1706.09516",
}