CatBoost is a library for gradient boosting on decision trees. This paper describes the key algorithmic ideas behind it.

Sec 2

Notation and key concepts used throughout the paper.

Dataset $D=\{(x_k, y_k)\}_{k=1,\dots,n}$ with $x_k = (x^1_k,\dots,x^m_k)$ a vector of $m$ features and $y_k\in\mathbb{R}$ a target value. Gradient boosting learns a model $F:\mathbb{R}^m\to\mathbb{R}$ to minimize the expected loss $\mathbb{E}\,L(y, F(x))$

Gradient boosting: iteratively builds a sequence of approximations $F^t$ in a greedy fashion

  • additive: $F^t = F^{t-1} + \alpha h^t$ with step size $\alpha$
  • $h^t:\mathbb{R}^m\mapsto\mathbb{R}$ is the base predictor chosen from a family of functions $H$, such that

    \[h^t = \arg\min_{h\in H}\mathbb{E}L(y,F^{t-1}(x)+h(x))\]
  • the optimization is done approximately, with Newton’s method or, most commonly, a least-squares fit to the negative gradient:
\[h^t = \arg\min_{h\in H}\mathbb{E}\left(-g^t(x,y)-h(x)\right)^2, \quad\text{where}\quad g^t(x,y) = \left.\frac{\partial L(y,s)}{\partial s}\right\vert_{s=F^{t-1}(x)}\]
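
A minimal sketch of this loop, assuming squared-error loss and depth-1 regression trees from scikit-learn as the base family $H$ (both are illustrative choices, not prescribed by the paper):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_steps=100, alpha=0.1):
    """Greedy additive boosting: each h^t is a least-squares fit to the
    negative gradients -g^t(x_k, y_k) of the loss at the current model."""
    F = np.full(len(y), y.mean())        # F^0: constant prediction
    trees = []
    for _ in range(n_steps):
        # For L(y, s) = (y - s)^2 / 2 the gradient is g = F - y,
        # so the least-squares target -g is just the residual y - F.
        neg_grad = y - F
        h = DecisionTreeRegressor(max_depth=1).fit(X, neg_grad)
        F += alpha * h.predict(X)        # F^t = F^{t-1} + alpha * h^t
        trees.append(h)
    return trees

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
model = gradient_boost(X, y)
```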

Sec 3

A practical difficulty for gradient boosted trees is handling categorical features.

The usual way to handle categorical features in a boosted tree is one-hot encoding. But if a feature has high cardinality (e.g., user ID), this leads to an infeasibly large number of new features (see the sketch below).
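
A tiny sketch of the blow-up, using a hypothetical user-ID column and pandas one-hot encoding:

```python
import pandas as pd

# hypothetical high-cardinality column: one distinct value per user
df = pd.DataFrame({"user_id": [f"u{i}" for i in range(1_000)]})

one_hot = pd.get_dummies(df["user_id"])
print(one_hot.shape)  # (1000, 1000): one new column per distinct category;
                      # with millions of user IDs this becomes infeasible
```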

Other method: Group categories by the target statistics (TS)

  • TS = expected value of the target within that group of categories
  • LightGBM approach: convert categorical features to gradient statistics at each step of gradient boosting
    • providing information for the tree
    • high computation cost to calculate statistics for each categorical value at each step
    • high memory cost to store the categories of each node for each split
  • using TS as a new numerical feature is cost-efficient with minimal information loss

How do we substitute the category $x^i_k$ (the value of feature $i$ for sample $k$) with a numeric TS feature $\hat{x}^i_k$?

  • set $\hat{x}^i_k = \mathbb{E}[y\mid x^i = x^i_k]$, i.e., the expected target value over all samples in the population sharing that category
  • Greedy approach: estimate $\mathbb{E}[y\mid x^i = x^i_k]$ by the average target over training samples with the same category, smoothed toward a prior to handle low-frequency categories (a small sketch follows this list):

    \[\hat{x}^i_k = \frac{\sum_{j=1}^n \mathbb{1}[x^i_j=x^i_k]y_j + ap}{\sum_{j=1}^n \mathbb{1}[x^i_j=x^i_k] + a}\]
    • $p$ the prior estimate, usually the average target value of the dataset
    • $a>0$ the smoothing parameter
    • subject to target leakage: $y_k$ is used to compute its own encoding $\hat{x}^i_k$, so there is a conditional shift between training and test examples, \(\mathbb{E}[\hat{x}^i\mid y=v] \ne \mathbb{E}[\hat{x}^i_k\mid y_k = v]\)
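
A minimal sketch of this greedy TS in plain NumPy ($p$ taken as the dataset mean, $a$ the smoothing parameter); note that it encodes each sample using its own target, which is exactly the leakage discussed above:

```python
import numpy as np

def greedy_ts(cat, y, a=1.0):
    """Replace each categorical value with a smoothed mean-target estimate."""
    p = y.mean()                          # prior: average target of the dataset
    ts = np.empty(len(cat), dtype=float)
    for value in np.unique(cat):
        mask = (cat == value)
        # smoothed average of y over all samples with this category
        ts[mask] = (y[mask].sum() + a * p) / (mask.sum() + a)
    return ts

# toy usage
cat = np.array(["a", "b", "a", "c", "b", "a"])
y   = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
print(greedy_ts(cat, y, a=1.0))
```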

Improvement to mitigate the conditional shift: compute the target statistic for sample $k$ from a subset $D_k$ that excludes $x_k$

\[\hat{x}^i_k = \frac{\sum_{x_j\in D_k} \mathbb{1}[x^i_j=x^i_k]\,y_j + ap}{\sum_{x_j\in D_k} \mathbb{1}[x^i_j=x^i_k] + a}\]
  • Holdout TS: partition the training dataset into two parts, use one part to compute the TS and the other for the actual training. But this reduces the amount of data available both for estimating the TS and for training
  • Leave-one-out TS: use $D_k = D\setminus\{x_k\}$ for the TS (which, as the paper shows, still does not prevent the leakage)
  • Ordered TS (CatBoost):
    • inspired by online learning algorithms (getting samples sequentially in time)
    • in the offline setting, take a random permutation $\sigma$ of the samples as an artificial “time”
    • to compute the TS at $x_k$, use the subset $D_k = \{x_j: \sigma(j)<\sigma(k)\}$ of samples seen earlier in the permuted sequence
    • CatBoost uses a different permutation at different gradient boosting steps (in line with ordered boosting, Sec 4) to reduce the variance of the TS; a one-permutation sketch follows this list
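
A minimal sketch of the ordered TS under a single random permutation (running sums over the "past" only); CatBoost itself varies the permutation across boosting steps:

```python
import numpy as np
from collections import defaultdict

def ordered_ts(cat, y, a=1.0, seed=0):
    """Ordered TS: encode sample k using only samples that precede it
    in a random permutation (the artificial "time")."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(cat))       # sigma: the visiting order
    p = y.mean()                           # prior: average target of the dataset
    sums, counts = defaultdict(float), defaultdict(int)
    ts = np.empty(len(cat), dtype=float)
    for k in perm:                         # walk through the artificial time
        v = cat[k]
        ts[k] = (sums[v] + a * p) / (counts[v] + a)   # uses only the past
        sums[v] += y[k]                    # only now reveal y_k to later samples
        counts[v] += 1
    return ts

# toy usage
cat = np.array(["a", "b", "a", "c", "b", "a"])
y   = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
print(ordered_ts(cat, y))
```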

Sec 4

In practice, $h^t$ is approximated using the same dataset, \(h^t = \arg\min_{h\in H}\frac{1}{n}\sum_{k=1}^n(-g^t(x_k,y_k)-h(x_k))^2\)

  • the base predictor $h^t$ is biased, because the conditional distribution of $g^t(x_k,y_k)\mid x_k$ for a training example $x_k$ differs from the distribution of $g^t(x,y)\mid x$ for a fresh test example $x$
  • causing target leakage: gradients used at each step are estimated using the target values of the same data points the current model $F^{t-1}$ was built on

Prediction shift: Theorem 1 in the paper (a two-feature example with independent Bernoulli features, $y=c_1x^1+c_2x^2$, and two boosting steps with decision stumps) shows that the estimate is unbiased if an independent dataset is used at each gradient step. Otherwise there is a bias of $-\frac{1}{n-1}c_2\left(x^2-\frac12\right)$, where $n$ is the dataset size. A sketch contrasting the two regimes follows.
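
A minimal sketch of the two regimes in Theorem 1: fitting every step's gradients on the same sample (prediction shift) versus drawing a fresh, independent sample for each step. Squared-error loss and scikit-learn stumps are illustrative choices, not the paper's setup:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(draw_data, n_steps=2, alpha=1.0, reuse=True):
    """Two-step boosting; reuse=True fits every step on one fixed sample,
    reuse=False draws an independent sample for each gradient step."""
    X, y = draw_data()
    trees = []
    for _ in range(n_steps):
        if not reuse and trees:
            X, y = draw_data()             # fresh, independent dataset
        # current prediction F^{t-1} on the data used for this step
        F = sum(alpha * t.predict(X) for t in trees) if trees else np.zeros(len(y))
        residual = y - F                   # -g^t for squared-error loss
        trees.append(DecisionTreeRegressor(max_depth=1).fit(X, residual))
    return trees

rng = np.random.default_rng(0)

def draw_data(n=100):
    X = rng.integers(0, 2, size=(n, 2)).astype(float)   # x^1, x^2 ~ Bernoulli(1/2)
    y = 1.0 * X[:, 0] + 1.0 * X[:, 1]                    # y = c1*x^1 + c2*x^2
    return X, y

biased   = boost(draw_data, reuse=True)    # gradients reuse the same targets
unbiased = boost(draw_data, reuse=False)   # independent data per step (Theorem 1)
```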

Bibliographic data

@inproceedings{prokhorenkova2018catboost,
   title = "CatBoost: Unbiased boosting with categorical features",
   author = "Liudmila Prokhorenkova and Gleb Gusev and Aleksandr Vorobev and Anna Veronika Dorogush and Andrey Gulin",
   booktitle = "Proc. NeurIPS",
   year = "2018",
   arXiv = "1706.09516",
}