Outlier: defined by Hawkins (1980) as an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism

• mechanism: statistical process
• Example: Hadlum v Hadlum (1949), the birth of a child occurred 349 days after Mr Hadlum left for military service, while the average human gestation period is 280 days

General application scenarios (this tutorial focuses on the unsupervised setting):

• Supervised: Training data with normal and abnormal data objects is provided, but the classification problem is highly imbalanced
• Semi-supervised: Only training data with normal (or abnormal) classes are provided
• Unsupervised: No training data available

Outliers as a side product of clustering algorithms

• What clustering considers as “noise”
• Outlier detection = extracting the noise set from the clustering output
• Problems: a set of outliers may itself be considered a cluster; the accuracy of outlier detection depends on how well the clustering algorithm captures the structure of the data

Global vs local classification approaches

• Given a set of reference objects, evaluate the outlierness of each data point against it
• global: assumes there is only one normal mechanism
• problem: outliers in the reference set may falsify the result
• local: no assumption on the number of normal mechanisms
• problem: how to choose a reference set

Labelling vs scoring

• considers the output of an outlier detection algorithm
• labeling: binary
• scoring: continuous output (e.g., probability for being an outlier)

Model-based approach:

• Apply a model to represent the normal data points; outliers are the points that do not fit the model
• Probabilistic tests
• Deviation-based
• Proximity-based approach: Examine the spatial proximity of each object in the data space; if an object deviates strongly from the proximity of the other objects, it is considered an outlier
• distance-based
• density-based
• Angle-based: Examine the spectrum of pairwise angles between a given point and all other points; outliers are points whose spectrum features low fluctuation

# Model based

## Statistical tests

Given a statistical distribution (e.g., Gaussian), compute its parameters (e.g., mean and standard deviation in the Gaussian case) assuming all data points were generated by such a distribution. Outliers are the points that have a low probability of being generated by this distribution.

Multivariate normal density function:

$N(x) = \frac{1}{\sqrt{(2\pi)^d\lvert\Sigma\rvert}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$
• Dimensionality of $$x$$: $$d$$
• $$\Sigma$$ is the covariance matrix of the data with mean $$\mu$$
• Mahalanobis distance of point $$x$$ to $$\mu$$: $$\mathrm{MDist}(x,\mu) = (x-\mu)^T\Sigma^{-1}(x-\mu)$$
• The Mahalanobis distance follows a $$\chi^2$$-distribution with $$d$$ degrees of freedom
• For example, outliers are those with $$\mathrm{MDist}(x,\mu) > \chi^2(0.975)$$ (roughly the $$3\sigma$$ rule in one dimension)
• Suffers from the curse of dimensionality: the higher the number of degrees of freedom $$d$$, the more similar the values of $$\mathrm{MDist}(x,\mu)$$ become for all points
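
This test can be sketched in a few lines; a minimal illustration (the function name and the decision threshold wiring are my own, not from the tutorial) flags points whose Mahalanobis distance exceeds the $$\chi^2(0.975)$$ quantile:

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.975):
    """Flag points whose (squared-form) Mahalanobis distance exceeds
    the chi-square quantile at level `alpha` (d degrees of freedom)."""
    mu = X.mean(axis=0)
    inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    # MDist(x, mu) = (x - mu)^T Sigma^{-1} (x - mu), row-wise
    md = np.einsum('ij,jk,ik->i', diff, inv, diff)
    return md, md > chi2.ppf(alpha, df=X.shape[1])
```

Note that the mean and covariance are themselves estimated from data that may contain the outliers, so extreme points inflate $$\Sigma$$ and can mask each other.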

## Depth-based approaches

Look for outliers at the border of the data space, independent of the statistical distribution

• Find layers of convex hulls confining the data; the points on the outer layers are the outliers
• Assumes outliers are located at the border of the data space
• Usually efficient only in 2D or 3D spaces
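
Convex-hull peeling can be sketched with `scipy.spatial.ConvexHull` (the helper name and the layer cut-off are assumptions for illustration, not part of any published depth-based method):

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_peeling_depth(X, max_layers=3):
    """Assign each point the index of the convex-hull layer it lies on:
    layer 0 is the outermost hull; deeper layers are more 'central'."""
    depth = np.full(len(X), max_layers, dtype=int)
    remaining = np.arange(len(X))
    for layer in range(max_layers):
        if len(remaining) <= X.shape[1] + 1:   # too few points for a hull
            depth[remaining] = layer
            break
        hull = ConvexHull(X[remaining])        # hull of the surviving points
        depth[remaining[hull.vertices]] = layer
        remaining = np.delete(remaining, hull.vertices)
    return depth
```

Points with depth 0 (the outermost layer) would be reported as outliers under the border assumption above.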

## Deviation-based approaches

Assume some general characteristics of the data points; outliers are those that do not fit them

• Example: Outliers are the points whose removal minimizes the variance of the remaining set
• Model of Arning et al (1996):
• Smoothing factor $$\mathrm{SF}(I)$$: for each $$I\subset DB$$, the magnitude of the decrease in variance of $$DB$$ when $$I$$ is removed from it
• Outliers are the elements of the exception set $$E\subset DB$$ such that $$\mathrm{SF}(E)\ge \mathrm{SF}(I)$$ for all $$I\subset DB$$
• The search space is $$O(2^n)$$ for $$n$$ data points, so heuristics such as random sampling are applied
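
The exhaustive search can be illustrated with a brute-force toy (a simplification: the names are illustrative, and Arning et al's actual smoothing factor also weights the variance reduction by the cardinality of the removed set, which is omitted here):

```python
import numpy as np
from itertools import combinations

def exception_set(x, max_size=2):
    """Brute-force Arning-style search: find the subset whose removal
    reduces the variance of the remaining data the most (tiny n only)."""
    base = np.var(x)
    best, best_sf = (), 0.0
    for r in range(1, max_size + 1):
        for subset in combinations(range(len(x)), r):
            rest = np.delete(x, subset)
            sf = base - np.var(rest)      # simplified smoothing factor SF(I)
            if sf > best_sf:
                best, best_sf = subset, sf
    return best, best_sf
```

The exponential number of subsets is exactly why the model needs random-sampling heuristics in practice.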

# Proximity-based

## Distance-based approaches

Judge a point based on the distance to its neighbors, assuming that outliers are far away from their neighbors (i.e., have a sparse neighborhood).

• Knorr and Ng (1997)
• A point $$p$$ is an outlier if at most $$\pi$$ percent of all other points have a distance to $$p$$ less than $$\epsilon$$:
$\textrm{outlier set} = \left\{p \Big\vert \frac{\lvert\{q\in DB\mid\mathrm{dist}(p,q)<\epsilon \}\rvert}{\lvert DB\rvert} \le \pi \right\}$
• Computation: using a spatial index structure for distance range queries; using a nested loop; using a grid with cell size less than $$\epsilon$$ so that nearby points fall into neighboring cells
• Application: Deriving intensional knowledge
• Find the minimal subsets of attributes in which the points are still an outlier
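
The DB($$\epsilon$$, $$\pi$$) definition above translates directly into a brute-force sketch (O(n²) pairwise distances, no index structure; the function name is my own):

```python
import numpy as np

def db_outliers(X, eps, pi):
    """Knorr & Ng style: a point is an outlier if at most a fraction
    `pi` of the data set lies within distance `eps` of it."""
    # full pairwise Euclidean distance matrix (fine for a sketch)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    frac = (d < eps).sum(axis=1) / len(X)   # note: includes the point itself
    return frac <= pi
```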

Outlier scoring based on $$k$$NN distances:

• $$k$$NN distance; or
• Aggregate of 1NN, 2NN, …, $$k$$NN distances
• Algorithms:
• nested loop to compute $$k$$NN
• partitioning using micro-clusters
• linearization of the multi-dimensional data set using space-filling curves
• ORCA (Bay & Schwabacher (2003)), nested loop with randomization and pruning
• pruning: keep only the top-$$n$$ outliers found so far, ranked by score
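
The two scoring variants can be sketched with a brute-force kNN search (no ORCA-style randomization or pruning; names are illustrative):

```python
import numpy as np

def knn_outlier_scores(X, k=5, aggregate=True):
    """Outlier score = distance to the k-th NN, or the mean of the
    1NN..kNN distances when aggregate=True."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # exclude self-distance
    knn = np.sort(d, axis=1)[:, :k]       # k smallest distances per point
    return knn.mean(axis=1) if aggregate else knn[:, -1]
```

Sorting by score descending and keeping the first $$n$$ entries gives the top-$$n$$ outliers that ORCA's pruning maintains incrementally.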

## Density-based approaches

Compare the density around a point with the density around its local neighbors. The relative density of a point compared to its neighbors is computed as an outlier score.

Local outlier factor (LOF)

• Breunig et al (2000)
• Motivation: Distance-based outlier detection models have problems with varying densities
• multiple clusters of different density: some points lie a bit farther away from the nearest dense cluster, yet not far enough to be distinguished from the less dense cluster
• solution: compare distances not against the global data set, but against the local neighborhood
• Model:
• Reachability distance from $$o$$ to $$p$$: $$\mathrm{RD}_k(p, o) = \max(\mathrm{kdist}(o), \mathrm{dist}(p,o))$$
• $$\mathrm{kdist}(o)$$ is the distance of $$o$$ to its $$k$$-th nearest neighbor
• Local reachability density: Inverse of the average reachability distance of the $$k$$NNs of $$p$$:

$\mathrm{LRD}_k(p) = \left(\frac{1}{k} \sum_{o\in k\mathrm{NN}(p)}\mathrm{RD}_k(p,o)\right)^{-1}$
• Local outlier factor of $$p$$ is the average ratio of LRDs of neighbors of $$p$$ to LRD of $$p$$:

$\mathrm{LOF}_k(p) = \frac{1}{k}\sum_{o\in k\mathrm{NN}(p)}\frac{\mathrm{LRD}_k(o)}{\mathrm{LRD}_k(p)}$
• LOF close to 1: the point lies in a cluster (homogeneous density around the point and its neighbors); LOF much larger than 1: the point is an outlier
• Variation: Mining the top-$$n$$ local outliers instead of all outliers
• Compress data points into micro-clusters, then derive upper and lower bounds on the reachability distances, LRDs, and LOFs for points within a micro-cluster
• Then prune micro-clusters that cannot contain top-$$n$$ outliers, and refine the remaining micro-clusters iteratively
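
The LOF formulas above can be implemented directly with brute-force kNN; a minimal sketch (no micro-cluster pruning, function name my own):

```python
import numpy as np

def lof(X, k=5):
    """Local outlier factor per the RD/LRD/LOF definitions above."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    order = np.argsort(d, axis=1)
    knn = order[:, :k]                          # indices of the kNNs of each point
    kdist = d[np.arange(n), order[:, k - 1]]    # distance to the k-th NN
    # reachability distance RD_k(p, o) = max(kdist(o), dist(p, o))
    rd = np.maximum(kdist[knn], d[np.arange(n)[:, None], knn])
    lrd = 1.0 / rd.mean(axis=1)                 # local reachability density
    return lrd[knn].mean(axis=1) / lrd          # LOF_k(p)
```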

Influenced outlierness (INFLO)

• Jin et al (2006)
• When clusters of different densities are not clearly separated (e.g., they overlap geometrically), LOF has problems
• Take symmetric neighborhood relationship into account
• Density defined as $$\mathrm{den}(p) = \dfrac{1}{\mathrm{kdist}(p)}$$
• Influenced outlierness: $$\mathrm{INFLO}_k(p) = \dfrac{\sum_{o\in kIS(p)}\mathrm{den}(o)}{\lvert kIS(p)\rvert}\frac{1}{\mathrm{den}(p)}$$
• Influence space $$kIS(p)$$ of a point $$p$$ is the union of the $$k$$NNs of $$p$$ and the reverse $$k$$NNs of $$p$$ (i.e., the set of points that have $$p$$ among their $$k$$NNs, see [1])
• Similar to LOF, INFLO much greater than 1 is an outlier and close to 1 is in a cluster
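
A sketch of INFLO under the definitions above (brute-force reverse-kNN computation; function and variable names are my own):

```python
import numpy as np

def inflo(X, k=5):
    """INFLO sketch: average density over the k-influence space (kNN
    plus reverse kNN), divided by the point's own density 1/kdist(p)."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    order = np.argsort(d, axis=1)
    knn_sets = [set(row) for row in order[:, :k]]
    den = 1.0 / d[np.arange(n), order[:, k - 1]]   # den(p) = 1 / kdist(p)
    scores = np.empty(n)
    for p in range(n):
        rknn = {q for q in range(n) if p in knn_sets[q]}  # reverse kNNs of p
        idx = np.fromiter(knn_sets[p] | rknn, dtype=int)  # influence space kIS(p)
        scores[p] = den[idx].mean() / den[p]
    return scores
```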

Local outlier correlation integral (LOCI)

• Similar to LOF, but takes the $$\epsilon$$-neighborhood instead of the $$k$$NNs as reference set, and tests with multiple granularities to get rid of input parameters
• Model:
• $$\epsilon$$-neighborhood: $$N(p,\epsilon) = \{q \mid \textrm{dist}(p, q) \le \epsilon\}$$
• Local density of $$p$$: Cardinality of $$N(p,\epsilon)$$
• Average density of the neighborhood: $$\mathrm{den}(p,\epsilon,\alpha) = \dfrac{\sum_{q\in N(p,\epsilon)}\lvert N(q,\alpha\epsilon)\rvert}{\lvert N(p,\epsilon)\rvert}$$
• Multi-granularity Deviation Factor (MDEF): $$\mathrm{MDEF}(p,\epsilon,\alpha) = \dfrac{\mathrm{den}(p,\epsilon,\alpha)-\lvert N(p,\alpha\epsilon)\rvert}{\mathrm{den}(p,\epsilon,\alpha)} = 1-\dfrac{\lvert N(p,\alpha\epsilon)\rvert}{\mathrm{den}(p,\epsilon,\alpha)}$$
• MDEF = 0 for points within the cluster and > 0 for outliers
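
MDEF at a single granularity can be sketched as follows (the full LOCI algorithm additionally sweeps $$\epsilon$$ over many radii and normalizes MDEF by a deviation term; this sketch omits both, and the names are mine):

```python
import numpy as np

def mdef(X, eps, alpha=0.5):
    """MDEF(p, eps, alpha) = 1 - |N(p, alpha*eps)| / den(p, eps, alpha)
    at one fixed granularity."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n_small = (d <= alpha * eps).sum(axis=1)   # |N(q, alpha*eps)|, self included
    big = d <= eps                             # membership matrix of N(p, eps)
    # den(p, eps, alpha): average |N(q, alpha*eps)| over q in N(p, eps)
    den = (big * n_small[None, :]).sum(axis=1) / big.sum(axis=1)
    return 1.0 - n_small / den
```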

# High-dimensional approaches

ABOD - angle-based outlier degree (Kriegel et al (2008))

• Angles are more stable than distances in high dimensional spaces (c.f. cosine-based similarity measures for text data)
• Object $$o$$ is an outlier if most other objects are located in similar directions, and not an outlier if many other objects are located in varying directions
• if $$o$$ is an outlier, there is presumably a cluster located in a particular angular direction from $$o$$, hence all points of that cluster lie in a similar direction
• Model: angle spectrum weighted by the corresponding distances

$\mathrm{ABOD}(p) = \underset{x,y\in DB}{\mathrm{Var}}\left(\frac{\langle \vec{xp},\vec{yp}\rangle}{\lVert\vec{xp}\rVert^2 \lVert\vec{yp}\rVert^2}\right)$
• Points $$p$$ with a small ABOD are outliers
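
A direct O(n³) sketch of the ABOD formula above (only practical for small n; the paper's faster sampling-based approximations are omitted, and the function name is my own):

```python
import numpy as np

def abod(X):
    """ABOD(p): variance of distance-weighted inner products between
    all pairs of difference vectors from p; small values mark outliers."""
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        vecs = np.delete(X, i, axis=0) - X[i]   # vectors from p to all others
        terms = []
        for a in range(len(vecs)):
            for b in range(a + 1, len(vecs)):
                na = vecs[a] @ vecs[a]          # squared norms weight the angle
                nb = vecs[b] @ vecs[b]
                terms.append((vecs[a] @ vecs[b]) / (na * nb))
        scores[i] = np.var(terms)
    return scores
```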

Grid-based subspace outlier detection (Aggarwal & Yu (2000))

• Partition the data space by an equi-depth grid in each dimension ($$\Phi$$ ranges per dimension)
• Sparsity coefficient for a $$k$$-dimensional grid cell $$C$$ that contains $$N_C$$ data points:

$S(C) = \frac{N_C - n\Phi^{-k}}{\sqrt{n\Phi^{-k}(1-\Phi^{-k})}}$
• A $$k$$-dimensional grid cell ($$k\le d$$) fixes a range in $$k$$ of the $$d$$ dimensions and spans all values in the remaining $$d-k$$
• $$S(C)<0$$ means $$N_C$$ is lower than expected
• Outliers are those points that are located in lower-dimensional cells with negative sparsity coefficient
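
The sparsity coefficient is a one-liner once the cell counts are known (function name my own); under equi-depth partitioning each range holds a fraction $$1/\Phi$$ of the data, so a $$k$$-dimensional cell is expected to hold $$n\Phi^{-k}$$ points:

```python
import numpy as np

def sparsity_coefficient(n_c, n, phi, k):
    """S(C) for a k-dimensional cell holding n_c of the n points,
    with phi equi-depth ranges per dimension."""
    f = phi ** (-k)                 # expected fraction under independence
    return (n_c - n * f) / np.sqrt(n * f * (1 - f))
```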

Subspace outlier degree (SOD)

• Kriegel et al (2009)
• Outliers may be visible only in subspaces of the original data
• Compute the subspace in which the $$k$$NNs of a point $$p$$ have minimal variance, take the hyperplane $$H(kNN(p))$$ orthogonal to that subspace, and use the distance of $$p$$ to the hyperplane as the measure of outlierness

# Reference

http://www.cse.ust.hk/~leichen/courses/comp5331/

1. http://www.dblab.ntua.gr/~gtsat/collection/RKNN/RS20P1.PDF

## Bibliographic data

@misc{kriegel2010outlier,
title = "Outlier Detection Techniques",
author = "Hans-Peter Kriegel and Peer Kröger and Arthur Zimek",
year = "2010",
howpublished = "Tutorial, 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining",
}