Outlier: definition by Hawkins (1980), an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism

• mechanism: statistical process
• Example: Hadlum v Hadlum (1949), birth of a child happened 349 days after Mr Hadlum left for military service while average human gestation period is 280 days

General application scenarios: The tutorial focus on unsupervised

• Supervised: Training data with normal and abnormal data objects are provided, but classification problem is highly imbalanced
• Semi-supervised: Only training data with normal (or abnormal) classes are provided
• Unsupervised: No training data available

Outlier = Side product of clustering algorithm

• What clustering considered as “noise”
• Outlier detection = Extract noise set from clustering output
• problems: set of outliers can be considered as a cluster; accuracy of outlier detection depends on how good the clustering algorithm captures the structure

Global vs local classification approaches

• Given a set of reference objects and evaluate for the outlierness of each data point
• global: assumed there is only one normal mechanism
• problem: outliers in the reference set may falsify the result
• local: no assumption on the number of normal mechanisms
• problem: how to choose a reference set

Labelling vs scoring

• considers the output of an outlier detection algorithm
• labeling: binary
• scoring: continuous output (e.g., probability for being an outlier)

Model-based approach:

• Apply a model to represent normal data points; outlier are the points that do not fit into the model
• Probabilistic tests
• Deviation-based
• Proximity-based approach: Examine the spatial proximity of each object in the data space, if an object deviates from the proximity of other objects a lot, it is considered as an outlier
• distance-based
• density-based
• Angle based: Examine the spectrum of pairwise angles between a given point and all other points, outliers are points that have a spectrum featuring high fluctuation

# Model based

## Statistical tests

Given a statistical distribution (e.g., Gaussian), compute the parameters (e.g., mean and std dev in case of Gaussian) assuming all data points have been generated by such a distribution. Outliers are the points that have low probability to be generated by the distribution.

Multivariate normal density function:

• Dimensionality of $x$: $d$
• $\Sigma$ is the covariance matrix from the mean $\mu$
• Mahalanobis distance of point $x$ to $\mu$: $\mathrm{MDist}(x,\mu) = (x-\mu)^T\Sigma^{-1}(x-\mu)$
• Mahalanobis distance follows a $\chi^2$-distribution with $d$ degree of freedom
• For example, outlier are those $\mathrm{MDist}(x\mu) > \chi^2(0.975)$ or $3\sigma$
• Suffer from curse of dimensionality: The higher the degree of freedom $d$, the more similar the value of $\mathrm{MDist}(x,\mu)$ for all points

## Depth-based approaches

Look for outlier at the border of data space, independent of the statistical distribution

• Find layers of convex hull to confine the data, then the points on outer layer are the outlier
• Assuming outlier are at the border of the data space
• Usually only efficient in 2D or 3D spaces

## Deviation-based approaches

Assume some general characteristics of data points, and outlier are those that do not fit them

• Example: Outlier are the points that if removed, the variance of the set is minimized
• Model of Arning et al (1996):
• Smoothing factor $\mathrm{SF}(I)$ that computes for each $I\subset DB$, the magnitude of decrease of variance of $DB$ when $I$ is removed from it
• Outlier are the elements of the exception set $E\subset DB$ such that $\mathrm{SF}(E)\ge \mathrm{SF}(I)$ for all $I\subset DB$
• Search space is $O(2^n)$ for $n$ data points, heuristics such as random sampling are applied

# Proximity-based

## Distance-based approaches

Judge a point based on the distance to its neighbors. Assuming that outliers are far apart from their neighbors (or having a less dense neighborhood)

• Knorr and Ng (1997)
• A point $p$ is an outlier if at most $\pi$ percent of all other points have a distance to $p$ less than $\epsilon$:
• Computation: using spatial index structure to compute distance range, using nested loop, using grid of size less than $\epsilon$ to hold points of proximity
• Application: Deriving intensional knowledge
• Find the minimal subsets of attributes in which the points are still an outlier

Outlier scoring based on $k$NN distances:

• $k$NN distance; or
• Aggregate of 1NN, 2NN, …, $k$NN distances
• Algorithms:
• nested loop to compute $k$NN
• partitioning using micro-clusters
• linearization of multi-dimensional data set using space-fill curves
• ORCA (Bay & Schwabacher (2003)), nested loop with randomization and pruning
• pruning: keep only top-$n$ outlier so far by score

## Density-based approaches

Compare the density around a point with the density around its local neighbors. The relative density of a point compared to its neighbors is computed as an outlier score.

Local outlier factor (LOF)

• Breunig et al (1999)
• Motivation: Distance-based outlier detection models have problems with different densities
• multiple clusters of data, with different density, but some point are a bit farther away from the nearest dense cluster but not far enough to differentiate from the less dense cluster
• solution: not compare distance with the global data set, but compare with the nearest subset
• Model:
• Reachability distance from $o$ to $p$: $\mathrm{RD}_k(p, o) = \max(\mathrm{kdist}(o), \mathrm{dist}(p,o))$
• $\mathrm{kdist}(o)$ is the distance of $o$ to its $k$-th nearest neighbor
• Local reachability distance: Inverse of average of the reachability distance of the $k$NNs of $p$:

• Local outlier factor of $p$ is the average ratio of LRDs of neighbors of $p$ to LRD of $p$:

• LOF close to 1: the point is in cluster (homogeneous density around the point and its neighbors); LOF much larger than 1: the point is an outlier
• Variation: Mining for top-$n$ local outlier instead of all outlier
• Compress data points into micro-clusters, then derive upper and lower bounds of reachability distances, LRDs and LOFs for points within a micro-clusters
• Then prune micro-clusters that cannot accommodate top-$n$ outlier, and refine micro-clusters iteratively

Influenced outlierness (INFLO)

• Jin et al (2006)
• Cluster of different densities are not clearly separated (e.g., geometrically overlapped), LOF will have problems
• Take symmetric neighborhood relationship into account
• Density defined as $\mathrm{den}(p) = \dfrac{1}{\mathrm{kdist}(p)}$
• Influenced outlierness: $\mathrm{INFLO}_k(p) = \dfrac{\sum_{o\in kIS(p)}\mathrm{den}(o)}{\lvert kIS(p)\rvert}\frac{1}{\mathrm{den}(p)}$
• Influence space $kIS(p)$ of a point $p$ is the union of the $k$NNs of $p$ and reverse $k$NNs of $p$ (i.e., set of points that have $p$ as its $k$NNs 1)
• Similar to LOF, INFLO much greater than 1 is an outlier and close to 1 is in a cluster

Local outlier correlation integral (LOCI)

• Papadimitriou et al (2003)
• Similar to LOF, but take $\epsilon$-neighborhood instead of $k$NNs as reference set, and test with multiple granularities to get rid of any input parameters
• Model:
• $\epsilon$-neighborhood: $N(p,\epsilon) = \{q \mid \textrm{dist}(p, q) \le \epsilon\}$
• Local density of $p$: Cardinality of $N(p,\epsilon)$
• Average density of the neighborhood: $\mathrm{den}(p,\epsilon,\alpha) = \dfrac{\sum_{q\in N(p,\epsilon)}\lvert N(q,\alpha\epsilon)\rvert}{\lvert N(p,\epsilon)\rvert}$
• Multi-granularity Deviation Factor (MDEF): $\mathrm{MDEF}(p,\epsilon,\alpha) = \dfrac{\mathrm{den}(p,\epsilon,\alpha)-\lvert N(p,\alpha\epsilon)\rvert}{\mathrm{den}(p,\epsilon,\alpha)} = 1-\dfrac{\lvert N(p,\alpha\epsilon)\rvert}{\mathrm{den}(p,\epsilon,\alpha)}$
• MDEF = 0 for points within the cluster and > 0 for outliers

# High-dimensional approaches

ABOD - angle-based outlier degree (Kriegel et al (2008))

• Angles are more stable than distances in high dimensional spaces (c.f. cosine-based similarity measures for text data)
• Object $o$ is an outlier if most other objects are located in similar directions and not an outlier if many other objects are located in varying directions
• if $o$ is outlier, assume that there is a cluster located at a particular angle direction of $o$, hence all points in the cluster are in similar direction
• Model: angle sprectrum weighted by corresponding distances

• $p$ with small ABOD is outlier

Grid-based subspace outlier detection (Aggarwal & Yu (2000))

• Partition data space by lattice grid of equal depth at each dimension (grid size per dimension = $\Phi$)
• Sparsity coefficient for a $k$-dimensional grid cell $C$ that contains $N_C$ data points:

• $k$-dimension grid cell ($k\le d$) is a rectangular composition of some $d$-dimension grid cells
• $% $ means $N_C$ is lower than expected
• Outliers are those points that are located in lower-dimensional cells with negative sparsity coefficient

Subspace outlier degree (SOD)

• Kriegel et al (2009)
• Outliers may be visible only in subspaces of the original data
• Compute the subspace in which the $k$NNs of a point $p$ minimize the variance, then compute the hyperplance $H(kNN(p))$ that is orthogonal to that subspace, and assign the distance of $p$ to the hyperplane as measure of outlierness

# Reference

http://www.cse.ust.hk/~leichen/courses/comp5331/

1. http://www.dblab.ntua.gr/~gtsat/collection/RKNN/RS20P1.PDF

## Bibliographic data

@misc{
title = "Outlier Detection Techniques",
author = "Hans-Peter Kriegel and Peer Kröger and Arthur Zimek",
howpublished = "Tutorial, 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining",
address = "Washington DC",
year = "2010",
url = "http://www.dbs.ifi.lmu.de/~zimek/publications/KDD2010/kdd10-outlier-tutorial.pdf",
}