Outlier: defined by Hawkins (1980) as an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism
 mechanism: statistical process
 Example: Hadlum v. Hadlum (1949): the birth of a child occurred 349 days after Mr. Hadlum left for military service, while the average human gestation period is 280 days
General application scenarios: this tutorial focuses on the unsupervised setting
 Supervised: training data with both normal and abnormal data objects are provided, but the classification problem is highly imbalanced
 Semi-supervised: only training data of the normal (or abnormal) class are provided
 Unsupervised: No training data available
Outlier = side product of a clustering algorithm
 What the clustering algorithm considered as “noise”
 Outlier detection = extract the noise set from the clustering output
 problems: the set of outliers may itself be considered a cluster; the accuracy of outlier detection depends on how well the clustering algorithm captures the structure of the data
Global vs local classification approaches
 Given a set of reference objects, evaluate the outlierness of each data point against it
 global: assumes there is only one normal mechanism
 problem: outliers in the reference set may falsify the result
 local: no assumption on the number of normal mechanisms
 problem: how to choose a reference set
Labelling vs scoring
 considers the output of an outlier detection algorithm
 labeling: binary
 scoring: continuous output (e.g., probability for being an outlier)
Model-based approach:
 Apply a model to represent the normal data points; outliers are the points that do not fit the model
 Probabilistic tests
 Deviation-based
 Proximity-based approach: Examine the spatial proximity of each object in the data space; if an object deviates strongly from the proximity of the other objects, it is considered an outlier
 distance-based
 density-based
 Angle-based: Examine the spectrum of pairwise angles between a given point and all other points; outliers are points whose angle spectrum shows little fluctuation (they see most other points in similar directions)
Model-based
Statistical tests
Given a statistical distribution (e.g., Gaussian), compute the parameters (e.g., mean and standard deviation in the Gaussian case) assuming all data points have been generated by such a distribution. Outliers are the points that have a low probability of being generated by this distribution.
Multivariate normal density function:
\[N(x) = \frac{1}{\sqrt{(2\pi)^d\lvert\Sigma\rvert}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)\] Dimensionality of \(x\): \(d\)
 \(\Sigma\) is the covariance matrix of the data around the mean \(\mu\)
 Mahalanobis distance of point \(x\) to \(\mu\): \(\mathrm{MDist}(x,\mu) = (x-\mu)^T\Sigma^{-1}(x-\mu)\)
 The Mahalanobis distance follows a \(\chi^2\)-distribution with \(d\) degrees of freedom
 For example, outliers are those with \(\mathrm{MDist}(x,\mu) > \chi^2(0.975)\) (roughly the \(3\sigma\) rule)
 Suffers from the curse of dimensionality: the higher the number of degrees of freedom \(d\), the more similar the values of \(\mathrm{MDist}(x,\mu)\) become for all points
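As a rough illustration (my own sketch, not from the tutorial), the Gaussian test above fits in a few lines of NumPy/SciPy; the synthetic data set and the 0.975 quantile are arbitrary choices:

```python
# Sketch: flag points whose Mahalanobis distance to the fitted mean exceeds
# the chi^2 quantile with d degrees of freedom. Data is synthetic.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # 200 inliers, d = 3
X = np.vstack([X, [[8.0, 8.0, 8.0]]])    # one obvious outlier (index 200)

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
mdist = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)   # MDist(x, mu) per point

threshold = chi2.ppf(0.975, df=X.shape[1])
outliers = np.flatnonzero(mdist > threshold)
```

Note that the mean and covariance are estimated from the full data including the outlier, which illustrates the "outliers falsify the reference set" problem mentioned above.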
Depth-based approaches
Look for outliers at the border of the data space, independent of the statistical distribution
 Find layers of convex hulls confining the data; the points on the outer layers are the outliers
 Assumes that outliers lie at the border of the data space
 Usually only efficient in 2D or 3D spaces
Deviation-based approaches
Assume some general characteristics of the data points; outliers are those that do not fit them
 Example: outliers are the points whose removal minimizes the variance of the remaining set
 Model of Arning et al. (1996):
 Smoothing factor \(\mathrm{SF}(I)\): computes, for each \(I\subset DB\), the magnitude of the decrease in variance of \(DB\) when \(I\) is removed from it
 Outliers are the elements of the exception set \(E\subset DB\) such that \(\mathrm{SF}(E)\ge \mathrm{SF}(I)\) for all \(I\subset DB\)
 The search space is \(O(2^n)\) for \(n\) data points, so heuristics such as random sampling are applied
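The exhaustive search can be illustrated by a toy brute-force sketch (my own simplified variant of the smoothing-factor idea, capped at small subset sizes precisely because the full search space is exponential):

```python
# Toy brute-force sketch of a deviation-based exception set: among all subsets
# up to max_size, pick the one whose removal decreases the variance the most.
import numpy as np
from itertools import combinations

def exception_set(X, max_size):
    total_var = X.var()
    best, best_sf = (), 0.0
    for size in range(1, max_size + 1):
        for idx in combinations(range(len(X)), size):
            sf = total_var - np.delete(X, idx).var()   # decrease in variance = smoothing factor
            if sf > best_sf:
                best, best_sf = idx, sf
    return set(best)

X = np.array([1.0, 1.1, 0.9, 1.0, 9.0])
out = exception_set(X, max_size=2)   # the extreme value 9.0 (index 4) is included
```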
Proximity-based
Distance-based approaches
Judge a point based on the distance to its neighbors, assuming that outliers are far from their neighbors (i.e., have a less dense neighborhood)
 Knorr and Ng (1997)
 A point \(p\) is an outlier if at most \(\pi\) percent of all other points have a distance to \(p\) less than \(\epsilon\)
 Computation: use a spatial index structure for distance range queries, a nested loop, or a grid with cells smaller than \(\epsilon\) that hold nearby points
 Application: deriving intensional knowledge
 Find the minimal subsets of attributes in which a point is still an outlier
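A minimal nested-loop sketch of this test (my own illustration; `db_outliers` and the toy data are made up for the example):

```python
# Sketch of the DB(pi, eps) test: p is an outlier if at most a pi fraction of
# the other points lie within distance eps of p. Brute-force nested loop.
import numpy as np

def db_outliers(X, eps, pi):
    n = len(X)
    flags = []
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        inside = np.sum(dist <= eps) - 1        # neighbors within eps, excluding p itself
        flags.append(inside <= pi * (n - 1))
    return np.array(flags)

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
flags = db_outliers(X, eps=1.0, pi=0.1)   # only the far point should be flagged
```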
Outlier scoring based on \(k\)NN distances:
 \(k\)NN distance; or
 Aggregate of 1NN, 2NN, …, \(k\)NN distances
 Algorithms:
 nested loop to compute \(k\)NN
 partitioning using microclusters
 linearization of the multidimensional data set using space-filling curves
 ORCA (Bay & Schwabacher, 2003): nested loop with randomization and pruning
 pruning: maintain the top-\(n\) outliers found so far and discard points whose score cannot exceed the current cutoff
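The plain nested-loop scoring (without ORCA's randomization and pruning) can be sketched as follows; the function name and toy data are my own:

```python
# Sketch: kNN-distance outlier score via the simple nested-loop algorithm;
# the score of a point is its distance to its k-th nearest neighbor.
import numpy as np

def knn_scores(X, k):
    scores = []
    for i in range(len(X)):
        dist = np.sort(np.linalg.norm(X - X[i], axis=1))[1:]  # drop the zero distance to self
        scores.append(dist[k - 1])                            # k-th NN distance
    return np.array(scores)

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [10.0, 10.0]])
top = int(np.argmax(knn_scores(X, k=2)))   # index of the strongest outlier
```

Replacing `dist[k - 1]` with `dist[:k].mean()` would give the aggregate 1NN…kNN variant mentioned above.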
Density-based approaches
Compare the density around a point with the density around its local neighbors. The relative density of a point compared to its neighbors is computed as an outlier score.
Local outlier factor (LOF)
 Breunig et al (1999)
 Motivation: distance-based outlier detection models have problems with data sets of varying density
 multiple clusters with different densities; some points lie a bit farther from the nearest dense cluster, but not far enough to be distinguished from the points of the less dense cluster
 solution: do not compare distances against the global data set, but against the nearest local subset
 Model:
 Reachability distance from \(o\) to \(p\): \(\mathrm{RD}_k(p, o) = \max(\mathrm{kdist}(o), \mathrm{dist}(p,o))\)
 \(\mathrm{kdist}(o)\) is the distance of \(o\) to its \(k\)th nearest neighbor

Local reachability density (LRD): the inverse of the average reachability distance from \(p\) to its \(k\)NNs:
\[\mathrm{LRD}_k(p) = \left(\frac{1}{k} \sum_{o\in k\mathrm{NN}(p)}\mathrm{RD}_k(p,o)\right)^{-1}\] 
Local outlier factor of \(p\) is the average ratio of LRDs of neighbors of \(p\) to LRD of \(p\):
\[\mathrm{LOF}_k(p) = \frac{1}{k}\sum_{o\in k\mathrm{NN}(p)}\frac{\mathrm{LRD}_k(o)}{\mathrm{LRD}_k(p)}\]
 LOF close to 1: the point lies in a cluster (homogeneous density around the point and its neighbors); LOF much larger than 1: the point is an outlier
 Variation: mine for the top-\(n\) local outliers instead of all outliers
 Compress data points into microclusters, then derive upper and lower bounds on the reachability distances, LRDs, and LOFs for the points within each microcluster
 Then prune microclusters that cannot contain top-\(n\) outliers, and refine the remaining microclusters iteratively
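A brute-force sketch of plain LOF following the RD/LRD/LOF definitions above (my own illustration, with no microcluster pruning; the toy data set is made up):

```python
# Sketch of LOF with brute-force kNN search.
import numpy as np

def lof(X, k):
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise distance matrix
    knn = np.argsort(D, axis=1)[:, 1:k + 1]               # k nearest neighbors (self excluded)
    kdist = D[np.arange(n), knn[:, -1]]                   # k-distance of each point

    # LRD_k(p) = 1 / mean over o in kNN(p) of RD_k(p, o) = max(kdist(o), dist(p, o))
    lrd = np.empty(n)
    for p in range(n):
        rd = np.maximum(kdist[knn[p]], D[p, knn[p]])
        lrd[p] = 1.0 / rd.mean()

    # LOF_k(p) = mean over o in kNN(p) of LRD_k(o) / LRD_k(p)
    return np.array([lrd[knn[p]].mean() / lrd[p] for p in range(n)])

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [6.0, 6.0]])
scores = lof(X, k=2)   # cluster points score near 1, the far point much higher
```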
Influenced outlierness (INFLO)
 Jin et al (2006)
 When clusters of different densities are not clearly separated (e.g., they overlap geometrically), LOF has problems
 Takes the symmetric neighborhood relationship into account
 Density defined as \(\mathrm{den}(p) = \dfrac{1}{\mathrm{kdist}(p)}\)
 Influenced outlierness:
\(\mathrm{INFLO}_k(p) = \dfrac{\sum_{o\in kIS(p)}\mathrm{den}(o)}{\lvert kIS(p)\rvert\cdot\mathrm{den}(p)}\)
 Influence space \(kIS(p)\) of a point \(p\): the union of the \(k\)NNs of \(p\) and the reverse \(k\)NNs of \(p\) (i.e., the set of points that have \(p\) among their \(k\)NNs ^{1})
 Similar to LOF: INFLO much greater than 1 indicates an outlier, while a value close to 1 indicates a point inside a cluster
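The definitions above can be sketched by brute force (my own illustration; `inflo` and the toy data are made up for the example):

```python
# Sketch of INFLO with den(p) = 1 / kdist(p) and influence space
# kIS(p) = kNN(p) union reverse-kNN(p).
import numpy as np

def inflo(X, k):
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    knn = [set(np.argsort(D[i])[1:k + 1]) for i in range(n)]       # kNN sets
    den = 1.0 / np.array([np.sort(D[i])[k] for i in range(n)])     # 1 / k-distance

    scores = np.empty(n)
    for p in range(n):
        rknn = {q for q in range(n) if p in knn[q]}   # points with p among their kNNs
        kis = knn[p] | rknn                           # influence space of p
        scores[p] = den[list(kis)].mean() / den[p]
    return scores

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [6.0, 6.0]])
scores = inflo(X, k=2)   # the far point gets a score well above 1
```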
Local outlier correlation integral (LOCI)
 Papadimitriou et al (2003)
 Similar to LOF, but takes the \(\epsilon\)-neighborhood instead of the \(k\)NNs as reference set, and tests at multiple granularities to get rid of any input parameters
 Model:
 \(\epsilon\)-neighborhood: \(N(p,\epsilon) = \{q \mid \mathrm{dist}(p, q) \le \epsilon\}\)
 Local density of \(p\): Cardinality of \(N(p,\epsilon)\)
 Average density of the neighborhood: \(\mathrm{den}(p,\epsilon,\alpha) = \dfrac{\sum_{q\in N(p,\epsilon)}\lvert N(q,\alpha\epsilon)\rvert}{\lvert N(p,\epsilon)\rvert}\)
 Multigranularity Deviation Factor (MDEF):
\(\mathrm{MDEF}(p,\epsilon,\alpha) = \dfrac{\mathrm{den}(p,\epsilon,\alpha)-\lvert N(p,\alpha\epsilon)\rvert}{\mathrm{den}(p,\epsilon,\alpha)} = 1-\dfrac{\lvert N(p,\alpha\epsilon)\rvert}{\mathrm{den}(p,\epsilon,\alpha)}\)
 MDEF = 0 for points within the cluster and > 0 for outliers
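A sketch of the MDEF score for one fixed \((\epsilon,\alpha)\) setting (my own illustration; the full LOCI method additionally sweeps \(\epsilon\) over many granularities, which is omitted here):

```python
# Sketch: MDEF(p, eps, alpha) = 1 - |N(p, alpha*eps)| / den(p, eps, alpha).
import numpy as np

def mdef(X, eps, alpha):
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    n_alpha = (D <= alpha * eps).sum(axis=1)     # |N(p, alpha*eps)|, including p itself
    scores = np.empty(n)
    for p in range(n):
        neigh = np.flatnonzero(D[p] <= eps)      # N(p, eps)
        den = n_alpha[neigh].mean()              # den(p, eps, alpha)
        scores[p] = 1.0 - n_alpha[p] / den
    return scores

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [6.0, 6.0]])
scores = mdef(X, eps=10.0, alpha=0.5)   # the far point gets the largest, positive score
```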
High-dimensional approaches
ABOD: angle-based outlier degree (Kriegel et al. (2008))
 Angles are more stable than distances in high-dimensional spaces (cf. cosine-based similarity measures for text data)
 Object \(o\) is an outlier if most other objects are located in similar directions, and not an outlier if many other objects are located in varying directions
 if \(o\) is an outlier, we can assume there is a cluster located in a particular angular direction from \(o\); hence all points of the cluster lie in a similar direction

Model: angle spectrum weighted by the corresponding distances
\[\mathrm{ABOD}(p) = \underset{x,y\in DB}{\mathrm{Var}}\left(\frac{\langle \vec{xp},\vec{yp}\rangle}{\lVert\vec{xp}\rVert^2 \cdot \lVert\vec{yp}\rVert^2}\right)\] A point \(p\) with a small ABOD value is an outlier
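A brute-force sketch of the formula above (my own illustration; note the \(O(n^3)\) cost, which the original paper mitigates with sampling-based approximations):

```python
# Sketch of ABOD: the variance, over all pairs of other points, of the
# distance-weighted scalar product; small values indicate outliers.
import numpy as np
from itertools import combinations

def abod(X):
    scores = []
    for i in range(len(X)):
        vals = []
        for a, b in combinations([j for j in range(len(X)) if j != i], 2):
            u, v = X[a] - X[i], X[b] - X[i]
            vals.append(np.dot(u, v) / (np.dot(u, u) * np.dot(v, v)))
        scores.append(np.var(vals))
    return np.array(scores)

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [6.0, 6.0]])
lowest = int(np.argmin(abod(X)))   # the far point sees all others in similar directions
```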
Gridbased subspace outlier detection (Aggarwal & Yu (2000))
 Partition the data space by an equi-depth lattice grid (number of ranges per dimension = \(\Phi\), so each range holds a \(1/\Phi\) fraction of the points)

Sparsity coefficient for a \(k\)-dimensional grid cell \(C\) that contains \(N_C\) data points:
\[S(C) = \frac{N_C - n\,\Phi^{-k}}{\sqrt{n\,\Phi^{-k}(1-\Phi^{-k})}}\] A \(k\)-dimensional grid cell (\(k\le d\)) is defined by ranges in \(k\) of the \(d\) dimensions, i.e., it is a rectangular composition of \(d\)-dimensional grid cells
 \(S(C)<0\) means \(N_C\) is lower than the expected count \(n\,\Phi^{-k}\)
 Outliers are those points located in lower-dimensional cells with a strongly negative sparsity coefficient
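The coefficient itself is a one-liner; a small sketch with made-up numbers (assuming, as above, that each equi-depth range holds a \(1/\Phi\) fraction of the points, so a \(k\)-dimensional cell is expected to hold an \(\Phi^{-k}\) fraction):

```python
# Sketch: sparsity coefficient of a k-dimensional cell holding n_c of n points,
# with phi equi-depth ranges per dimension.
import math

def sparsity(n_c, n, phi, k):
    f = phi ** (-k)                          # expected fraction of points in the cell
    return (n_c - n * f) / math.sqrt(n * f * (1.0 - f))

# A 2-d cell expected to hold 1000 * 10**-2 = 10 points, but holding only 2:
s = sparsity(n_c=2, n=1000, phi=10, k=2)     # negative: sparser than expected
```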
Subspace outlier degree (SOD)
 Kriegel et al (2009)
 Outliers may be visible only in subspaces of the original data
 Compute the subspace in which the \(k\)NNs of a point \(p\) have minimal variance, then compute the hyperplane \(H(k\mathrm{NN}(p))\) that is orthogonal to that subspace, and take the distance of \(p\) to the hyperplane as the measure of outlierness
Reference
http://www.cse.ust.hk/~leichen/courses/comp5331/

http://www.dblab.ntua.gr/~gtsat/collection/RKNN/RS20P1.PDF ↩
Bibliographic data
@misc{kriegel2010outlier,
title = "Outlier Detection Techniques",
author = "Hans-Peter Kriegel and Peer Kröger and Arthur Zimek",
howpublished = "Tutorial, 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining",
address = "Washington DC",
year = "2010",
url = "http://www.dbs.ifi.lmu.de/~zimek/publications/KDD2010/kdd10outliertutorial.pdf",
}