Outlier: defined by Hawkins (1980) as an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism

  • mechanism: statistical process
  • Example: Hadlum v Hadlum (1949), the birth of a child occurred 349 days after Mr Hadlum left for military service, while the average human gestation period is 280 days

General application scenarios: the tutorial focuses on the unsupervised setting

  • Supervised: Training data with normal and abnormal data objects are provided, but the classification problem is highly imbalanced
  • Semi-supervised: Only training data with normal (or abnormal) classes are provided
  • Unsupervised: No training data available

Outliers as a side product of clustering algorithms

  • What clustering considers as “noise”
  • Outlier detection = Extract noise set from clustering output
    • problems: the set of outliers may itself be grouped as a cluster; accuracy of outlier detection depends on how well the clustering algorithm captures the structure of the data

Global vs local classification approaches

  • Given a set of reference objects, evaluate the outlierness of each data point with respect to it
  • global: assumes there is only one normal mechanism
    • problem: outliers in the reference set may falsify the result
  • local: no assumption on the number of normal mechanisms
    • problem: how to choose a reference set

Labelling vs scoring

  • concerns the output of an outlier detection algorithm
  • labelling: binary output (outlier or not)
  • scoring: continuous output (e.g., probability for being an outlier)

Model-based approach:

  • Apply a model to represent the normal data points; outliers are the points that do not fit the model
    • Probabilistic tests
    • Deviation-based
  • Proximity-based approach: Examine the spatial proximity of each object in the data space; an object that deviates strongly from the proximity of the other objects is considered an outlier
    • distance-based
    • density-based
  • Angle based: Examine the spectrum of pairwise angles between a given point and all other points; outliers are points whose angle spectrum shows little fluctuation (low variance)

Model based

Statistical tests

Given a statistical distribution (e.g., Gaussian), compute its parameters (e.g., mean and standard deviation in the Gaussian case), assuming all data points have been generated by such a distribution. Outliers are the points that have a low probability of having been generated by this distribution.

Multivariate normal density function:

\[N(x) = \frac{1}{\sqrt{(2\pi)^d\lvert\Sigma\rvert}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)\]
  • Dimensionality of \(x\): \(d\)
  • \(\mu\) is the mean vector and \(\Sigma\) the covariance matrix
  • Mahalanobis distance of point \(x\) to \(\mu\): \(\mathrm{MDist}(x,\mu) = (x-\mu)^T\Sigma^{-1}(x-\mu)\)
    • This Mahalanobis distance follows a \(\chi^2\)-distribution with \(d\) degrees of freedom
    • For example, outliers are those with \(\mathrm{MDist}(x,\mu) > \chi^2(0.975)\) (roughly corresponding to the \(3\sigma\) rule); see the sketch after this list
  • Suffers from the curse of dimensionality: the higher the dimensionality \(d\), the more similar the values of \(\mathrm{MDist}(x,\mu)\) become for all points
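
A minimal sketch of this test in Python (assuming numpy and scipy are available; the 0.975 quantile follows the example above, while the two-dimensional toy data is an arbitrary illustration):

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, quantile=0.975):
    """Flag points whose Mahalanobis distance to the mean exceeds the chi^2 quantile."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    Sigma_inv = np.linalg.inv(Sigma)
    diff = X - mu
    # MDist(x, mu) = (x - mu)^T Sigma^{-1} (x - mu)
    md = np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)
    threshold = chi2.ppf(quantile, df=X.shape[1])
    return md > threshold, md

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[6.0, 6.0]]])  # one planted outlier
    flags, scores = mahalanobis_outliers(X)
    print("flagged indices:", np.where(flags)[0])
```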

Depth-based approaches

Look for outliers at the border of the data space, independent of any statistical distribution

  • Compute nested convex-hull layers around the data; points on the outer layers are the outliers (sketched below)
  • Assumes outliers lie at the border of the data space
  • Usually only efficient in 2D or 3D spaces
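
A minimal 2D sketch of convex-hull peeling, assuming scipy is available; the layer index (0 = outermost) plays the role of the depth here:

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_depths(X):
    """Assign each 2D point the index of the convex-hull layer it lies on (0 = outermost)."""
    depths = np.full(len(X), -1)
    remaining = np.arange(len(X))
    layer = 0
    while len(remaining) >= 3:                 # ConvexHull needs at least 3 points in 2D
        hull = ConvexHull(X[remaining])
        hull_idx = remaining[hull.vertices]    # indices (into X) of this layer's hull vertices
        depths[hull_idx] = layer
        remaining = np.setdiff1d(remaining, hull_idx)
        layer += 1
    depths[remaining] = layer                  # any leftover interior points
    return depths

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 2))
    d = hull_depths(X)
    print("outermost-layer (candidate outlier) indices:", np.where(d == 0)[0])
```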

Deviation-based approaches

Assume some general characteristics of the data points; outliers are those that do not fit them

  • Example: outliers are the points whose removal minimizes the variance of the remaining set
  • Model of Arning et al (1996):
    • Smoothing factor \(\mathrm{SF}(I)\): for each \(I\subset DB\), the decrease in variance of \(DB\) when \(I\) is removed from it
    • Outliers are the elements of the exception set \(E\subset DB\) with \(\mathrm{SF}(E)\ge \mathrm{SF}(I)\) for all \(I\subset DB\) (see the sketch after this list)
    • The search space is \(O(2^n)\) for \(n\) data points, so heuristics such as random sampling are applied
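
A brute-force sketch of this idea under simplifying assumptions: the smoothing factor is taken to be the plain decrease in variance (any cardinality weighting of the original model is dropped), and the exhaustive search is limited to subsets of at most two points:

```python
import numpy as np
from itertools import combinations

def smoothing_factor(X, idx_subset):
    """Decrease in total per-attribute variance of X when the points in idx_subset are removed."""
    mask = np.ones(len(X), dtype=bool)
    mask[list(idx_subset)] = False
    return X.var(axis=0).sum() - X[mask].var(axis=0).sum()

def exception_set(X, max_size=2):
    """Exhaustively search subsets up to max_size points for the largest smoothing factor."""
    best, best_sf = (), -np.inf
    for size in range(1, max_size + 1):
        for subset in combinations(range(len(X)), size):
            sf = smoothing_factor(X, subset)
            if sf > best_sf:
                best, best_sf = subset, sf
    return best, best_sf

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(0, 1, size=(30, 2)), [[8.0, 8.0]]])  # one planted outlier at index 30
    print(exception_set(X))
```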

Proximity-based

Distance-based approaches

Judge a point based on the distance to its neighbors, assuming that outliers are far from their neighbors (i.e., have a less dense neighborhood)

  • Knorr and Ng (1997)
  • A point \(p\) is an outlier if at most \(\pi\) percent of all other points have a distance to \(p\) less than \(\epsilon\):
\[\textrm{outlier set} = \left\{p \Big\vert \frac{\lvert\{q\in DB\mid\mathrm{dist}(p,q)<\epsilon \}\rvert}{\lvert DB\rvert} \le \pi \right\}\]
  • Computation: index-based (spatial index structure for \(\epsilon\)-range queries), nested loop, or grid-based (cells whose side length is derived from \(\epsilon\), so nearby points fall into the same or adjacent cells); a naive nested-loop sketch follows this list
  • Application: Deriving intensional knowledge
    • Find the minimal subsets of attributes in which a point is still an outlier
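
A naive nested-loop sketch of the \(DB(\epsilon,\pi)\) definition above (pure numpy; each point is counted in its own neighborhood, exactly as the formula is written; \(\epsilon\), \(\pi\) and the toy data are arbitrary):

```python
import numpy as np

def db_outliers(X, eps, pi):
    """Return indices of DB(eps, pi)-outliers via a naive O(n^2) nested loop."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    close_fraction = (dists < eps).sum(axis=1) / n                    # fraction of points within eps
    return np.where(close_fraction <= pi)[0]

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[10.0, 10.0]]])  # one planted outlier
    print(db_outliers(X, eps=1.5, pi=0.02))
```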

Outlier scoring based on \(k\)NN distances:

  • \(k\)NN distance; or
  • Aggregate of 1NN, 2NN, …, \(k\)NN distances
  • Algorithms:
    • nested loop to compute \(k\)NN
    • partitioning using micro-clusters
    • linearization of multi-dimensional data sets using space-filling curves
    • ORCA (Bay & Schwabacher (2003)), nested loop with randomization and pruning
      • pruning: maintain the top-\(n\) outliers seen so far by score and discard points whose score can no longer exceed the current cut-off
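
A minimal sketch of kNN-distance scoring (naive \(O(n^2)\) distance computation, without micro-clusters, randomization or pruning; \(k\) and the toy data are arbitrary choices):

```python
import numpy as np

def knn_scores(X, k=5, aggregate=False):
    """Outlier score per point: distance to the k-th NN, or the sum of the 1..k NN distances."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbor
    knn_d = np.sort(dists, axis=1)[:, :k]           # distances to the k nearest neighbors
    return knn_d.sum(axis=1) if aggregate else knn_d[:, -1]

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[7.0, 7.0]]])
    scores = knn_scores(X, k=5)
    print("top-3 outliers by score:", np.argsort(scores)[::-1][:3])
```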

Density-based approaches

Compare the density around a point with the density around its local neighbors. The relative density of a point compared to its neighbors is computed as an outlier score.

Local outlier factor (LOF)

  • Breunig et al (1999)
  • Motivation: Distance-based outlier detection models have problems with different densities
    • multiple clusters with different densities: a point may lie slightly outside the nearest dense cluster, yet not far enough away to be distinguished from the sparser cluster
    • solution: do not compare distances against the global data set, but against the local neighborhood
  • Model:
    • Reachability distance from \(o\) to \(p\): \(\mathrm{RD}_k(p, o) = \max(\mathrm{kdist}(o), \mathrm{dist}(p,o))\)
      • \(\mathrm{kdist}(o)\) is the distance of \(o\) to its \(k\)-th nearest neighbor
    • Local reachability density (LRD): inverse of the average reachability distance from \(p\) to its \(k\)NNs:

      \[\mathrm{LRD}_k(p) = \left(\frac{1}{k} \sum_{o\in k\mathrm{NN}(p)}\mathrm{RD}_k(p,o)\right)^{-1}\]
    • Local outlier factor of \(p\) is the average ratio of LRDs of neighbors of \(p\) to LRD of \(p\):

      \[\mathrm{LOF}_k(p) = \frac{1}{k}\sum_{o\in k\mathrm{NN}(p)}\frac{\mathrm{LRD}_k(o)}{\mathrm{LRD}_k(p)}\]
    • LOF close to 1: the point lies in a cluster (homogeneous density around the point and its neighbors); LOF much larger than 1: the point is an outlier (a from-scratch sketch follows this list)
  • Variation: mine the top-\(n\) local outliers instead of all outliers
    • Compress data points into micro-clusters, then derive upper and lower bounds of the reachability distances, LRDs and LOFs for points within a micro-cluster
    • Then prune micro-clusters that cannot contain top-\(n\) outliers, and refine the remaining micro-clusters iteratively
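
A from-scratch sketch of the LOF formulas above (naive \(O(n^2)\), no micro-cluster pruning; assumes numpy and continuous data so that reachability distances are non-zero; the toy data is an arbitrary illustration):

```python
import numpy as np

def lof(X, k=5):
    """LOF score for every point, computed naively in O(n^2)."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    knn_idx = np.argsort(dists, axis=1)[:, :k]                   # kNN(p)
    kdist = np.sort(dists, axis=1)[:, k - 1]                     # kdist(p)
    # Reachability distance RD_k(p, o) = max(kdist(o), dist(p, o)) for each o in kNN(p)
    rd = np.maximum(kdist[knn_idx], np.take_along_axis(dists, knn_idx, axis=1))
    lrd = 1.0 / rd.mean(axis=1)                                  # LRD_k(p)
    # LOF_k(p): average ratio of the neighbors' LRDs to the LRD of p
    return lrd[knn_idx].mean(axis=1) / lrd

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    X = np.vstack([rng.normal(0, 0.3, size=(100, 2)),            # dense cluster
                   rng.normal(5, 1.5, size=(100, 2)),            # sparse cluster
                   [[1.5, 1.5]]])                                # local outlier near the dense cluster
    print("LOF of planted point:", lof(X, k=10)[-1])             # expected to be well above 1
```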

Influenced outlierness (INFLO)

  • Jin et al (2006)
  • When clusters of different densities are not clearly separated (e.g., they overlap geometrically), LOF has problems
  • Take symmetric neighborhood relationship into account
    • Density defined as \(\mathrm{den}(p) = \dfrac{1}{\mathrm{kdist}(p)}\)
    • Influenced outlierness: \(\mathrm{INFLO}_k(p) = \dfrac{\sum_{o\in kIS(p)}\mathrm{den}(o)}{\lvert kIS(p)\rvert}\frac{1}{\mathrm{den}(p)}\)
      • The influence space \(kIS(p)\) of a point \(p\) is the union of the \(k\)NNs of \(p\) and the reverse \(k\)NNs of \(p\) (i.e., the points that have \(p\) among their \(k\)NNs [1])
    • As with LOF, an INFLO much greater than 1 indicates an outlier, while a value close to 1 indicates a point inside a cluster (see the sketch below)
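
A minimal sketch of INFLO as defined above (naive \(O(n^2)\) neighbor search; the toy data with two overlapping clusters is only illustrative):

```python
import numpy as np

def inflo(X, k=5):
    """INFLO score: mean density over the influence space kIS(p) divided by den(p)."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    knn_idx = np.argsort(dists, axis=1)[:, :k]
    kdist = np.sort(dists, axis=1)[:, k - 1]
    den = 1.0 / kdist                                 # den(p) = 1 / kdist(p)

    knn_sets = [set(row) for row in knn_idx]
    scores = np.empty(n)
    for p in range(n):
        # Reverse kNN: q is in RkNN(p) iff p is among the kNNs of q
        rknn = {q for q in range(n) if p in knn_sets[q]}
        kis = knn_sets[p] | rknn                      # influence space kIS(p)
        scores[p] = den[list(kis)].mean() / den[p]
    return scores

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    X = np.vstack([rng.normal(0, 0.3, size=(80, 2)),
                   rng.normal(1.5, 1.0, size=(80, 2)),   # overlapping, sparser cluster
                   [[4.0, 4.0]]])                        # planted outlier
    print("INFLO of planted point:", inflo(X, k=10)[-1])
```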

Local outlier correlation integral (LOCI)

  • Papadimitriou et al (2003)
  • Similar to LOF, but takes the \(\epsilon\)-neighborhood instead of the \(k\)NNs as reference set, and tests multiple granularities to avoid input parameters
  • Model:
    • \(\epsilon\)-neighborhood: \(N(p,\epsilon) = \{q \mid \textrm{dist}(p, q) \le \epsilon\}\)
    • Local density of \(p\): cardinality of its counting neighborhood \(N(p,\alpha\epsilon)\)
    • Average density of the neighborhood: \(\mathrm{den}(p,\epsilon,\alpha) = \dfrac{\sum_{q\in N(p,\epsilon)}\lvert N(q,\alpha\epsilon)\rvert}{\lvert N(p,\epsilon)\rvert}\)
    • Multi-granularity Deviation Factor (MDEF): \(\mathrm{MDEF}(p,\epsilon,\alpha) = \dfrac{\mathrm{den}(p,\epsilon,\alpha)-\lvert N(p,\alpha\epsilon)\rvert}{\mathrm{den}(p,\epsilon,\alpha)} = 1-\dfrac{\lvert N(p,\alpha\epsilon)\rvert}{\mathrm{den}(p,\epsilon,\alpha)}\)
      • MDEF \(\approx 0\) for points inside a cluster of homogeneous density, and significantly greater than 0 for outliers (see the sketch below)
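
A sketch of the MDEF computation for one fixed \((\epsilon,\alpha)\) pair; the full LOCI method sweeps over many radii and also uses the standard deviation of MDEF to set an automatic threshold, which is omitted here. The parameter values in the example are arbitrary:

```python
import numpy as np

def mdef(X, eps, alpha=0.5):
    """MDEF(p, eps, alpha) for every point, for one fixed pair of radii."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n_eps = dists <= eps                                  # eps-neighborhoods (each point contains itself)
    n_alpha = (dists <= alpha * eps).sum(axis=1)          # |N(q, alpha*eps)| for every q
    # den(p, eps, alpha): average alpha*eps-neighborhood size over the eps-neighbors of p
    den = np.array([n_alpha[n_eps[p]].mean() for p in range(len(X))])
    return 1.0 - n_alpha / den

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[3.0, 3.0]]])   # planted outlier
    scores = mdef(X, eps=3.0)
    print("MDEF of planted point:", scores[-1])           # expected to be close to 1
```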

High-dimensional approaches

ABOD - angle-based outlier degree (Kriegel et al (2008))

  • Angles are more stable than distances in high dimensional spaces (c.f. cosine-based similarity measures for text data)
  • Object \(o\) is an outlier if most other objects are located in similar directions and not an outlier if many other objects are located in varying directions
    • intuition: if \(o\) is an outlier, the remaining points are clustered in a particular direction from \(o\), hence most other points are seen under similar angles
  • Model: variance of the angle spectrum, weighted by the corresponding distances

    \[\mathrm{ABOD}(p) = \underset{x,y\in DB}{\mathrm{Var}}\left(\frac{\langle \vec{xp},\vec{yp}\rangle}{\lVert\vec{xp}\rVert^2 \lVert\vec{yp}\rVert^2}\right)\]
    • \(p\) with a small ABOD value is an outlier (see the sketch below)
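
A naive \(O(n^3)\) sketch of the ABOD score above (the original paper also gives faster approximations, which are not shown; the 5-dimensional toy data is an arbitrary illustration):

```python
import numpy as np
from itertools import combinations

def abod(X):
    """Naive ABOD: variance over all pairs (x, y) of <xp, yp> / (|xp|^2 * |yp|^2)."""
    n = len(X)
    scores = np.empty(n)
    for p in range(n):
        others = np.delete(np.arange(n), p)
        vals = []
        for i, j in combinations(others, 2):
            xp, yp = X[i] - X[p], X[j] - X[p]
            vals.append(xp @ yp / ((xp @ xp) * (yp @ yp)))   # distance-weighted angle term
        scores[p] = np.var(vals)
    return scores

if __name__ == "__main__":
    rng = np.random.default_rng(8)
    X = np.vstack([rng.normal(0, 1, size=(60, 5)), [np.full(5, 6.0)]])   # one planted outlier
    scores = abod(X)
    print("point with smallest ABOD (outlier candidate):", int(np.argmin(scores)))
```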

Grid-based subspace outlier detection (Aggarwal & Yu (2000))

  • Partition the data space with an equi-depth grid in each dimension (\(\Phi\) cells per dimension, each containing a fraction \(1/\Phi\) of the data)
  • Sparsity coefficient for a \(k\)-dimensional grid cell \(C\) that contains \(N_C\) data points:

    \[S(C) = \frac{N_C - n\Phi^{-k}}{\sqrt{n\Phi^{-k}(1-\Phi^{-k})}}\]
    • A \(k\)-dimensional grid cell (\(k\le d\)) fixes a grid range in \(k\) of the \(d\) attributes, i.e., it is a cell of the grid projected onto a \(k\)-dimensional subspace
    • \(S(C)<0\) means \(N_C\) is lower than expected
    • Outliers are the points located in lower-dimensional cells with strongly negative sparsity coefficient (a toy enumeration sketch follows this list)
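
A toy sketch of the sparsity coefficient: equi-depth discretization per attribute, exhaustive enumeration of the \(k\)-dimensional cells (the published method searches the subspaces with an evolutionary algorithm instead of enumerating them), and the most negative \(S(C)\) values reported:

```python
import numpy as np
from itertools import combinations, product

def sparse_cells(X, phi=4, k=2):
    """Report the k-dimensional grid cells with the most negative sparsity coefficient S(C)."""
    n, d = X.shape
    # Equi-depth discretization: each of the phi cells per dimension holds ~n/phi points
    codes = np.empty_like(X, dtype=int)
    for j in range(d):
        edges = np.quantile(X[:, j], np.linspace(0, 1, phi + 1)[1:-1])
        codes[:, j] = np.searchsorted(edges, X[:, j])
    expected = n * phi ** (-k)                                  # expected count under independence
    sigma = np.sqrt(n * phi ** (-k) * (1 - phi ** (-k)))
    results = []
    for dims in combinations(range(d), k):                      # choose k of the d attributes
        for cell in product(range(phi), repeat=k):              # one k-dimensional cell
            count = int(np.all(codes[:, dims] == cell, axis=1).sum())
            results.append(((count - expected) / sigma, dims, cell, count))
    return sorted(results)[:5]                                  # most negative S(C) first

if __name__ == "__main__":
    rng = np.random.default_rng(9)
    X = rng.normal(size=(400, 4))
    for s, dims, cell, count in sparse_cells(X):
        print(f"S(C)={s:.2f}  dims={dims}  cell={cell}  count={count}")
```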

Subspace outlier degree (SOD)

  • Kriegel et al (2009)
  • Outliers may be visible only in subspaces of the original data
  • For each point \(p\), determine the subspace (set of attributes) in which the \(k\)NNs of \(p\) have minimal variance, take the hyperplane \(H(kNN(p))\) through their mean that is orthogonal to that subspace, and use the distance of \(p\) to this hyperplane as the measure of outlierness (a simplified sketch follows)
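
A simplified sketch of the SOD idea: the reference set is taken to be the plain \(k\)NNs (the published method uses a shared-nearest-neighbor reference set), low-variance attributes are selected with a threshold \(\alpha\) on the average variance, and the score is the normalized distance to the reference mean within those attributes:

```python
import numpy as np

def sod(X, k=10, alpha=0.8):
    """Simplified subspace outlier degree: deviation from the kNN mean in low-variance attributes."""
    n, d = X.shape
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    knn_idx = np.argsort(dists, axis=1)[:, :k]
    scores = np.empty(n)
    for p in range(n):
        ref = X[knn_idx[p]]                                   # reference set: the kNNs of p
        mu, var = ref.mean(axis=0), ref.var(axis=0)
        relevant = var < alpha * var.mean()                   # attributes where the neighbors agree
        if not relevant.any():
            scores[p] = 0.0
            continue
        # Distance of p to the hyperplane through mu, measured in the relevant attributes
        scores[p] = np.sqrt(((X[p, relevant] - mu[relevant]) ** 2).sum()) / relevant.sum()
    return scores

if __name__ == "__main__":
    rng = np.random.default_rng(10)
    X = rng.normal(size=(200, 3))
    X[:, 2] = 0.0                       # the data lives in the x-y plane ...
    X[-1, 2] = 5.0                      # ... except one point sticking out in the third attribute
    print("SOD of planted point:", sod(X)[-1])
```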

Reference

http://www.cse.ust.hk/~leichen/courses/comp5331/

  1. http://www.dblab.ntua.gr/~gtsat/collection/RKNN/RS20P1.PDF 

Bibliographic data

@misc{kriegel2010outlier,
   title = "Outlier Detection Techniques",
   author = "Hans-Peter Kriegel and Peer Kröger and Arthur Zimek",
   howpublished = "Tutorial, 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining",
   address = "Washington DC",
   year = "2010",
   url = "http://www.dbs.ifi.lmu.de/~zimek/publications/KDD2010/kdd10-outlier-tutorial.pdf",
}