Outlier: definition by Hawkins (1980): an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism

  • mechanism: statistical process
  • Example: Hadlum v. Hadlum (1949): the birth of a child occurred 349 days after Mr. Hadlum had left for military service, while the average human gestation period is 280 days

General application scenarios: this tutorial focuses on the unsupervised setting

  • Supervised: Training data with both normal and abnormal data objects are provided, but the classification problem is highly imbalanced
  • Semi-supervised: Only training data with normal (or abnormal) classes are provided
  • Unsupervised: No training data available

Outlier = Side product of clustering algorithm

  • What clustering considers as “noise”
  • Outlier detection = Extract the noise set from the clustering output
    • problems: a set of outliers may itself form a cluster; the accuracy of outlier detection depends on how well the clustering algorithm captures the structure of the data

Global vs local classification approaches

  • Given a set of reference objects, evaluate the outlierness of each data point against it
  • global: assumes there is only one normal mechanism
    • problem: outliers in the reference set may falsify the result
  • local: no assumption on the number of normal mechanisms
    • problem: how to choose a reference set

Labeling vs scoring

  • considers the output of an outlier detection algorithm
  • labeling: binary
  • scoring: continuous output (e.g., probability for being an outlier)

Model-based approach:

  • Apply a model to represent the normal data points; outliers are the points that do not fit the model
    • Probabilistic tests
    • Deviation-based
  • Proximity-based approach: Examine the spatial proximity of each object in the data space; if an object deviates strongly from the proximity of the other objects, it is considered an outlier
    • distance-based
    • density-based
  • Angle-based: Examine the spectrum of pairwise angles between a given point and all other points; outliers are points whose angle spectrum shows little fluctuation (most other points lie in similar directions from them)

Model-based

Statistical tests

Given a statistical distribution (e.g., Gaussian), compute its parameters (e.g., mean and standard deviation in the Gaussian case) assuming all data points were generated by that distribution. Outliers are the points that have a low probability of being generated by the distribution.

Multivariate normal density function:

$$N(x) = \frac{1}{\sqrt{(2\pi)^d \, |\Sigma|}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

  • $d$: dimensionality of the data space
  • $\Sigma$ is the covariance matrix, $\mu$ the mean of the data
  • Mahalanobis distance of point $x$ to $\mu$: $MDist(x, \mu) = \sqrt{(x-\mu)^T \Sigma^{-1} (x-\mu)}$
    • The squared Mahalanobis distance follows a $\chi^2$-distribution with $d$ degrees of freedom
    • For example, outliers are those points with $MDist^2(x, \mu) > \chi^2_{0.975}(d)$, or with $MDist(x, \mu) > 3$ (a sketch follows)
  • Suffers from the curse of dimensionality: the higher the number of degrees of freedom $d$, the more similar the value of $MDist$ becomes across all points
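A minimal numpy/scipy sketch of this test; the function name, the planted outlier, and the 0.975 cutoff are illustrative choices:

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.975):
    """X: (n, d) data matrix. Returns a boolean mask of suspected outliers."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)            # d x d covariance matrix
    cov_inv = np.linalg.inv(cov)
    diff = X - mu
    # Squared Mahalanobis distance of each point to the mean
    md2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    # MDist^2 follows a chi^2 distribution with d degrees of freedom
    threshold = chi2.ppf(alpha, df=X.shape[1])   # illustrative cutoff
    return md2 > threshold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[0] = [6.0, 6.0]                            # plant an obvious outlier
# prints index 0, plus possibly a few borderline points at the 97.5% quantile
print(np.flatnonzero(mahalanobis_outliers(X)))
```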

Depth-based approaches

Look for outliers at the border of the data space, independent of the statistical distribution

  • Find layers of convex hulls confining the data; points on the outer layers are the outliers
  • Assumes outliers are located at the border of the data space
  • Usually efficient only in 2D or 3D spaces, since convex hull computation becomes expensive in higher dimensions

Deviation-based approaches

Assume some general characteristics of the data points; outliers are those points that do not fit them

  • Example: outliers are the points whose removal minimizes the variance of the remaining set
  • Model of Arning et al (1996):
    • Smoothing factor $SF(I)$: for each subset $I \subseteq DB$, the magnitude of the decrease in variance of $DB$ when $I$ is removed from it
    • Outliers are the elements of the exception set $E$ such that $SF(E) \geq SF(I)$ for all $I \subseteq DB$
    • The search space over subsets is $O(2^n)$ for $n$ data points, so heuristics such as random sampling are applied (a singleton-only sketch follows this list)
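A sketch restricted to singleton exception sets (removing one point at a time), which illustrates the variance-decrease idea without the exponential subset search; names and data are illustrative:

```python
import numpy as np

def variance_decrease(X):
    """For each point, the decrease in total variance when that point is removed."""
    n = len(X)
    total_var = X.var(axis=0).sum()          # variance summed over dimensions
    scores = np.empty(n)
    for i in range(n):
        rest = np.delete(X, i, axis=0)
        scores[i] = total_var - rest.var(axis=0).sum()
    return scores

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])
print(np.argmax(variance_decrease(X)))       # index of the strongest outlier: 100
```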

Proximity-based

Distance-based approaches

Judge a point based on the distances to its neighbors, assuming that outliers are far apart from their neighbors (i.e., have a less dense neighborhood)

  • Knorr and Ng (1997)
  • A point $p$ is a $DB(pct, dmin)$-outlier if at most $pct$ percent of all other points have a distance to $p$ less than $dmin$:

    $$OutlierSet(pct, dmin) = \left\{ p \;\middle|\; \frac{|\{q \in DB \setminus \{p\} \mid dist(q, p) < dmin\}|}{|DB|} \leq pct \right\}$$

  • Computation: index-based, using spatial index structures for distance range queries; nested loop; grid-based, with cells of side length $dmin/(2\sqrt{d})$ so that all points within one cell are within distance $dmin/2$ of each other (a naive nested-loop sketch follows this list)
  • Application: Deriving intensional knowledge
    • Find the minimal subsets of attributes in which a point is still an outlier
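A naive $O(n^2)$ nested-loop sketch of the $DB(pct, dmin)$ test; the index- and grid-based accelerations above are omitted, and the parameter values are illustrative:

```python
import numpy as np

def db_outliers(X, pct, dmin):
    """Naive nested-loop test for DB(pct, dmin)-outliers (pct as a fraction)."""
    n = len(X)
    mask = np.empty(n, dtype=bool)
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        # fraction of *other* points closer than dmin
        frac = (np.sum(dists < dmin) - 1) / (n - 1)
        mask[i] = frac <= pct
    return mask

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(size=(200, 2)), [[9.0, 9.0]]])
# flags the isolated point (and possibly a few fringe points of the cluster)
print(np.flatnonzero(db_outliers(X, pct=0.02, dmin=1.0)))
```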

Outlier scoring based on $k$NN distances:

  • Score = $k$NN distance of the point; or
  • Score = aggregate of the 1NN, 2NN, …, $k$NN distances (e.g., their sum; see the sketch after this list)
  • Algorithms:
    • nested loop to compute the $k$NNs of each point
    • partitioning using micro-clusters
    • linearization of the multi-dimensional data set using space-filling curves
    • ORCA (Bay & Schwabacher (2003)): nested loop with randomization and pruning
      • pruning: keep only the top-$n$ outliers found so far; a point is pruned as soon as its score can no longer exceed the weakest of the current top-$n$
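A minimal sketch of both scoring variants, using a full pairwise distance matrix rather than any of the accelerations listed above; names and parameters are illustrative:

```python
import numpy as np

def knn_scores(X, k=5, aggregate=False):
    """Outlier score = distance to the k-th NN, or the sum of the 1NN..kNN distances."""
    # Full pairwise distance matrix: O(n^2) time and memory, fine for small n
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)               # a point is not its own neighbor
    knn_d = np.sort(D, axis=1)[:, :k]         # distances to the k nearest neighbors
    return knn_d.sum(axis=1) if aggregate else knn_d[:, -1]

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(100, 2)), [[7.0, 7.0]]])
print(np.argmax(knn_scores(X)))               # the planted outlier: 100
```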

Density-based approaches

Compare the density around a point with the density around its local neighbors. The relative density of a point compared to its neighbors is computed as an outlier score.

Local outlier factor (LOF)

  • Breunig et al (2000)
  • Motivation: Distance-based outlier detection models have problems with different densities
    • with multiple clusters of different densities, a point slightly outside the nearest dense cluster can be farther from its neighbors than members of a sparser cluster are from theirs, so no global distance threshold separates it
    • solution: do not compare distances against the global data set, but against the point's local neighborhood
  • Model:
    • Reachability distance of $p$ w.r.t. $o$: $reach\text{-}dist_k(p, o) = \max\{k\text{-}dist(o), dist(p, o)\}$
      • $k\text{-}dist(o)$ is the distance of $o$ to its $k$-th nearest neighbor
    • Local reachability density (LRD): inverse of the average reachability distance of the $k$NNs of $p$:

      $$LRD_k(p) = 1 \Big/ \frac{\sum_{o \in kNN(p)} reach\text{-}dist_k(p, o)}{|kNN(p)|}$$

    • Local outlier factor of $p$: average ratio of the LRDs of the neighbors of $p$ to the LRD of $p$:

      $$LOF_k(p) = \frac{1}{|kNN(p)|} \sum_{o \in kNN(p)} \frac{LRD_k(o)}{LRD_k(p)}$$

    • LOF close to 1: the point lies inside a cluster (homogeneous density around the point and its neighbors); LOF much larger than 1: the point is an outlier (a naive sketch follows this list)
  • Variation: Mining the top-$n$ local outliers instead of all outliers
    • Compress data points into micro-clusters, then derive upper and lower bounds on the reachability distances, LRDs, and LOFs of the points within each micro-cluster
    • Then prune micro-clusters that cannot contain top-$n$ outliers, and refine the remaining micro-clusters iteratively
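A naive $O(n^2)$ sketch of the LOF formulas above; it ignores ties in the $k$-distance (the paper lets $|kNN(p)|$ exceed $k$ in that case), and $k$ and the data are illustrative:

```python
import numpy as np

def lof(X, k=5):
    """Local outlier factor, directly from the formulas above (naive O(n^2))."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                      # exclude the point itself
    knn_idx = np.argsort(D, axis=1)[:, :k]           # indices of the kNNs
    k_dist = np.take_along_axis(D, knn_idx[:, -1:], axis=1).ravel()  # k-dist(p)
    # reach-dist_k(p, o) = max(k-dist(o), dist(p, o)) for each neighbor o of p
    reach = np.maximum(k_dist[knn_idx], np.take_along_axis(D, knn_idx, axis=1))
    lrd = 1.0 / reach.mean(axis=1)                   # local reachability density
    return lrd[knn_idx].mean(axis=1) / lrd           # LOF_k(p)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(100, 2)),            # dense cluster
               rng.normal(5.0, 2.0, size=(50, 2)),   # sparser cluster
               [[10.0, -10.0]]])                     # isolated point
print(lof(X).round(2)[-1])                           # LOF of the isolated point >> 1
```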

Influenced outlierness (INFLO)

  • Jin et al (2006)
  • When clusters of different densities are not clearly separated (e.g., they overlap geometrically), LOF has problems
  • Takes the symmetric neighborhood relationship into account
    • Density defined as $den(p) = 1 / k\text{-}dist(p)$
    • Influenced outlierness:

      $$INFLO_k(p) = \frac{\frac{1}{|IS_k(p)|} \sum_{o \in IS_k(p)} den(o)}{den(p)}$$

      • The influence space $IS_k(p)$ of a point $p$ is the union of the $k$NNs of $p$ and the reverse $k$NNs of $p$ (i.e., the set of points that have $p$ among their $k$NNs [1])
    • As with LOF, INFLO much greater than 1 indicates an outlier, and INFLO close to 1 a point inside a cluster (a naive sketch follows this list)
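A naive transcription of the INFLO definitions, finding reverse $k$NNs by brute force; $k$ and the data are illustrative:

```python
import numpy as np

def inflo(X, k=5):
    """Influenced outlierness per the formulas above (naive O(n^2) version)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    knn_idx = np.argsort(D, axis=1)[:, :k]
    k_dist = np.take_along_axis(D, knn_idx[:, -1:], axis=1).ravel()
    den = 1.0 / k_dist                                 # den(p) = 1 / k-dist(p)
    knn_sets = [set(row) for row in knn_idx]
    scores = np.empty(n)
    for p in range(n):
        rknn = {q for q in range(n) if p in knn_sets[q]}  # reverse kNNs of p
        ispace = knn_sets[p] | rknn                        # influence space IS_k(p)
        scores[p] = den[list(ispace)].mean() / den[p]
    return scores

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])
print(inflo(X).round(2)[-1])                          # isolated point scores >> 1
```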

Local outlier correlation integral (LOCI)

  • Papadimitriou et al (2003)
  • Similar to LOF, but takes the $r$-neighborhood instead of the $k$NNs as reference set, and tests multiple granularities (values of $r$) to get rid of input parameters
  • Model:
    • $r$-neighborhood: $N(p, r) = \{q \mid dist(p, q) \leq r\}$
    • Local density of $p$: the cardinality $n(p, \alpha r) = |N(p, \alpha r)|$ of its counting neighborhood, for a constant $\alpha < 1$ (e.g., $\alpha = 1/2$)
    • Average density over the (sampling) neighborhood: $\hat{n}(p, r, \alpha) = \frac{\sum_{q \in N(p, r)} n(q, \alpha r)}{|N(p, r)|}$
    • Multi-granularity deviation factor (MDEF):

      $$MDEF(p, r, \alpha) = 1 - \frac{n(p, \alpha r)}{\hat{n}(p, r, \alpha)}$$

      • MDEF $\approx 0$ for points within a cluster and $> 0$ for outliers (a single-granularity sketch follows this list)
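A single-granularity sketch of MDEF; LOCI proper scans a range of radii and flags points whose MDEF exceeds a multiple of its standard deviation, while here $r$, $\alpha$, and the data are fixed, illustrative choices:

```python
import numpy as np

def mdef(X, r, alpha=0.5):
    """MDEF(p, r, alpha) for every point, at a single granularity r."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n_alpha = (D <= alpha * r).sum(axis=1)        # |N(p, alpha*r)|, includes p itself
    scores = np.empty(n)
    for p in range(n):
        neighborhood = np.flatnonzero(D[p] <= r)  # sampling neighborhood N(p, r)
        avg = n_alpha[neighborhood].mean()        # average local density over it
        scores[p] = 1.0 - n_alpha[p] / avg
    return scores

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(size=(150, 2)), [[5.0, 5.0]]])
print(mdef(X, r=8.0).round(2)[-1])                # close to 1 for the isolated point
```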

High-dimensional approaches

ABOD - angle-based outlier degree (Kriegel et al (2008))

  • Angles are more stable than distances in high dimensional spaces (c.f. cosine-based similarity measures for text data)
  • An object is an outlier if most other objects are located in similar directions, and not an outlier if many other objects are located in varying directions
    • if $p$ is an outlier, most other points belong to clusters lying in particular directions from $p$, so the points of each cluster are seen under similar angles
  • Model: variance of the angle spectrum, weighted by the corresponding distances:

    $$ABOF(p) = VAR_{x, y \in DB} \left( \frac{\langle \overrightarrow{px}, \overrightarrow{py} \rangle}{\|\overrightarrow{px}\|^2 \cdot \|\overrightarrow{py}\|^2} \right)$$

    • points with a small ABOF are outliers (a naive $O(n^3)$ sketch follows this list)
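A naive $O(n^3)$ transcription of the ABOF idea: a weighted variance over the distance-weighted angle terms, with pair weights taken as the inverse distance product; this follows the common presentation of Kriegel et al (2008) and skips the paper's faster approximations:

```python
import numpy as np
from itertools import combinations

def abof(X):
    """Naive angle-based outlier factor: weighted variance of angle terms."""
    n = len(X)
    scores = np.empty(n)
    for a in range(n):
        others = [i for i in range(n) if i != a]
        vals, weights = [], []
        for b, c in combinations(others, 2):
            ab, ac = X[b] - X[a], X[c] - X[a]
            nb, nc = np.dot(ab, ab), np.dot(ac, ac)     # squared norms
            vals.append(np.dot(ab, ac) / (nb * nc))     # distance-weighted angle term
            weights.append(1.0 / np.sqrt(nb * nc))      # weight = 1 / (|ab| * |ac|)
        vals, weights = np.array(vals), np.array(weights)
        mean = np.average(vals, weights=weights)
        scores[a] = np.average((vals - mean) ** 2, weights=weights)  # weighted variance
    return scores                                       # small ABOF -> outlier

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(size=(30, 2)), [[8.0, 8.0]]])
print(np.argmin(abof(X)))                               # the planted outlier: 30
```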

Grid-based subspace outlier detection (Aggarwal & Yu (2000))

  • Partition the data space by an equi-depth lattice grid: $\phi$ ranges per dimension, each holding a fraction $f = 1/\phi$ of the data
  • Sparsity coefficient of a $k$-dimensional grid cell $C$ that contains $n(C)$ of the $N$ data points:

    $$S(C) = \frac{n(C) - N \cdot f^k}{\sqrt{N \cdot f^k \cdot (1 - f^k)}}$$

    • a $k$-dimensional grid cell ($k < d$) is the rectangular composition of $k$ one-dimensional grid cells
    • $S(C) < 0$ means $n(C)$ is lower than expected
    • Outliers are those points located in lower-dimensional cells with a strongly negative sparsity coefficient (a brute-force sketch follows this list)
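A brute-force sketch that enumerates every $k$-dimensional cell of an equi-depth grid and reports its sparsity coefficient; Aggarwal & Yu search this exponential space with an evolutionary algorithm instead, and the function name and parameters here are illustrative:

```python
import numpy as np
from itertools import combinations

def sparsity_coefficients(X, k=2, phi=5):
    """Sparsity coefficient S(C) of every occupied k-dim cell of an equi-depth grid."""
    N, d = X.shape
    f = 1.0 / phi
    # equi-depth bin index of every value, per dimension
    bins = np.empty_like(X, dtype=int)
    for j in range(d):
        edges = np.quantile(X[:, j], np.linspace(0, 1, phi + 1)[1:-1])
        bins[:, j] = np.searchsorted(edges, X[:, j])
    results = {}
    for dims in combinations(range(d), k):       # every k-attribute subspace
        cells, counts = np.unique(bins[:, dims], axis=0, return_counts=True)
        for cell, n_c in zip(cells, counts):
            s = (n_c - N * f**k) / np.sqrt(N * f**k * (1 - f**k))
            results[(dims, tuple(cell))] = s
    return results                               # strongly negative S -> sparse cell

rng = np.random.default_rng(9)
X = rng.normal(size=(500, 4))
S = sparsity_coefficients(X, k=2, phi=5)
dims_cell, s = min(S.items(), key=lambda kv: kv[1])
print(dims_cell, round(s, 2))                    # sparsest 2-d cell and its coefficient
```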

Subspace outlier degree (SOD)

  • Kriegel et al (2009)
  • Outliers may be visible only in subspaces of the original data
  • Compute, for a point $o$, the subspace in which its reference set (e.g., its $k$NNs) has minimal variance, then compute the hyperplane orthogonal to that subspace (axis-parallel, through the mean of the reference set), and assign the distance of $o$ to the hyperplane as its measure of outlierness (a rough sketch follows below)
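A rough sketch under simplifying assumptions: the reference set is taken as the plain $k$NNs (the paper derives it via shared-nearest-neighbor similarity), and the variance threshold `alpha` for deciding which attributes span the relevant subspace is an illustrative choice:

```python
import numpy as np

def sod(X, k=10, alpha=0.8):
    """Subspace outlier degree sketch: distance to an axis-parallel hyperplane
    defined by the low-variance attributes of the kNN reference set."""
    n, d = X.shape
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    scores = np.empty(n)
    for p in range(n):
        ref = X[np.argsort(D[p])[:k]]            # reference set: kNNs of p
        mu, var = ref.mean(axis=0), ref.var(axis=0)
        relevant = var < alpha * var.mean()      # attributes where the kNNs agree
        if not relevant.any():
            scores[p] = 0.0
            continue
        # distance of p to the hyperplane through mu in the relevant attributes,
        # normalized by the subspace dimensionality
        scores[p] = np.sqrt(((X[p, relevant] - mu[relevant]) ** 2).sum()) / relevant.sum()
    return scores

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
X[:, 2] *= 0.01                  # attribute 2 is nearly constant for everyone
X[0, 2] = 1.0                    # point 0 deviates only in that subspace
print(np.argmax(sod(X)))         # 0
```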

Reference

http://www.cse.ust.hk/~leichen/courses/comp5331/

  1. http://www.dblab.ntua.gr/~gtsat/collection/RKNN/RS20P1.PDF 

Bibliographic data

@misc{kriegel2010outlier,
   title = "Outlier Detection Techniques",
   author = "Hans-Peter Kriegel and Peer Kröger and Arthur Zimek",
   howpublished = "Tutorial, 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining",
   address = "Washington DC",
   year = "2010",
   url = "http://www.dbs.ifi.lmu.de/~zimek/publications/KDD2010/kdd10-outlier-tutorial.pdf",
}