Outlier: definition by Hawkins (1980), an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism
 mechanism: statistical process
 Example: Hadlum v. Hadlum (1949): the birth of a child occurred 349 days after Mr. Hadlum left for military service, while the average human gestation period is 280 days
General application scenarios: this tutorial focuses on the unsupervised scenario
 Supervised: training data with both normal and abnormal data objects are provided, but the classification problem is highly imbalanced
 Semi-supervised: only training data of the normal (or only the abnormal) class are provided
 Unsupervised: No training data available
Outlier = side product of a clustering algorithm
 What the clustering considers as "noise"
 Outlier detection = extract the noise set from the clustering output
 problems: a set of outliers may itself be considered a cluster; the accuracy of outlier detection depends on how well the clustering algorithm captures the structure of the data
Global vs local classification approaches
 Given a set of reference objects, evaluate the outlierness of each data point against it
 global: assumes there is only one normal mechanism
 problem: outliers in the reference set may falsify the result
 local: no assumption on the number of normal mechanisms
 problem: how to choose a reference set
Labelling vs scoring
 considers the output of an outlier detection algorithm
 labeling: binary
 scoring: continuous output (e.g., probability for being an outlier)
Model-based approach:
 Apply a model to represent the normal data points; outliers are the points that do not fit the model
 Probabilistic tests
 Deviation-based
 Proximity-based approach: examine the spatial proximity of each object in the data space; if an object deviates strongly from the proximity of the other objects, it is considered an outlier
 distance-based
 density-based
 Angle-based: examine the spectrum of pairwise angles between a given point and all other points; outliers are points whose spectrum features high fluctuation
Model based
Statistical tests
Given a statistical distribution (e.g., Gaussian), compute its parameters (e.g., mean and standard deviation in the Gaussian case) assuming all data points have been generated by such a distribution. Outliers are the points that have a low probability of being generated by this distribution.
Multivariate normal density function: f(x) = 1 / ((2π)^(d/2) · |Σ|^(1/2)) · exp(−(1/2) · (x − μ)ᵀ Σ⁻¹ (x − μ))
 d: dimensionality of the data space
 Σ is the covariance matrix of the data (μ is the mean)
 Mahalanobis distance of point x to μ: MDist(x, μ) = sqrt((x − μ)ᵀ Σ⁻¹ (x − μ))
 The squared Mahalanobis distance follows a χ² distribution with d degrees of freedom
 For example, outliers are those points x with MDist(x, μ) > χ²_d(0.975) (roughly a "3σ" criterion)
 Suffers from the curse of dimensionality: the higher the number of degrees of freedom d, the more similar the MDist values for all points
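A minimal numpy sketch of this test; the toy data, variable names, and the hardcoded cutoff χ²₂(0.975) ≈ 7.378 (taken from a chi-square table for d = 2) are illustrative assumptions:

```python
import numpy as np

# Toy 2-D data: a tight Gaussian cluster plus one far-away point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)), [[8.0, 8.0]]])

mu = X.mean(axis=0)                  # sample mean
cov = np.cov(X, rowvar=False)        # sample covariance matrix
cov_inv = np.linalg.inv(cov)

# Squared Mahalanobis distance of every point to the mean.
diff = X - mu
md2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# Under the Gaussian assumption, md2 follows chi^2 with d = 2 degrees of
# freedom; chi^2_2(0.975) ~= 7.378 is an assumed table value.
outliers = np.where(md2 > 7.378)[0]
```

Note that the far-away point also inflates the estimated mean and covariance, which is exactly the "outliers in the reference set may falsify the result" problem mentioned above.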
Depth-based approaches
Look for outliers at the border of the data space, independent of the statistical distribution
 Find layers of convex hulls confining the data; the points on the outer layers are the outliers
 Assumes outliers are located at the border of the data space
 Usually only efficient in 2D or 3D spaces
Deviation-based approaches
Assume some general characteristics of the data points; outliers are those that do not fit them
 Example: outliers are the points whose removal minimizes the variance of the remaining set
 Model of Arning et al. (1996):
 Smoothing factor SF(I): computes, for each subset I ⊆ DB, the magnitude of the decrease in variance of DB when I is removed from it
 Outliers are the elements of the exception set E with SF(E) ≥ SF(I) for all I ⊆ DB
 The search space is O(2^n) for n data points, so heuristics such as random sampling are applied
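A toy sketch of the smoothing-factor idea, simplified to singleton candidate sets I = {x} (the actual model searches over subsets; the data and names are illustrative):

```python
import numpy as np

# 1-D toy set: removing the value 50 reduces the variance the most.
db = np.array([1.0, 2.0, 3.0, 2.0, 1.0, 50.0])

def smoothing_factor(db, idx):
    """SF(I) for a singleton candidate set I = {db[idx]}:
    cardinality of the remainder times the drop in variance."""
    rest = np.delete(db, idx)
    return len(rest) * (db.var() - rest.var())

sf = np.array([smoothing_factor(db, i) for i in range(len(db))])
exception = int(np.argmax(sf))   # index whose removal smooths the set most
```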
Proximity-based
Distance-based approaches
Judge a point based on the distance to its neighbors, assuming that outliers are far from their neighbors (i.e., have a less dense neighborhood)
 Knorr and Ng (1997)
 A point p is a DB(ε, π)-outlier if at most π percent of all other points have a distance to p less than ε: Card({q ∈ DB | dist(p, q) < ε}) / Card(DB) ≤ π
 Computation: index-based (spatial index structures for ε-range queries), nested loop, or grid-based with cells small enough that all points within one cell are within distance ε of each other
 Application: Deriving intensional knowledge
 Find the minimal subsets of attributes in which a point is still an outlier
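The DB(ε, π) definition can be sketched in numpy (toy 1-D data; the ε and π values are illustrative assumptions):

```python
import numpy as np

# Toy 1-D data: a cluster around 0 and one isolated point at 10.
X = np.array([[0.0], [0.1], [0.2], [0.15], [0.05], [10.0]])
eps, pi = 1.0, 0.2          # hypothetical parameters epsilon and pi

dist = np.abs(X - X.T)      # pairwise distances (1-D case)
np.fill_diagonal(dist, np.inf)   # exclude each point itself

# p is a DB(eps, pi)-outlier if at most pi percent of the *other*
# points lie within distance eps of p.
frac_close = (dist < eps).sum(axis=1) / (len(X) - 1)
db_outliers = np.where(frac_close <= pi)[0]
```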
Outlier scoring based on NN distances:
 kNN distance: the distance of p to its k-th nearest neighbor; or
 Aggregate of the 1NN, 2NN, …, kNN distances
 Algorithms:
 nested loop to compute the kNNs
 partitioning using micro-clusters
 linearization of the multidimensional data set using space-filling curves
 ORCA (Bay & Schwabacher (2003)): nested loop with randomization and pruning
 pruning: keep only the top-n outliers found so far, ranked by score
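A brute-force numpy sketch of kNN-distance scoring, covering both the k-th-distance and the aggregate variant (no pruning or index structures; names and data are illustrative):

```python
import numpy as np

def knn_scores(X, k):
    """Outlier score = distance to the k-th nearest neighbor;
    the aggregate variant sums the 1NN..kNN distances instead."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)     # a point is not its own neighbor
    d.sort(axis=1)                  # row-wise ascending neighbor distances
    return d[:, k - 1], d[:, :k].sum(axis=1)

# Four points in a unit square plus one far-away point.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [6.0, 6.0]])
kth, agg = knn_scores(X, k=2)
top_outlier = int(np.argmax(kth))   # the isolated point scores highest
```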
Density-based approaches
Compare the density around a point with the density around its local neighbors. The relative density of a point compared to its neighbors is computed as an outlier score.
Local outlier factor (LOF)
 Breunig et al (1999)
 Motivation: distance-based outlier detection models have problems with different densities
 with multiple clusters of different densities, a point may lie a bit farther away from the nearest dense cluster, but not far enough to be distinguished from the points of a less dense cluster
 solution: do not compare distances against the global data set, but against the nearest subset (the local neighborhood)
 Model:
 Reachability distance from p to o: reach-dist_k(p, o) = max{k-dist(o), dist(p, o)}
 k-dist(o) is the distance of o to its k-th nearest neighbor

Local reachability density: inverse of the average reachability distance from p to its kNNs: lrd_k(p) = 1 / (avg over o ∈ kNN(p) of reach-dist_k(p, o))

Local outlier factor of p: the average ratio of the lrds of p's kNNs to the lrd of p: LOF_k(p) = (avg over o ∈ kNN(p) of lrd_k(o)) / lrd_k(p)
 LOF close to 1: the point lies in a cluster (homogeneous density around the point and its neighbors); LOF much larger than 1: the point is an outlier
 Variation: mine only the top-n local outliers instead of all outliers
 Compress the data points into micro-clusters, then derive upper and lower bounds on the reachability distances, lrds, and LOFs for the points within each micro-cluster
 Then prune micro-clusters that cannot contain top-n outliers, and refine the remaining micro-clusters iteratively
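The reach-dist/lrd/LOF definitions above, as a brute-force numpy sketch (no micro-cluster compression; toy data and names are illustrative):

```python
import numpy as np

def lof(X, k):
    """Minimal LOF sketch following the definitions above."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :k]       # indices of the kNNs
    kdist = np.sort(d, axis=1)[:, k - 1]     # k-distance of each point
    # reach-dist_k(p, o) = max{k-dist(o), dist(p, o)}
    reach = np.maximum(kdist[knn], np.take_along_axis(d, knn, axis=1))
    lrd = 1.0 / reach.mean(axis=1)           # local reachability density
    return lrd[knn].mean(axis=1) / lrd       # LOF_k(p)

# Four points in a unit square plus one far-away point.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
scores = lof(X, k=2)
```

The cluster points get LOF = 1 (homogeneous density), while the isolated point gets an LOF well above 1.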
Influenced outlierness (INFLO)
 Jin et al (2006)
 When clusters of different densities are not clearly separated (e.g., they overlap geometrically), LOF has problems
 Takes the symmetric neighborhood relationship into account
 Density defined as den(p) = 1 / k-dist(p)
 Influenced outlierness: INFLO_k(p) = (avg over o ∈ IS_k(p) of den(o)) / den(p)
 The influence space IS_k(p) of a point p is the union of the kNNs of p and the reverse kNNs of p (i.e., the set of points that have p among their kNNs ^{1})
 Similar to LOF, INFLO much greater than 1 is an outlier and close to 1 is in a cluster
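A brute-force sketch along the definitions above, with the influence space built as kNNs plus reverse kNNs (toy data; names are illustrative):

```python
import numpy as np

def inflo(X, k):
    """Minimal INFLO sketch: den(p) = 1/k-dist(p), scored over the
    influence space IS_k(p) = kNN(p) union RkNN(p)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    order = np.argsort(d, axis=1)
    knn = [set(order[i, :k]) for i in range(len(X))]
    den = 1.0 / np.sort(d, axis=1)[:, k - 1]        # den(p) = 1 / k-dist(p)
    scores = np.empty(len(X))
    for i in range(len(X)):
        rknn = {j for j in range(len(X)) if i in knn[j]}  # reverse kNNs
        IS = knn[i] | rknn                          # influence space of p
        scores[i] = den[list(IS)].mean() / den[i]
    return scores

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
scores = inflo(X, k=2)
```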
Local outlier correlation integral (LOCI)
 Papadimitriou et al (2003)
 Similar to LOF, but takes the r-neighborhood instead of the kNNs as the reference set, and tests multiple granularities to get rid of input parameters
 Model:
 r-neighborhood: N(p, r) = {q ∈ DB | dist(p, q) ≤ r}
 Local density of p: n(p, αr) = cardinality of N(p, αr), for some α < 1
 Average density of the neighborhood: n̂(p, r, α) = average of n(q, αr) over all q ∈ N(p, r)
 Multi-granularity deviation factor (MDEF): MDEF(p, r, α) = 1 − n(p, αr) / n̂(p, r, α)
 MDEF ≈ 0 for points within a cluster and > 0 for outliers
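The MDEF computation for a single point at a single granularity can be sketched as follows (LOCI proper scans many radii; the r and α values and the toy data are illustrative assumptions):

```python
import numpy as np

def mdef(X, p, r, alpha):
    """MDEF(p, r, alpha) = 1 - n(p, alpha*r) / avg of n(q, alpha*r)
    over all q in the r-neighborhood N(p, r)."""
    def n(q, radius):                    # local density: |N(q, radius)|
        return (np.linalg.norm(X - X[q], axis=1) <= radius).sum()
    d = np.linalg.norm(X - X[p], axis=1)
    neighborhood = np.where(d <= r)[0]   # N(p, r), includes p itself
    n_avg = np.mean([n(q, alpha * r) for q in neighborhood])
    return 1.0 - n(p, alpha * r) / n_avg

# A small cluster plus one isolated point.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [0.5, 0.5], [4.0, 4.0]])
score_outlier = mdef(X, p=5, r=6.0, alpha=0.25)
score_inlier = mdef(X, p=0, r=6.0, alpha=0.25)
```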
High-dimensional approaches
ABOD – angle-based outlier degree (Kriegel et al. (2008))
 Angles are more stable than distances in high dimensional spaces (c.f. cosinebased similarity measures for text data)
 An object is an outlier if most other objects are located in similar directions, and not an outlier if many other objects are located in varying directions
 if p is an outlier, assume there is a cluster located in a particular angular direction from p; hence all points of that cluster lie in similar directions as seen from p

Model: the spectrum of angles to pairs of points, weighted by the corresponding distances: ABOD(p) = VAR over x, y ∈ DB of (⟨x − p, y − p⟩ / (‖x − p‖² · ‖y − p‖²))
 points with a small ABOD are outliers
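A brute-force sketch of the weighted angle-spectrum variance (O(n²) pairs per point; the cited paper also discusses faster approximations; toy data and names are illustrative):

```python
import numpy as np
from itertools import combinations

def abod(X, p):
    """Variance of the distance-weighted angle spectrum of point p."""
    others = [i for i in range(len(X)) if i != p]
    vals = []
    for i, j in combinations(others, 2):
        a, b = X[i] - X[p], X[j] - X[p]
        # dot product weighted by the squared lengths of both difference vectors
        vals.append(np.dot(a, b) / (np.dot(a, a) * np.dot(b, b)))
    return np.var(vals)

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
scores = np.array([abod(X, p) for p in range(len(X))])
worst = int(np.argmin(scores))   # smallest ABOD -> strongest outlier
```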
Grid-based subspace outlier detection (Aggarwal & Yu (2000))
 Partition the data space by a lattice grid, equi-depth in each dimension (φ ranges per dimension, so each range contains a fraction f = 1/φ of the N points)

Sparsity coefficient for a k-dimensional grid cell C that contains n(C) data points: S(C) = (n(C) − N·f^k) / sqrt(N·f^k·(1 − f^k))
 a k-dimensional grid cell (k < d) is a rectangular composition of one-dimensional grid ranges from k different dimensions
 S(C) < 0 means the cell C contains fewer points than expected
 Outliers are those points that are located in lower-dimensional cells with a negative sparsity coefficient
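The sparsity coefficient as a small function (the values of N, k, and φ in the example are illustrative):

```python
import numpy as np

def sparsity_coefficient(n_C, N, k, phi):
    """Sparsity coefficient of a k-dimensional grid cell containing n_C
    of the N points; phi equi-depth ranges per dimension means each
    range holds the fraction f = 1/phi, and a k-d cell is expected to
    hold N * f**k points."""
    f = 1.0 / phi
    expected = N * f**k
    return (n_C - expected) / np.sqrt(N * f**k * (1.0 - f**k))

# A 2-d cell holding 1 of 1000 points with phi = 10 ranges per dimension:
# expected occupancy is 1000 * (1/10)**2 = 10 points, so S(C) < 0.
s = sparsity_coefficient(n_C=1, N=1000, k=2, phi=10)
```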
Subspace outlier degree (SOD)
 Kriegel et al (2009)
 Outliers may be visible only in subspaces of the original data space
 Compute the subspace in which the kNNs of a point p have minimal variance, then compute the hyperplane that is orthogonal to that subspace, and assign the distance of p to the hyperplane as the measure of outlierness
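A rough SOD-style sketch under simplifying assumptions: the reference set is taken as the kNNs of p, and an attribute is deemed "low variance" if its variance is below α times the average attribute variance; α, this selection rule, and the toy data are illustrative, not the paper's exact criterion:

```python
import numpy as np

def sod(X, p, k, alpha=0.8):
    """Deviation of p from its neighbors' mean, measured only in the
    attributes where the neighbors have low variance."""
    d = np.linalg.norm(X - X[p], axis=1)
    ref = np.argsort(d)[1:k + 1]            # reference set: the kNNs of p
    mu, var = X[ref].mean(axis=0), X[ref].var(axis=0)
    relevant = var < alpha * var.mean()     # low-variance attributes
    if not relevant.any():
        return 0.0
    return np.sqrt(((X[p] - mu)[relevant] ** 2).sum()) / relevant.sum()

# Points on the line y = 0, plus one point deviating in the y attribute.
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0],
              [4.0, 0.0], [2.0, 5.0]])
scores = [sod(X, p, k=4) for p in range(len(X))]
```

The deviating point is an outlier only with respect to the y attribute; its x value is perfectly ordinary, which is the "visible only in a subspace" effect.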
Reference
http://www.cse.ust.hk/~leichen/courses/comp5331/

http://www.dblab.ntua.gr/~gtsat/collection/RKNN/RS20P1.PDF ↩
Bibliographic data
@misc{
title = "Outlier Detection Techniques",
author = "Hans-Peter Kriegel and Peer Kröger and Arthur Zimek",
howpublished = "Tutorial, 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining",
address = "Washington DC",
year = "2010",
url = "http://www.dbs.ifi.lmu.de/~zimek/publications/KDD2010/kdd10outliertutorial.pdf",
}