This is from a master's thesis that tries to downsample a line chart (time series) while retaining all of its visual characteristics. What I knew about downsampling is to treat the input as a signal and pass it through a low-pass filter, as taught in the undergraduate signals & systems course. But that gives no guarantee the visual characteristics will not be distorted (Gibbs effect!), and in particular we cannot be sure the maxima and minima are retained.

The intuitive way to do downsampling is to split the \(x\) domain of the line chart into equal segments and pick one point from each segment. Each segment is called a bucket in the thesis. We can pick the points by taking the mean, mode, median, min, or max of the \(y\) values in each bucket. Strictly speaking, the data points after downsampling should be a subset of the input. If this rule applies, we cannot use the mean as the selection method (nor can we distort the \(x\) value of the selected data point). But each of these simple selection rules has certain weaknesses, for example:

  • cannot retain local peaks and troughs in the downsample
  • cannot keep both the peak and the trough from the same bucket in the downsample
  • cannot retain the fluctuation pattern in the downsample

This simple, intuitive way of downsampling (using the mode or median as the selection rule) is named “mode-median-bucket” in the thesis.
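
A minimal sketch of this kind of bucketed selection, assuming the input is a list of \((x, y)\) tuples sorted by \(x\); for brevity the buckets here hold equal numbers of points rather than equal \(x\) intervals, and the median is used as the selection rule (picking the original point whose \(y\) is closest to the bucket median, so the output stays a subset of the input):

```python
from statistics import median

def bucket_downsample(points, n_out):
    """Split points (sorted by x) into n_out buckets and keep one original
    point per bucket: the point whose y value is closest to the bucket's
    median y, so the output stays a subset of the input."""
    buckets = [points[i * len(points) // n_out:(i + 1) * len(points) // n_out]
               for i in range(n_out)]
    out = []
    for bucket in buckets:
        if not bucket:            # can happen when n_out > len(points)
            continue
        m = median(p[1] for p in bucket)
        out.append(min(bucket, key=lambda p: abs(p[1] - m)))
    return out
```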

An improved method is named “longest line bucket” (LLB). Similar bucketing is applied, but with the leftmost and rightmost data points each placed in their own bucket. An edge is then drawn between every pair of points across a bucket boundary; for example, if bucket \(i\) and bucket \(i+1\) have \(N_i\) and \(N_{i+1}\) points respectively, \(N_iN_{i+1}\) edges are created between them. The edges are directed from left to right, so the whole thing becomes a DAG. Each edge carries a weight, defined as the Euclidean distance between its two points. Downsampling then means finding the path from the first to the last data point on the DAG that maximizes the sum of edge weights; since the graph is a DAG, this can be done with dynamic programming in left-to-right order. The LLB algorithm retains most visual characteristics because it has a high likelihood of including local maxima and minima. However, finding the maximum-weight path requires dynamic programming over all the edges, and this can be slow.
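
A rough sketch of this DAG formulation, assuming many more points than buckets so every bucket is non-empty; the maximum-weight path is found by dynamic programming over one bucket (layer) at a time, and bucket boundaries are by point count rather than \(x\) interval for brevity:

```python
import math

def llb_downsample(points, n_out):
    """Longest-Line-Bucket sketch: pick one point per bucket so that the
    polyline through the picked points has maximum total Euclidean length.
    Solved as a longest-path problem on a DAG, one bucket (layer) at a time."""
    first, last, middle = points[0], points[-1], points[1:-1]
    n_mid = n_out - 2
    buckets = [[first]] + [
        middle[i * len(middle) // n_mid:(i + 1) * len(middle) // n_mid]
        for i in range(n_mid)
    ] + [[last]]

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    # best[j] = (length of the best path ending at point j of the current
    #            bucket, back-pointer into the previous bucket)
    best = [(0.0, None)]                      # only the first point so far
    back = []                                 # back-pointers, one list per bucket
    for prev, cur in zip(buckets, buckets[1:]):
        nxt = []
        for p in cur:
            w, j = max((best[j][0] + dist(q, p), j) for j, q in enumerate(prev))
            nxt.append((w, j))
        back.append(nxt)
        best = nxt

    # walk the back-pointers from the last point to recover the path
    path, j = [], 0
    for b, bucket in zip(reversed(back), reversed(buckets[1:])):
        path.append(bucket[j])
        j = b[j][1]
    path.append(first)
    return path[::-1]
```

The cost is dominated by the inner loop, i.e. the sum of \(N_iN_{i+1}\) over all bucket boundaries, which is why LLB is slow compared to the bucket-local methods below.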

A similar problem from cartography is simplifying a polyline on a map. The thesis mentions two algorithms for this (rough sketches of both follow the list):

  1. Douglas-Peucker algorithm (1973)
    • start: draw a line segment from the first to the last point
    • select the point that is furthest away from the existing line segments, provided that distance is above a certain threshold
    • revise the line segment(s) to include that point
    • recurse until every point not included in the line segments has a distance below the threshold
  2. Visvalingam-Whyatt algorithm (1993)
    • define the effective area of a point as the area of the triangle formed by that point and its two adjacent points (this accounts for both the distances and the angle between points)
    • assign each point the weight of its effective area
    • repeatedly remove the point of least weight, then recalculate the effective areas and reassign the weights
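
Rough Python sketches of both algorithms, assuming points are \((x, y)\) tuples; these are the plain, unoptimized versions (a recursive Douglas-Peucker and an \(O(n^2)\) Visvalingam-Whyatt loop):

```python
import math

def douglas_peucker(points, threshold):
    """Keep the end points; recursively keep the point farthest from the
    current segment while that distance exceeds the threshold."""
    def perp_dist(p, a, b):
        # distance from p to the line through a and b
        (px, py), (ax, ay), (bx, by) = p, a, b
        num = abs((bx - ax) * (ay - py) - (ax - px) * (by - ay))
        den = math.hypot(bx - ax, by - ay)
        return num / den if den else math.hypot(px - ax, py - ay)

    if len(points) < 3:
        return list(points)
    d, idx = max((perp_dist(p, points[0], points[-1]), i)
                 for i, p in enumerate(points[1:-1], start=1))
    if d <= threshold:
        return [points[0], points[-1]]
    left = douglas_peucker(points[:idx + 1], threshold)
    right = douglas_peucker(points[idx:], threshold)
    return left[:-1] + right          # avoid duplicating the split point


def effective_area(a, b, c):
    """Area of the triangle formed by a point b and its neighbours a and c."""
    return abs((a[0] - c[0]) * (b[1] - a[1]) - (a[0] - b[0]) * (c[1] - a[1])) / 2


def visvalingam_whyatt(points, n_out):
    """Repeatedly drop the interior point with the smallest effective area,
    recomputing areas after each removal, until n_out points remain.
    Assumes n_out >= 2; a heap would bring this down to O(n log n)."""
    pts = list(points)
    while len(pts) > n_out:
        areas = [effective_area(pts[i - 1], pts[i], pts[i + 1])
                 for i in range(1, len(pts) - 1)]
        del pts[areas.index(min(areas)) + 1]
    return pts
```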

We may apply the Visvalingam-Whyatt algorithm to downsample a line chart, but we have to restrict it from skipping too many consecutive data points; otherwise we are very likely to replace a region of minor fluctuation with a single line segment. Bucketing is the solution: it guarantees that at least one data point from each bucket is kept in the downsampled output.

Assigning each point its effective area as the weight and selecting the point with the greatest weight in each bucket into the downsample produces results visually comparable to the LLB method above, without the expensive path search. This is named the largest-triangle-one-bucket (LTOB) method. Keeping only the point with the largest weight per bucket implicitly excludes all points that are collinear with their adjacent points, since their effective area is zero.
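
A minimal LTOB sketch under the same assumptions as before (input sorted by \(x\), more points than buckets, \(n\_out \geq 3\)), reusing the effective_area helper from the Visvalingam-Whyatt sketch above; each point's triangle is formed with its immediate neighbours in the original series:

```python
def ltob_downsample(points, n_out):
    """Largest-Triangle-One-Bucket sketch: keep the first and last point,
    bucket the rest, and keep the point with the largest effective area
    (triangle with its immediate neighbours) from each bucket."""
    first, last, middle = points[0], points[-1], points[1:-1]
    n_mid = n_out - 2
    out = [first]
    for i in range(n_mid):
        lo = i * len(middle) // n_mid
        hi = (i + 1) * len(middle) // n_mid
        # middle[j] is points[j + 1], so its neighbours in the original
        # series are points[j] and points[j + 2]
        best = max(range(lo, hi),
                   key=lambda j: effective_area(points[j], points[j + 1], points[j + 2]))
        out.append(middle[best])
    out.append(last)
    return out
```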

The LTOB method can be improved by changing the definition of effective area: instead of a point's two adjacent points, use points in the adjacent buckets. Assume we have \(N_{i-1}\), \(N_i\), \(N_{i+1}\) points in the adjacent buckets \(i-1\), \(i\), \(i+1\) respectively. For each point in bucket \(i\), there are \(N_{i-1}N_{i+1}\) possible triangles, and we assign the maximum of their areas as the weight of that point. However, if we scan the line chart from left to right, with the first and last data points each in their own bucket, we have already selected a point in bucket \(i-1\) by the time we work on bucket \(i\). Therefore we only need to find the maximum effective area among the \(N_{i+1}\) options for each point, and the complexity of selecting a point in bucket \(i\) is only \(O(N_iN_{i+1})\), i.e. \(O(N^2)\) if each bucket holds about \(N\) points. A way to further improve this to \(O(N)\) is to use a fixed imaginary point in bucket \(i+1\) for the effective-area calculation of the points in bucket \(i\). The imaginary point is defined as the mean of all points in that bucket. Since we work on three buckets at a time, this is the largest-triangle-three-buckets (LTTB) method.
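
A minimal LTTB sketch, again reusing effective_area from above; the imaginary point is the mean of the next bucket (the very last bucket uses the final data point instead), so selecting a point costs only one pass over its bucket:

```python
def lttb_downsample(points, n_out):
    """Largest-Triangle-Three-Buckets sketch: for each bucket, form a
    triangle with the point already chosen in the previous bucket and the
    mean ('imaginary') point of the next bucket, and keep the point with
    the largest such triangle.  Assumes n_out >= 3 and sorted input."""
    first, last, middle = points[0], points[-1], points[1:-1]
    n_mid = n_out - 2
    bounds = [i * len(middle) // n_mid for i in range(n_mid + 1)]

    out = [first]
    for i in range(n_mid):
        bucket = middle[bounds[i]:bounds[i + 1]]
        # imaginary point = mean of the next bucket; the last bucket pairs
        # with the final data point instead
        nxt = middle[bounds[i + 1]:bounds[i + 2]] if i + 1 < n_mid else [last]
        avg = (sum(p[0] for p in nxt) / len(nxt),
               sum(p[1] for p in nxt) / len(nxt))
        prev = out[-1]
        out.append(max(bucket, key=lambda p: effective_area(prev, p, avg)))
    out.append(last)
    return out
```

Because each point in bucket \(i\) is compared only against the already-selected point in bucket \(i-1\) and one fixed imaginary point in bucket \(i+1\), the whole pass is linear in the number of input points.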

The above assumed the buckets divide the \(x\)-axis into intervals of equal size. Dynamic bucketing can be applied instead: start with equal-interval bucketing, then consider the sum of squared errors (SSE) of the \(y\) values of all points in a bucket with respect to their mean; split the bucket in two if the SSE is too large, and merge it with a neighbouring bucket if the SSE is too low. After running this splitting/merging for a number of iterations, the buckets become denser in regions of high fluctuation. The suggested number of iterations in the thesis is \(10 \times \frac{\textrm{size of input sample}}{\textrm{size of downsample}}\).
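
The exact split/merge rule is not spelled out above, so the following is only one possible reading: each iteration splits the bucket with the largest SSE in two and merges the adjacent pair with the smallest combined SSE, keeping the bucket count constant:

```python
def sse(bucket):
    """Sum of squared errors of the y values in a bucket against their mean."""
    mean = sum(p[1] for p in bucket) / len(bucket)
    return sum((p[1] - mean) ** 2 for p in bucket)

def rebalance_buckets(buckets, iterations):
    """Dynamic bucketing sketch: repeatedly split the bucket with the largest
    SSE and merge the adjacent pair with the smallest combined SSE, so that
    high-fluctuation regions end up with more (smaller) buckets."""
    for _ in range(iterations):
        errors = [sse(b) for b in buckets]
        hi = max(range(len(buckets)), key=lambda i: errors[i])    # split candidate
        lo = min(range(len(buckets) - 1),
                 key=lambda i: errors[i] + errors[i + 1])         # merge candidates
        if hi in (lo, lo + 1) or len(buckets[hi]) < 2:
            break                                                 # nothing useful to do
        half = len(buckets[hi]) // 2
        new = []
        for i, b in enumerate(buckets):
            if i == hi:
                new.extend([b[:half], b[half:]])      # split the high-SSE bucket
            elif i == lo:
                new.append(b + buckets[lo + 1])       # merge the low-SSE pair
            elif i == lo + 1:
                continue                              # absorbed into the merge
            else:
                new.append(b)
        buckets = new
    return buckets
```

After rebalancing, the per-bucket selection (e.g. LTTB) can be run on the resulting buckets as before.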

Reference

  1. Ben Lau pointed this out to me.