Suppose you have a million lines of data as ordered pairs and you want to plot them in gnuplot. That is likely to bring your computer to its knees. Besides, a million dots on a graph rarely conveys more information than a thousand. So the question is how to sample a thousand dots from a million.

With gnuplot, you can pipe data through a command. For example, instead of

plot "million.dots" using 1:2 with points


we can write

plot "< cat million.dots | sample 1000" using 1:2 with points


so that the data are sampled by the sample script before being plotted.

## Reservoir sampling

Sampling is trivial when the whole data set is at hand, but online sampling, where the input arrives as a stream, may not be. Reservoir sampling performs online sampling with a pool (the reservoir) of fixed size k. The requirement is that every input line ends up in the sample with equal probability. See my contribution on Wikipedia for a mathematical proof. Below is my code, reservoir.pl:

#!/usr/bin/perl
#
# Script to do reservoir sampling. The script takes the first parameter as the
# reservoir size k and then samples the input into k lines. The output
# is sorted in the original order.
#
# Synopsis:
#    cat BigFile | reservoir.pl 4
#  or
#    reservoir.pl 4 BigFile
#
use strict;
use warnings;
use 5.010;

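my $k = shift // 1;   # reservoir size, defaults to 1
my @lines;            # the reservoir: [line number, line text] pairs
my $n = 0;            # number of lines read so far
srand;
while (<>) {
    $n++;
    if ($n <= $k) {
        # The first k lines fill the reservoir.
        push @lines, [$n, $_];
    } elsif (rand() < $k / $n) {
        # Keep line n with probability k/n, replacing a random slot.
        $lines[int(rand($k))] = [$n, $_];
    }
}
# Print the sample in the original input order.
foreach (sort { $$a[0] <=> $$b[0] } @lines) {
    print $$_[1];
}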
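
To see why the script gives every line the same chance of ending up in the output, here is a sketch of the standard argument. The symbols i and j are notation introduced here, not taken from the script; n is the total number of lines read and k is the reservoir size. Line i (with i > k) enters the reservoir with probability k/i. A later line j replaces it only if line j is accepted (probability k/j) and happens to pick its slot (probability 1/k), so line i survives step j with probability 1 - 1/j = (j-1)/j. The product telescopes:

$$
P(\text{line } i \text{ is in the final sample})
  = \frac{k}{i} \prod_{j=i+1}^{n} \frac{j-1}{j}
  = \frac{k}{i} \cdot \frac{i}{n}
  = \frac{k}{n}.
$$

For the first k lines, the factor k/i is replaced by 1 and the product starts at j = k+1, which again gives k/n.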