Assume you have a million lines of data in ordered pairs and you want to plot them in gnuplot. That’s likely to kill your computer. Moreover, plotting a million dots on a graph does not give more information than plotting a thousand dots. The issue, therefore, is how to sample a thousand dots from a million.
gnuplot allows you to pipe data through a command. For example, instead of
plot "million.dots" using 1:2 with points
we can have
plot "< cat million.dots | sample 1000" using 1:2 with points
so that the data are sampled by the script sample.
Reservoir sampling
Sampling is trivial when the whole data set is at hand, but online sampling,
where the input arrives as a stream of unknown length, may not be. Reservoir
sampling does online sampling with a pool of fixed size. The requirement here
is that every input line ends up in the sample with equal probability. Check
out my contribution on Wikipedia for a mathematical proof. Below is my code,
reservoir.pl:
#!/usr/bin/perl
#
# Script to do reservoir sampling. The script takes its first parameter as the
# reservoir size k and samples the input down to k lines. The output keeps the
# original line order.
#
# Synopsis:
# cat BigFile | reservoir.pl 4
# or
# reservoir.pl 4 BigFile
#
use strict;
use warnings;
use 5.010;
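my $k = shift // 1;    # reservoir size; defaults to 1
my @lines;             # reservoir of [line number, line text] pairs
my $n = 0;             # number of lines read so far
srand;
while (<>) {
    $n++;
    if ($n <= $k) {
        # The first k lines fill the reservoir
        push @lines, [$n, $_];
    } elsif (rand() < $k / $n) {
        # Keep line n with probability k/n, replacing a uniformly chosen slot
        $lines[ int(rand($k)) ] = [$n, $_];
    }
}
# Print the sample in the original input order
print $_->[1] for sort { $a->[0] <=> $b->[0] } @lines;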
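As a rough sketch of why the script above samples every line with equal probability, here is the induction step written out. Assume that after n lines (with n ≥ k) each line read so far sits in the reservoir with probability k/n. This holds for n = k, where every line is kept with probability 1. The symbol i stands for any earlier line number and is introduced only for this illustration.

```latex
\begin{align*}
P(\text{line } n+1 \text{ enters the reservoir})
  &= \frac{k}{n+1} \\
P(\text{line } i \le n \text{ is not evicted at step } n+1)
  &= 1 - \frac{k}{n+1}\cdot\frac{1}{k} = \frac{n}{n+1} \\
P(\text{line } i \le n \text{ is in the reservoir after } n+1 \text{ lines})
  &= \frac{k}{n}\cdot\frac{n}{n+1} = \frac{k}{n+1}
\end{align*}
```

So after the line n+1 is processed, every line seen so far, old or new, is in the reservoir with the same probability k/(n+1). This is the requirement stated earlier.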