Reservoir Sampling with Uniform Probability

Assume you have a million lines of data, each an ordered pair, and you want to plot them in gnuplot. That’s likely to kill your computer. Besides, plotting a million dots on a graph usually gives no more information than plotting a thousand. The question, then, is how to sample a thousand dots from a million.

gnuplot can read its data from the output of a command. For example, instead of

plot "million.dots" using 1:2 with points

we can have

plot "< cat million.dots | sample 1000" using 1:2 with points

so that the data are sampled by the script sample.

Reservoir sampling

Sampling is trivial when the whole data set is at hand, but online sampling, where the input arrives as a stream of unknown length, may not be. Reservoir sampling does online sampling with a pool (the reservoir) of fixed size. The requirement here is that every input line ends up in the sample with equal probability. Check out my contribution on Wikipedia for the mathematical proof. Below is my code, reservoir.pl:

#!/usr/bin/perl
#
# Script to do reservoir sampling. The script takes the first parameter as the
# reservoir size k and then samples the input down to k lines. The output
# is sorted in the original order.
#
# Synopsis:
#    cat BigFile | reservoir.pl 4
#  or
#    reservoir.pl 4 BigFile
#
use strict;
use warnings;
use 5.010;

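my $k = shift // 1;   # reservoir size; defaults to 1
my @lines;            # reservoir of [line number, text] pairs
my $n = 0;            # number of lines read so far
srand;
while(<>) {
	$n++;
	if ($n <= $k) {
		# Fill the reservoir with the first k lines
		push @lines, [$n,$_];
	} elsif (rand() < ($k/$n)) {
		# Keep line n with probability k/n, evicting a uniformly chosen slot
		$lines[ int(rand($k)) ] = [$n,$_];
	}
}
# Print the sample in the original input order
foreach (sort {$$a[0] <=> $$b[0]} @lines) { print $$_[1]; }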
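
For readers who want to see why the script samples every line with equal probability, here is a short induction sketch. It is my own summary of the standard argument, not a copy of the Wikipedia proof. The symbols k (reservoir size) and n (lines read so far) follow the script, and i stands for any earlier line number.

```latex
% Claim: after reading n >= k lines, each line i <= n is in the reservoir
% with probability k/n.
% Base case n = k: all k lines are kept, so the probability is 1 = k/k.
% Inductive step: assume the claim holds for n, and line n+1 arrives.
\begin{align*}
P(\text{line } n+1 \text{ enters})
  &= \frac{k}{n+1}, \\
P(\text{line } i \text{ survives step } n+1 \mid \text{line } i \text{ is in the reservoir})
  &= 1 - \frac{k}{n+1}\cdot\frac{1}{k} = \frac{n}{n+1}, \\
P(\text{line } i \text{ is in the reservoir after step } n+1)
  &= \frac{k}{n}\cdot\frac{n}{n+1} = \frac{k}{n+1}.
\end{align*}
```

The factor 1/k in the second line is the chance that the uniformly chosen slot, `int(rand($k))` in the script, is the one holding line i.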