Reading an article from Umumble but it seems not very well formatted. So I do the math here again.
Consider a hash function (e.g., hashing an arbitrary large file into MD5 checksum) that produces bit output. Since the input space is arbitrary large, we can consider the output space is surjectve, i.e., . When we apply on the output again (first time repetition), the input space is . The probability that one output is not correspond to a particular input is , hence the probability that one output is not correspond to any input equals to . Consider the general case that the input space to is and its output space is still . The probability that a particular output is not correspond to any input is
where the last approximation is assuming large enough.
Denote to be the input space for repetitions, i.e. , , , then = . Hence monotonically decreases (i.e. repeated hashing is bad in terms of collision count) but the rate of decrease is reduced exponentially (i.e. repeated hashing do more harm at the first couple repetitions).
To numerically verify the reduction of output space, run the following Python code:
#!/usr/bin/env python from math import exp, log K, N = 100000, 2**128 Mk = N print "k\tlog2(M_k)\n1\t%.3f" % (log(Mk)/log(2)) for i in range(1,K+1): Mk = Mk * exp(-Mk/N) print "%d\t%.3f" % (i+1, log(Mk)/log(2))
which gives the following output ( and up to ):
k log2(M_k) 1 128.000 2 126.557 3 126.027 4 125.659 5 125.374 .... 99997 111.390 99998 111.390 99999 111.390 100000 111.390