Assume you have one or more directories of files, some of which are duplicates. You want to remove the duplicates to conserve space. Here is the program in Python:

#!/usr/bin/env python3
# dedup.py - Sat, 23 Apr 2011 00:03:30 -0400
# The command-line arguments are files of md5sum output. The checksums are
# read one by one in order, and every file whose MD5 hash has already been
# seen is removed; only the first occurrence of each file is kept in its
# original path.

import os
import sys

hashes = {}  # MD5 hex digest -> path of the first file seen with that hash

for name in sys.argv[1:]:  # skip argv[0], the script's own name
    with open(name) as md5file:
        for line in md5file.read().splitlines():
            # md5sum lines: 32 hex chars, a 2-character separator, the path
            md5, path = line[0:32], line[34:]
            if md5 in hashes:
                print("Remove " + path)
                os.remove(path)
            else:
                print("Keep " + path)
                hashes[md5] = path


The code assumes the files to search for duplicates have been run through md5sum first, e.g.

find dir/ -type f -exec md5sum \{\} \; > md5sum.file
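Each line of md5sum output is a 32-character hex digest, a two-character separator, then the path, which is why the script slices at [0:32] and [34:]. A minimal sketch that reproduces and re-parses that format with Python's hashlib (the file name here is a made-up example):

```python
import hashlib

# Build a line in md5sum's format: digest, two-space separator, path.
digest = hashlib.md5(b"hello\n").hexdigest()
line = digest + "  " + "dir/hello.txt"

# Parse it back the way dedup.py does.
md5, path = line[0:32], line[34:]
assert md5 == digest
assert path == "dir/hello.txt"
```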


The md5sum files are provided as command-line arguments. Their order is significant: when several files share the same MD5 hash, only the one that appears first is kept.
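If you would rather skip the separate md5sum step, the same first-one-wins policy can be sketched directly in Python with hashlib and os.walk. This is an alternative to the script above, not part of it; the traversal order (sorted directories and file names) is an assumption standing in for the ordering the md5sum files would otherwise provide:

```python
import hashlib
import os

def dedup(root):
    """Walk root; remove every file whose MD5 matches an earlier file."""
    seen = {}  # MD5 hex digest -> path of the first file with that hash
    # Sort so the "first" file is deterministic (an assumed ordering).
    for dirpath, _dirnames, filenames in sorted(os.walk(root)):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                digest = hashlib.md5(fh.read()).hexdigest()
            if digest in seen:
                print("Remove " + path)
                os.remove(path)
            else:
                print("Keep " + path)
                seen[digest] = path
```

Note this reads each file fully into memory to hash it, which is fine for typical files but could be chunked for very large ones.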