Deduplicator

A fairly simple and efficient data deduplicator. It works by hard linking files that have the same content. It allowed me to cram a set of backups that was originally over 700GB into just 400GB. It can save and restore intermediate results, so you can run it in several short sessions. You can also take your time to review the changes before they are committed to disk.
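To give an idea of the mechanism, here is a minimal Perl sketch of digest-then-hard-link deduplication. It is not the code of dedup.pl itself, just an illustration of the general approach using Digest::MD5; the real script additionally saves progress and lets you review the planned changes before they are applied.

use strict;
use warnings;
use File::Find;
use Digest::MD5;

my %seen;  # content digest => path of the first file seen with that content

find({ no_chdir => 1, wanted => sub {
    my $path = $File::Find::name;
    return unless -f $path && ! -l $path;   # plain files only, skip symlinks
    open my $fh, '<', $path or return;
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    if (my $original = $seen{$digest}) {
        # Replace the duplicate with a hard link to the first copy.
        # A careful tool would do more here (same-filesystem check,
        # linking to a temporary name first); dedup.pl stages its
        # changes so they can be reviewed before being committed.
        unlink $path and link $original, $path;
    } else {
        $seen{$digest} = $path;
    }
}}, @ARGV);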

Installation and configuration

All you need to make this software work is a typical installation of Perl. The only module it requires outside the standard library is Digest::MD5, which is present on most systems anyway. You can get the version current as of this writing from here. Inside the archive you will find the script dedup.pl, which is all that you need; the other scripts are just auxiliary. You can put the script somewhere in your $PATH, but most people will probably use it only once, so it can also stay where you unpacked it.
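A quick way to check whether Digest::MD5 is available is to ask Perl to load it (this is a generic check, not something shipped with the package):

perl -MDigest::MD5 -e 'print "Digest::MD5 $Digest::MD5::VERSION\n"'

If the module is missing, Perl will complain that it cannot locate Digest/MD5.pm.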

Running and getting help

The basic way to run deduplication on a few locations is something along the lines of:

./dedup.pl -i progressfile -o progressfile -l /location/1 /location/2 

To get more details, see the perldoc: perldoc dedup.pl

Effectiveness

I wrote this software because I needed to cram well over 700GB of old backups into a 500GB drive. As I could operate only on the 500GB drive, I had to do it incrementally. The first run, on a 2.5" USB hard drive, was performed on 465GiB of data. I did the digesting and linking in separate runs (first without -l, then with it). The digesting took:

real 89m49.877s
user 7m33.034s
sys 4m27.363s

And then the actual linking:

real 17m21.964s
user 1m19.775s
sys 0m56.808s

It saved around 135GiB of disk space. As the numbers clearly show, most of the time was spent waiting for disk reads; even if the program itself ran in zero time, we would not gain much.
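For illustration, such a split run could look roughly like this. This is a hedged sketch that reuses the flags from the example above and assumes the first run only needs to write the progress file, which is then fed back into the linking run; check the perldoc for the exact flag semantics.

./dedup.pl -o progressfile /location/1 /location/2
./dedup.pl -i progressfile -o progressfile -l /location/1 /location/2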

Download and Development

Download current version

If you want to request changes or new features, report bugs, or just see more, please visit the Fossil repository at: http://dev.lrem.net/dedup