Deduplicator

A simple and efficient data deduplicator. It works by hard linking files that have the same content. It allowed me to cram a set of backups that originally took over 700GB into just 400GB. It can save and restore intermediate results, so you can run it in several short sessions. You can also take your time to review the changes before they are committed to disk.
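To make the idea concrete, here is a minimal Python sketch of deduplication by hard linking. It is not the code of dedup.pl (which is written in Perl); the function names plan_links and apply_links, the ".dedup-tmp" suffix, and the choice to group files by size plus MD5 digest are assumptions made for illustration. Planning and applying are kept separate to mirror the "review before committing" workflow described above.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def plan_links(roots):
    """Return (duplicate, original) pairs without changing anything on disk."""
    by_content = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                # Skip symlinks and anything that is not a regular file.
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                key = (os.path.getsize(path), file_digest(path))
                by_content[key].append(path)
    plan = []
    for paths in by_content.values():
        original = paths[0]
        for duplicate in paths[1:]:
            # Files that already share an inode need no work.
            if not os.path.samefile(original, duplicate):
                plan.append((duplicate, original))
    return plan


def apply_links(plan):
    """Replace each duplicate with a hard link to its original."""
    for duplicate, original in plan:
        temporary = duplicate + ".dedup-tmp"  # hypothetical scratch name
        os.link(original, temporary)          # new name for the same inode
        os.replace(temporary, duplicate)      # atomically swap it into place
```

A caller would inspect the result of plan_links(["/location/1", "/location/2"]) and only then pass it to apply_links. Two caveats apply to any tool of this kind: hard links only work within a single filesystem, and linked files share their content, so modifying one name modifies them all. The sketch also trusts the digest; a more cautious variant would compare the bytes of candidate files before linking.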

Installation and configuration

All you need to make this software work is a typical installation of Perl. The only module outside the standard library it requires is Digest::MD5, which is actually found on most systems.

You can get the version current as of this writing from here. Inside you will find the script dedup.pl, which is all that you need. The other scripts there are just auxiliary. You can put the script somewhere in your $PATH, but I guess most people will use it just once, so it can stay where you unpacked it.

Running and getting help

The basic way to run deduplication on a few locations is something along these lines:

./dedup.pl -i progressfile -o progressfile -l /location/1 /location/2
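The progress file is what lets the work be split across several runs. The format dedup.pl actually uses is not described here, so the following Python sketch only illustrates the general idea under stated assumptions: a JSON file mapping each path to its digest, and the hypothetical helpers load_progress, save_progress, and digest_tree.

```python
import hashlib
import json
import os


def load_progress(path):
    """Load a {file path: digest} map, or start empty if no file exists."""
    if path and os.path.exists(path):
        with open(path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    return {}


def save_progress(path, digests):
    """Write the map via a temporary file so an interrupted save keeps the old one."""
    temporary = path + ".tmp"
    with open(temporary, "w", encoding="utf-8") as handle:
        json.dump(digests, handle)
    os.replace(temporary, path)


def digest_tree(root, digests):
    """Hash only files missing from `digests`; return how many were hashed."""
    hashed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if path in digests or not os.path.isfile(path):
                continue
            # Reads the whole file at once; fine for a sketch, not for huge files.
            with open(path, "rb") as handle:
                digests[path] = hashlib.md5(handle.read()).hexdigest()
            hashed += 1
    return hashed
```

Loading the map, calling digest_tree on each location, and saving the map again means that a later run skips every file already recorded, which is the behaviour that passing the same file to both -i and -o suggests. Note that this sketch never notices when a recorded file changes afterwards; a real tool would have to decide how to handle that.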

To get more details see the perldoc:

 perldoc dedup.pl

Effectiveness

I wrote this software because I needed to cram well over 700GB of old backups onto a 500GB drive. As I could operate only on the 500GB drive, I had to do it incrementally. The first run, on a 2.5" USB hard drive, processed 465GiB of data. I did the digesting and linking in separate runs (first without -l, then with it). Digesting took:

real    89m49.877s
user    7m33.034s
sys     4m27.363s

And then the actual linking:

real    17m21.964s
user    1m19.775s
sys     0m56.808s

It saved around 135GiB of disk space. As we can clearly see, most of the time was spent waiting for disk reads: user and sys time together account for only about 12 of the 90 minutes of digesting. Even if the program itself ran in zero time, we would not gain much.
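The arithmetic behind that conclusion can be checked directly from the timings above. This small Python sketch assumes the script runs as a single process, so that user plus sys time is all the CPU work done and the rest of the wall-clock time is waiting; the helper names are made up for the example.

```python
def seconds(minutes, secs):
    """Convert the XmY.YYYs notation printed by `time` into seconds."""
    return minutes * 60 + secs


def cpu_share(real, user, sys_time):
    """Fraction of wall-clock time during which the CPU was actually busy."""
    return (user + sys_time) / real


digest = cpu_share(seconds(89, 49.877), seconds(7, 33.034), seconds(4, 27.363))
link = cpu_share(seconds(17, 21.964), seconds(1, 19.775), seconds(0, 56.808))

print(f"digesting: {digest:.1%} of the time on the CPU")  # about 13.4%
print(f"linking:   {link:.1%} of the time on the CPU")    # about 13.1%
```

In both runs roughly 87% of the elapsed time was spent off the CPU, so even a program with no CPU cost at all would finish only about 13% sooner.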

Development

If you want to request changes or new features, report bugs, or just see more, please visit the Fossil repository at:
http://dev.lrem.net:8003/