Deduplicator
A pretty simple and efficient data deduplicator. It works by hard linking files that have the same content. It allowed me to cram a set of backups that originally took over 700GB into just 400GB. It can save and restore intermediate results, so you can run it in a few short sessions instead of one long one. You can also take your time to review the changes before they are committed to disk.
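To illustrate the idea of hard linking files with identical content, here is a minimal Python sketch. It is not the code of dedup.pl (which is written in Perl); the function names, the grouping by size and MD5 digest, and the temporary-name suffix are all assumptions made for this example.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def find_duplicates(roots):
    """Group regular files under the given roots by (size, digest)."""
    groups = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                # Skip symbolic links and anything that is not a regular file.
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                key = (os.path.getsize(path), file_digest(path))
                groups[key].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]


def link_duplicates(groups):
    """Replace each duplicate with a hard link to the first file of its group.

    Returns the number of files that were replaced.
    """
    replaced = 0
    for keeper, *others in groups:
        keeper_stat = os.stat(keeper)
        for path in others:
            stat = os.stat(path)
            if stat.st_dev != keeper_stat.st_dev:
                continue  # hard links cannot cross filesystems
            if stat.st_ino == keeper_stat.st_ino:
                continue  # already linked to the same data
            # Hypothetical temporary name: link first, then rename over the
            # duplicate, so the path never disappears in between.
            temp = path + ".dedup-tmp"
            os.link(keeper, temp)
            os.replace(temp, path)
            replaced += 1
    return replaced
```

Splitting the work into `find_duplicates` and `link_duplicates` mirrors the idea of reviewing the planned changes first: the list of groups can be inspected before any link is made.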
Installation and configuration
All you need to make this software work is a typical installation of Perl. The only module it requires is Digest::MD5, which ships with Perl on most systems.
You can get the version current as of this writing from here. Inside you will find the script dedup.pl, which is all that you need. The other scripts there are just auxiliary. You can put the script somewhere in your $PATH, but I guess most people will use it just once, so it may as well stay where you unpacked it.
Running and getting help
The basic way to run deduplication on a few locations is something along these lines:
./dedup.pl -i progressfile -o progressfile -l /location/1 /location/2
For more details, see the perldoc:
perldoc dedup.pl
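The -i and -o options in the command above both point at a progress file, which is presumably how intermediate results are restored and saved between runs. The actual file format used by dedup.pl is not described here, so the following Python sketch only illustrates the general idea of resumable digesting. The JSON format and all function names are assumptions made for this example.

```python
import hashlib
import json
import os


def load_progress(path):
    """Return previously saved {file path: digest} results, or {} if none exist."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def save_progress(path, digests):
    """Write results to a temporary file, then rename it over the old one."""
    temp = path + ".tmp"
    with open(temp, "w", encoding="utf-8") as handle:
        json.dump(digests, handle)
    os.replace(temp, path)


def digest_some(files, digests, limit):
    """Digest up to `limit` files not yet in `digests`; return how many were done."""
    done = 0
    for name in files:
        if done >= limit:
            break
        if name in digests:
            continue  # already digested in an earlier run
        # Reads the whole file at once for brevity; a real tool would chunk.
        with open(name, "rb") as handle:
            digests[name] = hashlib.md5(handle.read()).hexdigest()
        done += 1
    return done
```

A run would call `load_progress`, then `digest_some` with whatever budget fits the session, then `save_progress`. The next run picks up where the previous one stopped.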
Effectiveness
I wrote this software because I needed to cram well over 700GB of old backups into a 500GB drive. As I could operate only on the 500GB drive, I had to do it incrementally. The first run, on a 2.5" USB hard drive, was performed on 465GiB of data. I did the digesting and linking in separate runs (first without -l, then with it). Digesting took:
user 7m33.034s
sys 4m27.363s
And then the actual linking:
user 1m19.775s
sys 0m56.808s
It saved around 135GiB of disk space. Most of the time was spent waiting for disk reads, so even if the program itself ran in zero time, we would not gain much.
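For a quick sense of scale, the numbers above can be added up as follows. This is only arithmetic on the quoted figures: they are CPU time (user plus sys), and the elapsed wall-clock time is not listed here, so it is not part of the calculation. The helper name is made up for this example.

```python
def seconds(minutes, secs):
    """Convert a minutes/seconds pair, as printed by `time`, to seconds."""
    return minutes * 60 + secs


# CPU time = user + sys, taken from the two runs quoted above.
digest_cpu = seconds(7, 33.034) + seconds(4, 27.363)
link_cpu = seconds(1, 19.775) + seconds(0, 56.808)

# Share of the 465GiB input that was reclaimed.
saved_fraction = 135 / 465

print(f"digesting CPU time: {digest_cpu:.1f} s")   # 720.4 s, about 12 minutes
print(f"linking CPU time:   {link_cpu:.1f} s")     # 136.6 s, a bit over 2 minutes
print(f"space saved:        {saved_fraction:.1%}") # 29.0%
```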
Development
If you want to request changes, new features, report bugs or just see more, please visit the Fossil repository at:
http://dev.lrem.net:8003/