Deduplicator
A pretty simple and efficient data deduplicator. It works by hard linking files that have the same content, which allowed me to cram a set of backups that was originally over 700GB into just 400GB. It can save and restore intermediate results, so you can run it in a few short sessions, and you can take your time to review the changes before they are committed to disk.
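To illustrate the idea, here is a minimal sketch (this is not the actual dedup.pl code; it skips the progress saving, the review step, and any safety checks, and assumes all locations live on one filesystem): digest every file, then hard link files whose digests match.

use strict;
use warnings;
use File::Find;
use Digest::MD5;

my %seen;  # content digest => first path seen with that content
find({ no_chdir => 1, wanted => sub {
    my $path = $File::Find::name;
    return unless -f $path;
    open(my $fh, '<', $path) or return;
    binmode($fh);
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    if (exists $seen{$digest}) {
        # Duplicate content: replace this copy with a hard link to the first one.
        unlink($path) and link($seen{$digest}, $path);
    } else {
        $seen{$digest} = $path;
    }
}}, @ARGV);

Because hard links share a single inode, each set of identical files ends up occupying the space of just one copy.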
Installation and configuration
All you need to make this software work is a typical installation of Perl. The only module it requires outside the standard library is Digest::MD5, which is included with most installations anyway.
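If you are not sure whether Digest::MD5 is present, a quick check (assuming perl is on your PATH) is:

perl -MDigest::MD5 -e 'print "Digest::MD5 $Digest::MD5::VERSION\n"'

If the module is missing, Perl will complain that it can't locate Digest/MD5.pm in @INC.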
You can get the version current as of this writing from here. Inside you will find the script dedup.pl, which is all that you need; the other scripts are just auxiliary. You can put the script somewhere in your $PATH, but most people will probably use it only once, so it can just as well stay where you unpacked it.
Running and getting help
The basic way to run deduplication on a few locations is something along the lines of:
./dedup.pl -i progressfile -o progressfile -l /location/1 /location/2
For more details, see the perldoc: perldoc dedup.pl
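The digesting and the linking can also be done in separate runs, as described under Effectiveness below: first run the same command without -l to only compute and save the digests, then repeat it with -l to perform the linking. A sketch of that workflow (check the perldoc for the exact semantics of each switch) could be:

./dedup.pl -i progressfile -o progressfile /location/1 /location/2
./dedup.pl -i progressfile -o progressfile -l /location/1 /location/2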
Effectiveness
I wrote this software because I needed to cram well over 700GB of old backups into a 500GB drive. As I could operate only on the 500GB drive, I had to do it incrementally. The first run, on the 2.5" USB hard drive, was performed on 465GiB of data. I did the digesting and the linking in separate runs (first without -l, then with it). The digesting took:
real 89m49.877s
user 7m33.034s
sys  4m27.363s
And then the actual linking:
real 17m21.964s
user 1m19.775s
sys  0m56.808s
It saved around 135GiB of disk space. As the wall-clock (real) time dwarfs the CPU (user + sys) time, most of the run was clearly spent waiting for disk reads; even if the program itself ran in zero time, we would not gain much.
Download and Development
If you want to request changes or new features, report bugs, or just see more, please visit the Fossil repository at: http://dev.lrem.net/dedup