What Is Deduplication & How Does It Work?

What is Deduplication?

Data deduplication, used as part of the backup process, eliminates redundant data in two ways. It replaces duplicate copies of data (think of all those duplicate PowerPoint files you probably have on your PC or Mac) with references to a single stored copy, and it saves only the data blocks that have changed since the last backup.

Because only a small percentage of data changes from day to day, deduplication can yield large savings, up to 90%, in storage capacity costs alone.

How Deduplication Works

Deduplication systems segment all incoming data (in this case, backup data) into “chunks” and calculate a checksum, or hash, that identifies each chunk. The system then compares each incoming chunk’s checksum against those of chunks already backed up. If an incoming chunk’s checksum has not been seen before, the chunk has not been previously backed up, and it is saved. If the checksum matches that of a previously backed up chunk, the system stores only a reference to that existing chunk. Eliminating multiple redundant copies of data in this way greatly reduces storage requirements.
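The chunk-and-checksum flow described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular backup product works: the ChunkStore class, the 4-byte chunk size, and the choice of SHA-256 as the checksum are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4  # assumption: deliberately tiny so the example is easy to follow


class ChunkStore:
    """Hypothetical in-memory store that keeps each distinct chunk once."""

    def __init__(self):
        self.chunks = {}  # checksum -> chunk bytes

    def backup(self, data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
        """Split data into chunks and return the list of checksums (a 'recipe')."""
        recipe = []
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                # Checksum not seen before: save the chunk itself.
                self.chunks[digest] = chunk
            # Seen or not, the backup only records a reference to the chunk.
            recipe.append(digest)
        return recipe

    def restore(self, recipe: list) -> bytes:
        """Rebuild the original data by following the references."""
        return b"".join(self.chunks[digest] for digest in recipe)


store = ChunkStore()
recipe = store.backup(b"AAAABBBBAAAACCCC")
print(len(recipe), "references,", len(store.chunks), "chunks stored")
```

Here the 16 bytes of input become four references but only three stored chunks, because the second "AAAA" chunk matches one that was already saved.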

What do we mean by a data “chunk”? Deduplication can be carried out at the file level or the block level. The block level is much more granular than the file level.

Deduplication may be carried out online (duplicates are found while the backup is being created) or offline (deduplication occurs after the backup completes, by reading through the backup file and removing duplicates).

There are trade-offs associated with file vs. block and online vs. offline deduplication. Which is right for you? That depends on your specific needs.


Deduplication in its various forms is a highly recommended “best practice.”

File level deduplication

File level deduplication eliminates identical copies of entire files. Instead of calculating a checksum for each data block, it calculates one for the whole file. A file is backed up only when its checksum does not match that of any file previously seen by the deduplication system. File level deduplication is often called single-instance storage.
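As a rough sketch of the idea, the snippet below checksums whole files and keeps one stored copy per distinct checksum. The dedupe_files function, the file names, and the use of SHA-256 are illustrative assumptions, and the "files" are held in memory rather than read from disk.

```python
import hashlib


def dedupe_files(files: dict) -> tuple:
    """Hypothetical helper: files maps file name -> content bytes."""
    store = {}    # checksum -> content, stored once per distinct file
    catalog = {}  # file name -> checksum of the stored copy it refers to
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        store.setdefault(digest, content)  # save only if not seen before
        catalog[name] = digest
    return store, catalog


files = {
    "deck.pptx": b"quarterly results",
    "deck (copy).pptx": b"quarterly results",
    "deck-v2.pptx": b"quarterly results!",
}
store, catalog = dedupe_files(files)
print(len(files), "files,", len(store), "stored copies")
```

The exact copy is stored once, but the version that differs by a single byte is stored in full, because a whole-file checksum cannot tell that most of the content is shared.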

Block level deduplication

Block level deduplication offers the granularity that’s required to drastically reduce backup overhead. It divides a volume into fixed or variable sized blocks and calculates a checksum for each block. Unique blocks are backed up, and redundant blocks are replaced by references to a matching block that was previously backed up. The smaller the specified block size, the less redundant data is backed up and saved. The trade-off is that a very small block size means many more checksums have to be calculated and compared.
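To make the block size trade-off concrete, the sketch below deduplicates the same data with three fixed block sizes and reports how many checksums were computed and how many bytes had to be stored. The block_stats function and the sample data (two 64-byte "backups" that differ in one byte) are assumptions chosen for illustration; variable sized blocks are not covered.

```python
import hashlib


def block_stats(data: bytes, block_size: int) -> tuple:
    """Return (checksums computed, bytes stored) for fixed-size blocks."""
    seen = set()
    checksums = 0
    stored = 0
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        checksums += 1
        digest = hashlib.sha256(block).digest()
        if digest not in seen:
            seen.add(digest)
            stored += len(block)  # only unique blocks consume storage
    return checksums, stored


# Two 64-byte "backups" that differ in a single byte.
first = bytes(range(64))
second = bytearray(first)
second[10] = 255
data = first + bytes(second)

for size in (64, 16, 4):
    print(size, block_stats(data, size))
# prints:
# 64 (2, 128)
# 16 (8, 80)
# 4 (32, 68)
```

With 64-byte blocks nothing is saved, since the one changed byte makes the whole second block unique. With 4-byte blocks only 4 extra bytes are stored, at the cost of 32 checksum computations instead of 2.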

Online deduplication

Online deduplication systems operate between the backup system and the nearline storage device. As data moves from the system to the device, deduplication works on the fly, reducing both the storage capacity required and the amount of data transferred to storage.
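One way to picture on-the-fly deduplication is as a filter sitting in the data path that forwards full chunks only the first time it sees them. The generator below is a simplified sketch under that assumption; the inline_dedupe name, the "data" and "ref" message kinds, and the in-memory set of known checksums are all hypothetical.

```python
import hashlib


def inline_dedupe(stream, known: set):
    """Yield ('data', digest, chunk) for new chunks, ('ref', digest, None) for duplicates."""
    for chunk in stream:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest in known:
            # Duplicate: send only a reference, no payload crosses the wire.
            yield ("ref", digest, None)
        else:
            known.add(digest)
            yield ("data", digest, chunk)


known = set()
stream = [b"AAAA", b"BBBB", b"AAAA", b"AAAA"]
messages = list(inline_dedupe(stream, known))
payload_bytes = sum(len(chunk) for kind, _, chunk in messages if kind == "data")
print([kind for kind, _, _ in messages], payload_bytes)
# prints: ['data', 'data', 'ref', 'ref'] 8
```

Of the 16 bytes entering the filter, only 8 bytes of chunk payload are forwarded to the storage device; the rest travel as references.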

Offline deduplication

Offline deduplication systems process backups after they have been written to disk. Duplicate data is identified within and across backups at either the file or block level and replaced with references. The offline approach has two disadvantages. First, the backup has to be saved before deduplication, so short-term storage capacity must be available for the full, undeduplicated backup. Second, backup files have to be read, copied, and the new deduplicated backup written to disk, which leads to high disk bandwidth demands.
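The two costs described above can be made visible with a small sketch that deduplicates backups after they have been staged in full. The offline_dedupe function, its report fields, and the 4-byte block size are assumptions for illustration, and byte strings stand in for backup files on disk.

```python
import hashlib


def offline_dedupe(staged_backups: list, block_size: int = 4) -> dict:
    """Deduplicate already-written backups and report the storage and I/O involved."""
    store = {}     # checksum -> block, shared across all backups
    recipes = []   # one list of block references per backup
    bytes_read = 0
    for backup in staged_backups:
        recipe = []
        for start in range(0, len(backup), block_size):
            block = backup[start:start + block_size]
            bytes_read += len(block)  # every staged byte must be read back
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)
            recipe.append(digest)
        recipes.append(recipe)
    return {
        "staging_bytes": sum(len(b) for b in staged_backups),  # space needed up front
        "bytes_read": bytes_read,
        "bytes_written": sum(len(b) for b in store.values()),  # deduplicated result
        "recipes": recipes,
    }


report = offline_dedupe([b"AAAABBBBCCCC", b"AAAABBBBDDDD"])
print(report["staging_bytes"], report["bytes_read"], report["bytes_written"])
# prints: 24 24 16
```

The final result occupies 16 bytes, but getting there required 24 bytes of staging space and a full 24-byte read pass in addition to the 16 bytes written.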

