01/29/09

Making a Hash of Enterprise Deduplication


First, let me introduce myself. I work for SEPATON helping large enterprises improve their backup environments. In this blog I’d like to share my observations and real-world experiences as I travel around the country speaking with data managers about the backup challenges they face and the smart ways they are meeting those challenges.

Lately, as I talk to enterprise data center managers, I am concerned that many have made the costly mistake of trying to use a hash-based, inline deduplication system in their enterprise data center. Hash systems are relatively simple and work well for SMB or department-level implementations where the volumes are small (<10TB full) and relatively slow restore times are acceptable.

However, in an enterprise data center, hash-based deduplication is about as effective as using a water pistol on the Chicago fire. Here are five reasons why:

1. Hash-based, inline deduplication makes virtual tape slower than physical tape.

Hash-based deduplication has to compute a hash for every block of data coming into the VTL before storing it on disk. This process can restrict individual Fibre Channel port performance to less than 50MB/sec. The problem is compounded by the fact that hash systems are typically limited to a single node for processing. With hash-based deduplication, the VTL you invested in for high performance will back up data more slowly than a tape drive.
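To make the inline bottleneck concrete, here is a minimal sketch of what "hash every block before storing it" looks like. The fixed 4 KB block size, the SHA-256 hash, and the names `inline_dedupe` and `store` are assumptions for illustration only, not any vendor's actual implementation.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size


def inline_dedupe(stream: bytes, store: dict) -> list:
    """Hash every incoming block before it is written; return the block recipe."""
    recipe = []
    for offset in range(0, len(stream), BLOCK_SIZE):
        block = stream[offset:offset + BLOCK_SIZE]
        # This hash sits in the write path: nothing lands on disk until it finishes.
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = block  # first copy becomes the stored block
        recipe.append(digest)      # every block is recorded as a pointer
    return recipe


store = {}
backup = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
recipe = inline_dedupe(backup, store)
restored = b"".join(store[d] for d in recipe)
print(len(recipe), "blocks received,", len(store), "blocks stored")
```

Every block pays for the hash and the index lookup before it is written, which is why the cost shows up directly in ingest speed.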

By contrast, in concurrent processing solutions, the backup and deduplication processes overlap and can be load balanced across multiple nodes. As a result, concurrent deduplication software has minimal impact on FC port performance, letting the VTL back up and/or deduplicate data at 1 TB per hour per FC port.
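The two per-port figures are quoted in different units, so a quick conversion helps. This is back-of-the-envelope arithmetic that assumes decimal units (1 TB = 1,000,000 MB); the function names are mine.

```python
# Decimal units are an assumption: 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000
SECONDS_PER_HOUR = 3600


def tb_per_hour_to_mb_per_sec(tb_per_hour: float) -> float:
    return tb_per_hour * MB_PER_TB / SECONDS_PER_HOUR


def mb_per_sec_to_tb_per_hour(mb_per_sec: float) -> float:
    return mb_per_sec * SECONDS_PER_HOUR / MB_PER_TB


concurrent_port = tb_per_hour_to_mb_per_sec(1.0)  # the 1 TB/hour figure
inline_port = mb_per_sec_to_tb_per_hour(50.0)     # the 50MB/sec ceiling

print(f"1 TB/hour is about {concurrent_port:.0f} MB/sec per port")
print(f"50 MB/sec is about {inline_port:.2f} TB/hour per port")
```

On those assumptions, 1 TB per hour works out to roughly 278MB/sec per port, while a port held to 50MB/sec moves about 0.18 TB per hour.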

2. Hash-based inline systems don’t scale.
You chose a VTL to reduce complexity. However, hash-based deduplication software that can’t scale will make your backup environment a lot more complex. It forces you to create and manage multiple small virtual environments. Clustering and shared-memory limitations cause hash-based VTL products to top out at around 400MB/sec and less than 50TB of capacity, roughly equivalent to small physical tape autoloaders.

Vendors try to hide these scalability limitations by combining several of these little ‘virtual autoloaders’ under a single management console. But this problem cannot be hidden from the backup software. Think about it. You would need at least eight of them (and eight library devices) to handle a 400TB backup. That’s eight times the management, eight times the capacity planning, and eight times the overall complexity.
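The "at least eight" figure follows directly from the limits quoted above. Here is the arithmetic as a short sketch, again assuming decimal units; the variable names are illustrative.

```python
import math

backup_tb = 400                 # size of the backup to protect
capacity_per_library_tb = 50    # upper bound per hash-based system
throughput_mb_per_sec = 400     # upper bound per hash-based system

# Each system holds less than 50TB, so this is a lower bound on the count.
libraries_needed = math.ceil(backup_tb / capacity_per_library_tb)

# Time for one system to ingest its full 50TB at its top speed.
hours_to_fill_one = capacity_per_library_tb * 1_000_000 / throughput_mb_per_sec / 3600

print(libraries_needed, "libraries,", round(hours_to_fill_one, 1), "hours to fill each")
```

Because each system holds less than 50TB, eight is the best case; the real count can only be higher.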

3. Poor deduplication efficiency is unavoidable.
As described above, hash-based systems have to put your data on multiple “virtual autoloaders” or libraries to scale. Data cannot be compared between these libraries, so they fail to remove a significant amount of duplicate data. However, a system with concurrent, ContentAware deduplication, combined with a scalable VTL, can deduplicate and manage petabytes of data in a single system. This software looks at the actual content of the data to identify objects (files, spreadsheets, databases, etc.) that contain duplicate data. Having narrowed the search this way, it can then look for duplicate data at the byte level.
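A toy example shows why separate libraries lose efficiency. It models each library as having its own hash index and compares that with one shared index; the block contents and the `unique_blocks` helper are made up for illustration.

```python
import hashlib


def unique_blocks(backups):
    """Return the set of distinct block hashes across a list of backups."""
    return {hashlib.sha256(block).hexdigest()
            for backup in backups
            for block in backup}


# Two servers share the same OS image but have different databases.
library_a = [[b"os-image", b"database-1"]]
library_b = [[b"os-image", b"database-2"]]

# Separate libraries cannot compare data, so each keeps its own OS image copy.
stored_separately = len(unique_blocks(library_a)) + len(unique_blocks(library_b))

# A single system sees everything and stores the shared block once.
stored_together = len(unique_blocks(library_a + library_b))

print(stored_separately, "blocks stored separately vs", stored_together, "in one system")
```

The shared block is stored once per library, so the duplicate survives wherever the same data lands on different libraries.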

4. Restore performance degrades over time.
Hash-based deduplication products use a strategy called reverse referencing. The first time a block of data is stored the software designates it as a reference block.

When a duplicate block of data is identified (i.e. a hash hit), the software replaces it with a pointer to the original reference block. Over time, as more reference blocks are stored, backup jobs are broken into more and more pieces—new reference blocks and pointers to older reference blocks. To restore data stored this way, the system has to reassemble data from numerous reference blocks and pointers, which slows performance significantly. Ironically, the most recent backup requires the most reassembly.
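Here is a toy model of reverse referencing, assuming each block stays in the backup generation that first stored it. The backups and the `store_reverse` function are hypothetical and greatly simplified.

```python
def store_reverse(backups):
    """Each block stays where it was first written; later copies become pointers."""
    location = {}  # block -> index of the backup that first stored it
    recipes = []
    for generation, blocks in enumerate(backups):
        recipe = []
        for block in blocks:
            location.setdefault(block, generation)  # keep the oldest copy
            recipe.append(location[block])
        recipes.append(recipe)
    return recipes


backups = [
    ["a", "b", "c", "d"],  # first full backup
    ["a", "b", "c", "e"],  # one block changed
    ["a", "b", "f", "e"],  # another block changed
]
reverse_recipes = store_reverse(backups)

# How many different generations must be read to restore each backup?
reverse_spread = [len(set(recipe)) for recipe in reverse_recipes]
print(reverse_spread)
```

In this model the first backup restores from one place, while the newest backup has to pull blocks from all three generations.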

Deduplication technology that uses forward referencing works in the opposite way. These systems store an intact version of the most recent backup and replace older duplicate data with pointers forward to it. Since the most recent full backups are stored on the VTL intact, restore and migration operations can be performed at full speed without reassembly.
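A toy model of forward referencing, assuming the newest copy of each block is the one kept and older backups point forward to it. The backups and the `store_forward` function are hypothetical and greatly simplified.

```python
def store_forward(backups):
    """The newest copy of each block is kept; older copies point forward to it."""
    location = {}  # block -> index of the most recent backup containing it
    for generation, blocks in enumerate(backups):
        for block in blocks:
            location[block] = generation  # newest copy wins
    return [[location[block] for block in blocks] for blocks in backups]


backups = [
    ["a", "b", "c", "d"],  # first full backup
    ["a", "b", "c", "e"],  # one block changed
    ["a", "b", "f", "e"],  # another block changed
]
forward_recipes = store_forward(backups)

# How many different generations must be read to restore each backup?
forward_spread = [len(set(recipe)) for recipe in forward_recipes]
print(forward_spread)
```

In this model the most recent backup is read from a single generation, and it is the oldest backup that needs the most reassembly.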

5. Hash-based deduplication is “all or nothing”.
Enterprise backup environments are typically more complex than their SMB or department-level counterparts. You may have significant volumes of data that cannot be deduplicated or should not be deduplicated for regulatory reasons.

Hash-based systems require you to deduplicate all of the data stored on the VTL, wasting significant compute cycles and limiting the efficiency of your entire process. They add more complexity by forcing you to back up data that you do not want to deduplicate to a separate system.

Enterprise-optimized deduplication software identifies each backup job as it is sent to the VTL and enables you to choose the level of deduplication you want to perform on each (including no deduplication at all). You can get detailed reports on the deduplication efficiency of each backup job so that you can fine-tune your backups for optimal results.
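As a rough illustration of per-job control, the sketch below maps backup jobs to a deduplication level and reports the resulting ratio for each. The job names, the `POLICY` table, the `apply_policy` function, and the 5:1 ratio are all hypothetical; this is not the interface of any real product.

```python
from dataclasses import dataclass


@dataclass
class JobReport:
    job: str
    logical_tb: float  # data sent by the backup application
    stored_tb: float   # data actually kept on disk

    @property
    def ratio(self) -> float:
        return self.logical_tb / self.stored_tb


# Hypothetical per-job policy: "none" means the job is stored as-is.
POLICY = {"exchange-full": "dedupe", "legal-hold": "none"}


def apply_policy(job: str, logical_tb: float, dedupe_ratio: float) -> JobReport:
    level = POLICY.get(job, "dedupe")  # unknown jobs default to deduplication
    stored = logical_tb if level == "none" else logical_tb / dedupe_ratio
    return JobReport(job, logical_tb, stored)


reports = [
    apply_policy("exchange-full", 10.0, 5.0),
    apply_policy("legal-hold", 10.0, 5.0),
]
for report in reports:
    print(f"{report.job}: {report.ratio:.1f}:1")
```

The point is only that the decision is made per job, so data that must stay untouched can share the same system as data that deduplicates well.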

For more information about deduplication technologies check out this Network World podcast!