enterprise buyers guides since 1991

storage search
"leading the way to the new storage frontier"

Is Deduplication of Data Safe? - and More Deduplication FAQs

by Philip Turner, Regional Director, UK & Ireland, Data Domain - September 22, 2008

Editor's intro:- Deduplication is green - but is it safe? Dedupe started out as a simple idea. But it could never stay that way. Vendors offer different approaches, each of which they believe is best. The complexity of operating in different environments means that design choices or compromises have to be made. Consequently the simple idea of "dedupe" has spawned many complex interpretations. This dedupe FAQs article gives readers looking at this option a solid starting point.

Deduplication FAQs

Do you have impure thoughts
about deduping SSDs?
Editor:- April 11, 2013 - What comes to your mind when you think about SSDs and dedupe?

A theoretical ratio? - x2, x5, x10...

Or maybe you groan? - It's too messy to manage and even if capacity gets better, something else gets worse - so let's just forget the idea...

A recent blog - Introducing the SSD Dedupe Ticker - by Pure Storage - looks at the state of customer reality in this aspect of SSD array technology and comments on the variations in dedupe ratio you can get depending on the type of app and the way the dedupe is done.

Among other things the article also looks at the big question - performance impact - answering the author's rhetorical question - "why hasn't deduplication taken the primary storage world by storm like it has the backup world?" ...read the article
SandForce dedupes inside the SSD

Editor:- January 19, 2011 - Did you know that SandForce's SSD controllers do compression and dedupe as some of the tactics to manage flash endurance?

I suspected it - because some other designs do it too - but I wasn't sure. In the case of SandForce this design approach was confirmed in an article published recently in Electronic Design.

One consequence for designers of solid state storage arrays is that data written to some SSDs may already have been compressed and deduplicated internally, so array-level data reduction may yield less benefit than it would on HDDs of the same nominal capacity.

Where are we now with SSD software? - (And how did we get into this mess?)

the Survivor's Guide to Enterprise SSDs - a list of do's and don'ts

Strategic Transitions in SSD - roundup of recent disruptive changes in the SSD market
photo: Philip Turner, author of this dedupe FAQs article
About the author...

Philip Turner is Data Domain's Regional Director
for UK and Ireland with over 15 years' experience
in the storage market.

Prior to joining Data Domain in 2007, Philip served as
EMEA Director for Acopia Networks. He was also
one of the first European employees at NetApp,
where he held the position of District Manager for the
public sector and enterprise accounts.

Philip has also held positions at IronPort Systems
and Computer Associates.

Here are some other editor-selected articles which have interesting things to say about dedupe...
  • How Safe Is Deduplication? ...when evaluating deduplication technologies find out how the vendor identifies duplicates and ask about the risk of hash collisions ...
  • Dedupe Performance Rant - you try to recover from a "full" backup on the data domain, but that file has been living in it for a year...
  • Aspects of Disk Backup - comprehensively reviews the why? how? and where? of today's modern enterprise disk backup techniques.
1 What is data deduplication?

Deduplication is similar to data compression, but it looks for redundancy of very large sequences of bytes across very large comparison windows. Long (8KB+) sequences are compared to the history of other such sequences, and where possible, the first uniquely stored version of a sequence is referenced rather than stored again. In a storage system, this is all hidden from users and applications, so the whole file is readable after having been written.
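The segment-and-reference mechanism described above can be sketched in a few lines (a toy model for illustration only - the 8KB fixed segment size, SHA-256 fingerprints and the `DedupeStore` class are assumptions, not any particular vendor's design):

```python
import hashlib

SEGMENT_SIZE = 8 * 1024  # 8KB fixed segments - an illustrative choice

class DedupeStore:
    """Toy dedupe store: keeps one physical copy of each unique segment."""
    def __init__(self):
        self.segments = {}   # fingerprint -> segment bytes (stored once)
        self.files = {}      # filename -> ordered list of fingerprints

    def write(self, name, data):
        refs = []
        for i in range(0, len(data), SEGMENT_SIZE):
            seg = data[i:i + SEGMENT_SIZE]
            fp = hashlib.sha256(seg).hexdigest()
            self.segments.setdefault(fp, seg)  # store only if never seen before
            refs.append(fp)
        self.files[name] = refs

    def read(self, name):
        # the caller gets the whole file back; the references are invisible
        return b"".join(self.segments[fp] for fp in self.files[name])
```

Writing the same data under two names doubles the logical capacity consumed but leaves the physical segment count unchanged - which is the whole point.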
2 Why deduplicate data?

Eliminating redundant data can significantly shrink storage requirements and improve bandwidth efficiency. Because primary storage has become cheaper over time, enterprises typically store many versions of the same information so that new work can re-use old work. Some operations, like backup, store extremely redundant information. Deduplication lowers storage costs, since fewer disks are needed, and shortens backup/recovery times, since there can be far less data to transfer. In the context of backup and other nearline data, we can safely assume that there is a great deal of duplicate data. The same data keeps getting stored over and over again, consuming a lot of unnecessary storage space (disk or tape), electricity (to power and cool the disk or tape drives) and bandwidth (for replication), creating a chain of cost and resource inefficiencies within the organisation.
3 How does data deduplication work?

Deduplication segments the incoming data stream, uniquely identifies the data segments, and then compares the segments to previously stored data. If an incoming data segment is a duplicate of what has already been stored, the segment is not stored again, but a reference is created to it. If the segment is unique, it is stored on disk.

For example, a file or volume that is backed up every week creates a significant amount of duplicate data. Deduplication algorithms analyse the data and can store only the compressed, unique change elements of that file. This process can provide an average of 10-30 times or greater reduction in storage capacity requirements, with average backup retention policies on normal enterprise data. This means that companies can store 10TB to 30TB of backup data on 1TB of physical disk capacity, which has huge economic benefits.
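The arithmetic behind such ratios can be shown with a back-of-envelope model (all figures hypothetical: weekly fulls of constant logical size, roughly 1% new data per week, 20 fulls retained):

```python
# Back-of-envelope model of the dedupe ratio for weekly full backups.
# Assumptions (illustrative, not measured): each full backup has the same
# logical size, and about 1% of the data is new or changed each week.
full_size_tb = 10.0     # logical size of one full backup
weekly_change = 0.01    # fraction of data that is new each week
retained_fulls = 20     # retention policy: 20 weekly fulls kept on disk

logical_tb = full_size_tb * retained_fulls
# physically stored: one baseline full plus the unique changes of each later full
physical_tb = full_size_tb * (1 + (retained_fulls - 1) * weekly_change)
ratio = logical_tb / physical_tb
print(f"logical {logical_tb:.0f}TB, physical {physical_tb:.1f}TB, ratio {ratio:.1f}x")
```

With these assumptions the model lands at roughly 17x - inside the 10-30x range quoted above; a lower change rate or longer retention pushes the ratio higher.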
4 Is SIS (Single Instance Store) a form of deduplication?

Reducing duplicate file copies is a limited form of deduplication sometimes called single instance storage or SIS. This file level deduplication is intended to eliminate redundant (duplicate) files on a storage system by saving only a single instance of data or a file.

If you change the title of a 2 MB Microsoft Word document, SIS would retain the first copy of the Word document and store the entire copy of the modified document. Any change to a file requires the entire changed file be stored. Frequently changed files would not benefit from SIS. Data deduplication, which reduces sub-file level data, would recognise that only the title had changed - and in effect only store the new title, with pointers to the rest of the document's content segments.
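The contrast can be made concrete with a toy comparison (hypothetical file and segment sizes; a real Word document does not change this cleanly, and `sis_stored`/`dedupe_stored` are illustrative helpers, not product APIs):

```python
import hashlib
import random

SEG = 4096  # illustrative sub-file segment size

def sis_stored(files):
    """File-level SIS: one physical copy per unique whole file."""
    unique = {hashlib.sha256(f).hexdigest(): f for f in files}
    return sum(len(f) for f in unique.values())

def dedupe_stored(files):
    """Sub-file dedupe: one physical copy per unique segment."""
    unique = {}
    for f in files:
        for i in range(0, len(f), SEG):
            seg = f[i:i + SEG]
            unique.setdefault(hashlib.sha256(seg).hexdigest(), seg)
    return sum(len(s) for s in unique.values())

# a 2MB "document", then the same document with only its first bytes edited
random.seed(1)
original = random.randbytes(2 * 1024 * 1024)
modified = b"NEW TITLE!" + original[10:]   # same length, only the start changed

print(sis_stored([original, modified]))     # both copies kept in full
print(dedupe_stored([original, modified]))  # only the changed segment stored extra
```

SIS ends up storing both 2MB copies; sub-file dedupe stores one copy plus a single changed segment.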
5 What data deduplication rates are expected?

First, redundancy will vary by application, frequency of version capture and retention policy. Significant variables include the rate of data change (fewer changes mean more duplicate data), the frequency of backups (more full backups increase the compression effect), the retention period (longer retention means more data to compare against), and the size of the data set (more data means more to deduplicate).

When comparing different approaches, be sure to compare against a common baseline. For example, some backup software can offer deduplication, but these packages simultaneously run incrementals-forever backup policies; yet for a high-contrast comparison, they quote their dedupe effect against daily-full-backup policies with very long retention.

The deduplication technology approach and granularity of the deduplication process will also affect compression rates. Data reduction techniques typically split each file into segments or chunks; the segment size varies from vendor to vendor. If the segment size is very large, then fewer segment matches will occur, resulting in smaller storage savings (lower compression rates). If the segment size is very small the ability to find more redundancy in the data increases. Vendors also differ on how to split up the data. Some vendors split data into fixed length segments, while others use variable length segments.
  • Fixed-length segments (also blocks). The main limitation of this approach is that when the data in a file is shifted, for example when adding a slide to a PowerPoint deck, all subsequent blocks in the file will be rewritten and are likely to be considered as different from those in the original file, so the compression effect is less significant. Smaller blocks will get better deduplication than large ones, but it will take more processing to deduplicate.
  • Variable-length segments. A more advanced approach is to anchor variable-length segments based on their interior data patterns. This solves the data shifting problem of the fixed-size block approach.
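The anchoring idea can be sketched with a toy content-defined chunker (the additive rolling hash, window and size bounds here are illustrative assumptions - production systems use stronger fingerprints such as Rabin hashes):

```python
def cdc_chunks(data, window=16, mask=0x3F, min_size=32, max_size=1024):
    """Cut a chunk whenever the rolling hash of the last `window` bytes
    hits a fixed bit pattern, so boundaries follow content, not offsets.
    Shifting the data therefore shifts the boundaries with it."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h += b
        if i - start >= window:
            h -= data[i - window]          # slide the window forward
        size = i - start + 1
        at_anchor = size >= min_size and (h & mask) == 0
        if at_anchor or size >= max_size:  # max_size caps pathological runs
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])        # trailing remainder
    return chunks
```

Because an anchor depends only on the bytes in its local window, inserting data near the front of a file leaves most later chunk boundaries (and hence their fingerprints) unchanged - exactly the property fixed-size blocks lack.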
6 What is the difference between inline vs. post-process deduplication?

Inline deduplication means the data is deduplicated before it is written to disk (inline).

Post-process deduplication analyses and reduces data after it has been stored to disk. Inline deduplication is the most efficient and economic method of deduplication. Inline deduplication significantly reduces the raw disk capacity needed in the system since the full, not-yet-deduplicated data set is never written to disk. If replication is supported as part of the inline deduplication process, inline also optimises time-to-DR (disaster recovery) far beyond all other methods as the system does not need to wait to absorb the entire data set and then deduplicate it before it can begin replicating to the remote site.

Post-process deduplication technologies wait for the data to land in full on disk before initiating the deduplication process. This approach requires a greater initial capacity overhead than inline solutions. It increases the lag time before deduplication is complete, and by extension, when replication will complete, since it is highly advantageous to replicate only deduplicated (small) data. In practice, it also appears to create significant operational issues, since there are two storage zones, each with policies and behaviours to manage. In some cases, since the redundant storage zone is the default and more important design for some vendors, the dedupe zone is also much less performant and resilient.
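The raw-capacity difference between the two approaches is easy to model (all figures hypothetical: one 10TB full per night, 30 days retained, a 20x dedupe ratio, and a post-process landing zone sized for one not-yet-deduplicated full):

```python
nightly_full_tb = 10.0     # size of one full backup (hypothetical)
retained_days = 30         # retention window (hypothetical)
dedupe_ratio = 20.0        # achieved reduction (hypothetical)

deduped_tb = nightly_full_tb * retained_days / dedupe_ratio
inline_raw_tb = deduped_tb                     # data lands already deduplicated
post_raw_tb = deduped_tb + nightly_full_tb     # plus a full-size landing zone
print(f"inline needs {inline_raw_tb}TB raw, post-process {post_raw_tb}TB")
```

In this sketch the post-process system needs two-thirds more raw disk for the same retained data - the overhead the text above describes.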
7 Is there an advantage to parsing backup formats to deduplicate?

To be application independent and support the broad variety of Nearline applications, it is much more straightforward to work independently of application specific formats. Some vendors go against this trend and are content-dependent. This means they are locked into support of particular backup products and revisions; they parse those formats and create an internal file system, so that when a new file version comes in, they can compare it to its prior entry in its directory and store only the changes, not unlike a version control system for software development.

This approach sounds promising - it could optimise compression tactics for particular data types, for example - but in practice it has more weaknesses than strengths.

First, it is very capital intensive to develop.

Second, it always involves some amount of reverse engineering, and sometimes the format originators are not supportive, so it will never be universal.

Third, it makes it hard to find redundancy beyond the originating client space; it only compares versions of files from the same client/file system, yet cross-client redundancy is much larger than any file-type compression optimisation.

Finally, it is hard to deploy; it can often require additional policy set-up on a per-backup-policy or per-file-type basis. If done right, it is onerous; if done wrong, it will leave a lot of redundancy undeduplicated.
8 How does deduplication improve off-site replication and Disaster Recovery?

The effect deduplication has on replication and disaster recovery windows can be profound.

To start, deduplication means a lot less data needs transmission to keep the DR site up to date, so much less expensive WAN links may be used.

Second, replication goes a lot faster because there is less data to send.

The length of the deduplication process (beginning to end) depends on many variables including the deduplication approach, the speed of the architecture and the DR process. For the most efficient time-to-DR, inline deduplication and replication (inline) of deduplicated data will yield the most aggressive and efficient results. In an inline deduplication approach, replication happens during the backup, significantly improving the time by which there is a complete restore point at the DR site, or improving the time to DR readiness.

Typically, less than 1% of a full backup consists of new, unique data sequences; with inline deduplication these can be sent over a WAN immediately, as soon as the backup starts.
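That 1% figure translates directly into replication time; a sketch with hypothetical numbers (a 10TB full backup, a 100Mbit/s WAN link):

```python
full_tb = 10.0            # logical size of one full backup (hypothetical)
unique_fraction = 0.01    # ~1% of a full is new data, per the text
link_mbit_s = 100.0       # WAN link speed (hypothetical)

def transfer_hours(tb, mbit_s):
    """Hours to push `tb` terabytes over a `mbit_s` megabit/s link."""
    bits = tb * 1e12 * 8
    return bits / (mbit_s * 1e6) / 3600

print(f"replicating the full:   {transfer_hours(full_tb, link_mbit_s):.0f} h")
print(f"replicating the deltas: {transfer_hours(full_tb * unique_fraction, link_mbit_s):.1f} h")
```

Shipping only the unique 1% turns a multi-day transfer into a couple of hours on the same link - or, equivalently, lets a far cheaper link meet the same DR window.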

Aggressive cross-site deduplication, when multiple sites replicate to the same destination, can add additional value by deduplicating across all backup replication streams and all local backups. Unique deduplicated segments previously transferred by any remote site, or held in local backup, are then used in the deduplication process to further improve network efficiency by reducing the data to be vaulted. In other words, if the destination system already has a data sequence that came from a remote site or a local backup and that same sequence is created at another remote site, it will be identified as redundant before it consumes bandwidth travelling across the network to the destination system. All of the data collected at the destination site can be safely moved off-site to a single location or multiple DR sites.
9 Is deduplication of data safe?

It's very difficult to harden a storage system so that it has the resiliency that you need to remain operational through a drive failure or a power failure. Find out what technologies the deduplication solution has to ensure data integrity and protection against system failures. The system should tolerate deletions, cleaning, rebuilding a drive, multiple drive failures, power failures - all without data loss or corruption. While this is always important in storage, it is an even bigger consideration in data protection with deduplication. With deduplication solutions, there may be 1,000 backup images that rely on one copy of source data. Therefore this source data needs to be kept accessible and with a high level of data integrity.

While the need is higher for data integrity in deduplication storage, it also offers new opportunities for data verification.
10 How will data deduplication affect my backup and restore performance?

Restore access time will be faster than tape, since the data is online and random access. Throughput will vary by vendor. Data deduplication is a resource-intensive process. During writes, it needs to determine whether each new small sequence of data has been stored before, often across hundreds of terabytes of prior data. A simple index of this data is too big to fit in RAM unless it is a very small deployment, so the system needs to seek on disk, and disk seeks are notoriously slow (and not getting better).
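A quick sizing exercise (illustrative figures: 100TB of prior data, 8KB average segments, 32 bytes per index entry) shows why the index outgrows RAM:

```python
stored_tb = 100.0              # prior data the system must compare against
avg_segment_bytes = 8 * 1024   # average segment size (illustrative)
entry_bytes = 32               # fingerprint + location per entry (illustrative)

segments = stored_tb * 1e12 / avg_segment_bytes
index_gb = segments * entry_bytes / 1e9
print(f"{segments:.2e} segments -> {index_gb:.0f} GB of index")
```

Hundreds of gigabytes of index per 100TB stored is far beyond typical controller RAM, so lookups spill to disk unless the design works around it.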

The easiest ways to make data deduplication go fast are

(1) to be worse at data reduction, e.g. look only for big sequences, so you don't have to perform disk seeks as frequently; and

(2) to add more hardware, e.g. so there are more disks across which to spread the load.

Both have the unfortunate side effect of raising system price, so it becomes less attractive against tape from a cost perspective. Vendors vary in their approaches. Understand:
  • Single stream backup and restore throughput. This is how fast a given file/database can be written, read, or copied to tape for longer-term archiving. The numbers may be different: read speed and write speed may have separate issues. Because of backup windows for critical data, backup throughput is what most people ask about, though restore time is more significant for most service level agreements.
  • Aggregate backup/restore throughput per system. With many streams, how fast can a given controller go? This will help gauge the number of controllers/systems needed for your deployment. It is mostly a measure of system management (number of systems) and cost - single stream speed is more important for getting the job done.
  • Types of data. For example, will large files, such as databases or Exchange stores, go slower than small files? Some deduplication approaches look for simple tricks to increase average performance, e.g. identifying common whole files. These approaches do not work with structured data, which tends to be large. So the easiest big test of a dedupe system is to see what the dedupe throughput is on big database files day over day. In some cases, it will go slow; in others, it will get poor deduplication (e.g. by using a very large fixed segment).
  • Is the 30th backup different from the 1st? If you back up images and delete them over time, does the performance of the system change? Because deduplication scatters so many references around the store for new documents, do the recovery characteristics for a recent backup (what you'll mostly be recovering) change a month or two into deployment vs. the first pilot? In a well-designed deduplication system, restore of a new backup should not change significantly even a year into deployment. Surprisingly, not all vendors offer this behavioural consistency.
Performance in your deployment will depend on many factors, including the backup software and the systems and networks supporting it.
11 Is deduplication performance determined by the number of disk drives used?

In any storage system, the disk drives are the slowest component. To get greater performance it is common practice to stripe data across a large number of drives so they work in parallel to handle I/O. If a system uses this method to reach performance requirements, you need to ask what the right balance between performance and capacity is. This is important, since the point of data deduplication is to reduce the number of disk drives.
12 How much "upfront" capacity does deduplication require?

This is not a question for inline deduplication systems, but it is for post-process systems. Post-process methods require additional capacity to temporarily store duplicate backup data.

How much disk capacity is needed may depend on the size of the backup data sets, how many backup jobs you run on a daily basis, and how long the deduplication technology "holds on" to the capacity before releasing it. Post-process solutions that wait for the backup process to complete before beginning to deduplicate will require larger disk caches than those that start the deduplication process during the backup process.
13 What are best practices in choosing a deduplication solution?
  • Ensure ease of integration to existing environment.
  • Get customer references - in your industry.
  • Pilot the product/technology - in your environment.
  • Understand the vendor's roadmap.
Here are some more articles you may want to take a look at.


STORAGEsearch is published by ACSL