|Editor's intro: Deduplication is green - but is it safe? Dedupe started out as a simple idea, but it could never stay that way. Vendors offer different approaches, each of which they believe is best. The complexity of operating in different environments means that design choices and compromises have to be made. Consequently, the simple idea of "dedupe" has spawned many complex interpretations. This ultimate dedupe FAQs article gives a solid starting point for readers looking at this option.|
|1||What is data deduplication? |
Deduplication is similar to data compression, but it looks for redundancy of very large sequences of bytes across very large comparison windows. Long (8KB+) sequences are compared to the history of other such sequences, and where possible, the first uniquely stored version of a sequence is referenced rather than stored again. In a storage system, this is all hidden from users and applications, so the whole file is readable after having been written.
|2||Why deduplicate data? |
Eliminating redundant data can significantly shrink storage requirements and improve bandwidth efficiency. Because primary storage has become cheaper over time, enterprises typically store many versions of the same information so that new work can re-use old work. Some operations, such as backup, store extremely redundant information. Deduplication lowers storage costs, since fewer disks are needed, and shortens backup/recovery times, since there can be far less data to transfer. In the context of backup and other nearline data, we can safely assume there is a great deal of duplicate data. The same data keeps getting stored over and over again, consuming unnecessary storage space (disk or tape), electricity (to power and cool the disk or tape drives) and bandwidth (for replication), creating a chain of cost and resource inefficiencies within the organisation.
|3|| How does data deduplication work?|
Deduplication segments the incoming data stream, uniquely identifies the data segments, and then compares the segments to previously stored data. If an incoming data segment is a duplicate of what has already been stored, the segment is not stored again, but a reference is created to it. If the segment is unique, it is stored on disk.
For example, a file or volume that is backed up every week creates a significant amount of duplicate data. Deduplication algorithms analyse the data and can store only the compressed, unique change elements of that file. This process can provide an average of 10-30 times or greater reduction in storage capacity requirements, with average backup retention policies on normal enterprise data. This means that companies can store 10TB to 30TB of backup data on 1TB of physical disk capacity, which has huge economic benefits.
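The segment-identify-compare-reference cycle described above can be sketched in a few lines. This is a minimal illustrative model, not any vendor's implementation: it uses fixed 8KB segments and an in-memory dictionary, where real systems use variable-length segments and on-disk indexes.

```python
import hashlib

SEGMENT_SIZE = 8 * 1024  # 8 KB segments, per the sizes discussed above

class DedupeStore:
    def __init__(self):
        self.segments = {}   # fingerprint -> segment bytes (each stored once)
        self.recipes = {}    # backup name -> ordered list of fingerprints

    def write(self, name, data):
        refs = []
        for i in range(0, len(data), SEGMENT_SIZE):
            seg = data[i:i + SEGMENT_SIZE]
            fp = hashlib.sha256(seg).hexdigest()
            if fp not in self.segments:
                self.segments[fp] = seg   # unique segment: store it
            refs.append(fp)               # duplicate: store only a reference
        self.recipes[name] = refs

    def read(self, name):
        # Reassemble the full file from references, invisibly to the reader
        return b"".join(self.segments[fp] for fp in self.recipes[name])

store = DedupeStore()
payload = b"x" * SEGMENT_SIZE * 4          # four identical segments
store.write("monday.bak", payload)
store.write("tuesday.bak", payload)        # a completely redundant "backup"
print(len(store.segments))                 # 1: only one unique segment kept
```

Two backups of four identical segments consume the space of a single segment plus two short recipes, while both remain fully readable.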
|4||Is SIS (Single Instance Store) a form of deduplication? |
Reducing duplicate file copies is a limited form of deduplication sometimes called single instance storage or SIS. This file level deduplication is intended to eliminate redundant (duplicate) files on a storage system by saving only a single instance of data or a file.
If you change the title of a 2 MB Microsoft Word document, SIS would retain the first copy of the Word document and store the entire copy of the modified document. Any change to a file requires the entire changed file be stored. Frequently changed files would not benefit from SIS. Data deduplication, which reduces sub-file level data, would recognise that only the title had changed - and in effect only store the new title, with pointers to the rest of the document's content segments.
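The Word-document example can be made concrete by comparing whole-file fingerprints (SIS) with sub-file fingerprints. The data and the fixed 1 KB chunking below are hypothetical, chosen only so the title change lands in one chunk.

```python
import hashlib

CHUNK = 1024  # illustrative sub-file chunk size

def file_fingerprints(data):
    # SIS granularity: one fingerprint for the whole file
    return {hashlib.sha256(data).hexdigest()}

def chunk_fingerprints(data):
    # Sub-file granularity: one fingerprint per chunk
    return {hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)}

original = b"Old Title".ljust(CHUNK, b".") + b"body" * 2048
modified = b"New Title".ljust(CHUNK, b".") + b"body" * 2048  # only the title differs

# SIS: the whole-file hashes differ, so the entire modified file is stored again
assert file_fingerprints(original) != file_fingerprints(modified)

# Sub-file dedupe: every chunk except the title chunk is already stored
new_chunks = chunk_fingerprints(modified) - chunk_fingerprints(original)
print(len(new_chunks))   # 1: only the changed title chunk needs storing
```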
|5||What data deduplication rates are expected? |
First, redundancy will vary by application, frequency of version capture and retention policy. Significant variables include the rate of data change (fewer changes mean more duplicate data to eliminate), the frequency of full backups (more fulls means a higher deduplication effect), the retention period (longer retention means more data to compare against), and the size of the data set (more data means more opportunity to deduplicate).
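How these variables interact can be shown with a rough estimator. This is a back-of-envelope model under stated assumptions (daily full backups, a uniform daily change rate, conventional compression applied to unique segments), not any vendor's sizing formula.

```python
def dedupe_ratio(retained_fulls, daily_change_rate, local_compression=2.0):
    """Rough logical:physical ratio for a daily-full-backup policy.

    retained_fulls    -- number of full backups kept (retention)
    daily_change_rate -- fraction of data that is new each day (e.g. 0.01)
    local_compression -- conventional compression applied to unique segments
    """
    logical = retained_fulls                              # data written, in fulls
    physical = 1 + (retained_fulls - 1) * daily_change_rate  # first full + daily deltas
    return logical / (physical / local_compression)

# 30 retained daily fulls, 1% daily change, 2:1 local compression
print(round(dedupe_ratio(30, 0.01), 1))
```

Under these assumptions the ratio lands comfortably in the 10-30x-or-greater range quoted above, and it is easy to see why longer retention and lower change rates push it higher.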
When comparing different approaches, be sure to use a common baseline. For example, some backup software offers deduplication, but these packages also run incremental-forever backup policies; for a flattering comparison, they quote their dedupe effect against daily-full-backup policies with very long retention.
The deduplication technology approach and granularity of the deduplication process will also affect compression rates. Data reduction techniques typically split each file into segments or chunks; the segment size varies from vendor to vendor. If the segment size is very large, then fewer segment matches will occur, resulting in smaller storage savings (lower compression rates). If the segment size is very small the ability to find more redundancy in the data increases. Vendors also differ on how to split up the data. Some vendors split data into fixed length segments, while others use variable length segments.
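The fixed-versus-variable distinction matters most when data shifts. The sketch below contrasts the two after a small insertion at the front of a stream; the rolling condition is a toy byte-sum, standing in for the Rabin-style fingerprints real systems use, and all parameters are illustrative.

```python
import random

def fixed_chunks(data, size=64):
    # Fixed-length segmentation: boundaries at fixed offsets
    return [data[i:i + size] for i in range(0, len(data), size)]

def cdc_chunks(data, mask=0x1F, window=8, min_len=16):
    # Toy content-defined chunking: cut where the sum of the last
    # `window` bytes matches a boundary pattern (real systems use
    # rolling Rabin fingerprints, not a byte-sum)
    chunks, start = [], 0
    for i in range(len(data)):
        if i - start >= min_len and sum(data[max(0, i - window):i]) & mask == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

random.seed(7)
base = bytes(random.randrange(256) for _ in range(4096))
shifted = b"INSERTED" + base   # 8 bytes inserted at the front

results = {}
for name, fn in (("fixed", fixed_chunks), ("variable", cdc_chunks)):
    results[name] = len(set(fn(base)) & set(fn(shifted)))
print(results)
```

With fixed-length segments the insertion shifts every boundary, so almost nothing matches; content-defined boundaries re-align after the insertion, so most segments are still found as duplicates.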
|6||What is the difference between inline vs. post-process deduplication? |
Inline deduplication means the data is deduplicated before it is written to disk (inline).
Post-process deduplication analyses and reduces data after it has been stored to disk. Inline deduplication is the most efficient and economic method of deduplication. Inline deduplication significantly reduces the raw disk capacity needed in the system since the full, not-yet-deduplicated data set is never written to disk. If replication is supported as part of the inline deduplication process, inline also optimises time-to-DR (disaster recovery) far beyond all other methods as the system does not need to wait to absorb the entire data set and then deduplicate it before it can begin replicating to the remote site.
Post-process deduplication technologies wait for the data to land in full on disk before initiating the deduplication process. This approach requires a greater initial capacity overhead than inline solutions. It increases the lag time before deduplication is complete, and by extension, when replication will complete, since it is highly advantageous to replicate only deduplicated (small) data. In practice, it also appears to create significant operational issues, since there are two storage zones, each with policies and behaviours to manage. In some cases, since the redundant storage zone is the default and more important design for some vendors, the dedupe zone is also much less performant and resilient.
|7||Is there an advantage to parsing backup formats to deduplicate? |
To be application independent and support the broad variety of Nearline applications, it is much more straightforward to work independently of application specific formats. Some vendors go against this trend and are content-dependent. This means they are locked into support of particular backup products and revisions; they parse those formats and create an internal file system, so that when a new file version comes in, they can compare it to its prior entry in its directory and store only the changes, not unlike a version control system for software development.
This approach sounds promising - it could optimise compression tactics for particular data types, for example - but in practice it has more weaknesses than strengths.
First, it is very capital intensive to develop.
Second, it always involves some amount of reverse engineering, and sometimes the format originators are not supportive, so it will never be universal.
Third, it makes it hard to find redundancy in other parts of the originating client space; it only compares versions of files from the same client/file system, and this level of redundancy is much larger than any file-type compression optimisation.
Finally, it is hard to deploy; it can often require additional policy set-up on a per-backup-policy or per-file-type basis. If done right, it is onerous; if done wrong, it will leave a lot of redundancy undeduplicated.
|8||How does deduplication improve off-site replication and Disaster Recovery? |
The effect deduplication has on replication and disaster recovery windows can be profound.
To start, deduplication means a lot less data needs transmission to keep the DR site up to date, so much less expensive WAN links may be used.
Second, replication goes a lot faster because there is less data to send.
The length of the deduplication process (beginning to end) depends on many variables including the deduplication approach, the speed of the architecture and the DR process. For the most efficient time-to-DR, inline deduplication and replication (inline) of deduplicated data will yield the most aggressive and efficient results. In an inline deduplication approach, replication happens during the backup, significantly improving the time by which there is a complete restore point at the DR site, or improving the time to DR readiness.
Typically less than 1% of a full backup is actually new, unique data; these deduplicated sequences can be sent over a WAN immediately upon the start of the backup.
Aggressive cross-site deduplication, when multiple sites replicate to the same destination, can add additional value by deduplicating across all backup replication streams and all local backups. Unique deduplicated segments previously transferred by any remote site, or held in local backup, are then used in the deduplication process to further improve network efficiency by reducing the data to be vaulted. In other words, if the destination system already has a data sequence that came from a remote site or a local backup and that same sequence is created at another remote site, it will be identified as redundant before it consumes bandwidth travelling across the network to the destination system. All of the data collected at the destination site can be safely moved off-site to a single location or multiple DR sites.
|9||Is deduplication of data safe?|
It's very difficult to harden a storage system so that it has the resiliency that you need to remain operational through a drive failure or a power failure. Find out what technologies the deduplication solution has to ensure data integrity and protection against system failures. The system should tolerate deletions, cleaning, rebuilding a drive, multiple drive failures, power failures - all without data loss or corruption. While this is always important in storage, it is an even bigger consideration in data protection with deduplication. With deduplication solutions, there may be 1,000 backup images that rely on one copy of source data. Therefore this source data needs to be kept accessible and with a high level of data integrity.
While the need is higher for data integrity in deduplication storage, it also offers new opportunities for data verification.
|10||How will data deduplication affect my backup and restore performance? |
Restore access times will be faster than tape, since the data is online and random access. Throughput will vary by vendor. Data deduplication is a resource-intensive process: during writes, the system needs to determine whether each new small sequence of data has been stored before, often across hundreds of terabytes of prior data. A simple index of this data is too big to fit in RAM unless the deployment is very small, so the system must seek on disk, and disk seeks are notoriously slow (and not getting faster).
The easiest ways to make data deduplication go fast are:
(1) to be worse at data reduction, e.g. look only for big sequences, so you don't have to perform disk seeks as frequently; and
(2) to add more hardware, e.g. so there are more disks across which to spread the load.
Both have the unfortunate side effect of raising the system price, so it becomes less attractive against tape from a cost perspective. Vendors vary in their approaches.
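To see why the fingerprint index outgrows RAM, a back-of-envelope calculation helps. The figures below (100TB of prior unique data, 8KB average segments, SHA-256 fingerprints) are illustrative assumptions, not any vendor's specification.

```python
# Back-of-envelope sizing of a naive in-memory fingerprint index
stored_bytes  = 100 * 2**40      # 100 TB of previously stored unique data
segment_bytes = 8 * 2**10        # 8 KB average segment size
entry_bytes   = 32 + 8           # SHA-256 fingerprint + a disk address

entries   = stored_bytes // segment_bytes
index_gib = entries * entry_bytes / 2**30
print(round(index_gib))          # 500 GiB: far too big for RAM
```

Hundreds of gibibytes of index for a single system is why deduplication engines rely on on-disk index structures, caching and locality tricks rather than a flat in-memory table.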
|11||Is deduplication performance determined by the number of disk drives? |
In any storage system, the disk drives are the slowest component. In order to get greater performance it is a common practice to stripe data across a large number of drives so they work in parallel to handle I/O. If the system uses this method to reach performance requirements, you need to ask what the right balance between performance and capacity is. This is important, since the point of data deduplication is to reduce the number of disk drives.
|12||How much "upfront" capacity does deduplication require? |
This is not a question for inline deduplication systems, but it is for post-process systems, which require additional capacity to temporarily store the not-yet-deduplicated backup data.
How much disk capacity is needed may depend on the size of the backup data sets; how many backup jobs you run on a daily basis, and how long the deduplication technology "holds on" to the capacity before releasing it. Post-process solutions that wait for the backup process to complete before beginning to deduplicate will require larger disk caches than those that start the deduplication process during the backup process.
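The difference in upfront capacity can be put in rough numbers. The figures here (a 20TB nightly backup set, an assumed 20:1 reduction) are purely illustrative.

```python
# Illustrative upfront-capacity comparison, inline vs post-process
nightly_backup_tb = 20    # full backup set landing each night (assumed)
dedupe_ratio      = 20    # assumed 20:1 reduction

# Inline: only new unique segments ever reach disk
inline_landing_tb = nightly_backup_tb / dedupe_ratio

# Post-process: the full, not-yet-deduplicated set is staged first
post_process_tb = nightly_backup_tb

print(inline_landing_tb, post_process_tb)   # 1.0 vs 20 TB of landing space
```

A post-process design that waits for the whole backup window to finish needs the full 20TB of staging disk every night; an inline design needs only the deduplicated fraction.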
|13||What are best practices in choosing a deduplication solution? |
STORAGEsearch is published by ACSL