
SSD Reliability

managing data integrity in mission-critical solid state storage arrays

(includes a directory of recommended articles and papers about SSD reliability)

by Zsolt Kerekes, editor
Multi-terabyte solid state storage arrays are seeping into the server environment in the same way that RAID systems did back in the early 1990s.

But just as those RAID pioneers learned that there was a lot more to making a reliable disk array than stuffing a lot of PC hard disks into a box with a fan and a power supply - so too will multi-terabyte SSD users discover that problems which are undetectable or harmless in small SSDs can lead to serious data corruption risks when those same SSDs are scaled up without the right architecture - and sometimes even with it in place.
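A rough sketch of why scale matters: if each stored bit has a small, independent chance p of being returned wrong without detection, the chance of at least one such error across n bits is 1 - (1 - p)^n. The error rate used below is an assumed, purely illustrative figure (not a measurement of any product), and real errors are rarely independent, so treat this as back-of-envelope arithmetic only.

```python
import math

BITS_PER_TB = 8e12  # bits in one decimal terabyte


def prob_at_least_one_error(bits: float, error_rate_per_bit: float) -> float:
    """Return 1 - (1 - p)^n, computed in a way that stays accurate for tiny p."""
    return -math.expm1(bits * math.log1p(-error_rate_per_bit))


# ASSUMPTION: hypothetical undetected-error rate per bit, for illustration only.
ASSUMED_ERROR_RATE = 1e-16

for capacity_tb in (0.1, 1, 10, 100):
    p = prob_at_least_one_error(capacity_tb * BITS_PER_TB, ASSUMED_ERROR_RATE)
    print(f"{capacity_tb:>6} TB read once: P(at least one bad bit) = {p:.6f}")
```

With a fixed per-bit rate, the probability grows roughly in proportion to capacity while it is small, which is why a risk that can be ignored in one small drive may matter in a large array.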

I know from the emails I get that many readers think that once they've looked at the single issue of flash endurance - they've covered the bases for enterprise SSDs. While endurance remains a challenge for each new flash SSD generation - it's only one of many dimensions in the SSD life mix. That's why (in 2008) StorageSearch.com started this directory of definitive technology articles to help guide readers through the reliability maze.

Users with significant storage investments need simple guidelines to help them get the best results from the different types of media they use. That's always been true in the past and will remain so in the future.

A good theoretical understanding of data failure modes is what lies behind the way that mature storage products are designed and managed. But these complex considerations can be translated into simple guides for users.

This SSD reliability collection will provide users with the theoretical justifications they need when they are faced with the difficult economic choices that come from deploying different types of SSDs (with different cost models) in different applications within their organizations.

recommended articles and papers about SSD reliability
  • Can you trust your flash SSD specs? - the product which you carefully qualified may not be identical to the one that's going into your production line, because the SSD OEM has "improved" it. But the improvement makes another operating parameter - one which you care deeply about - unacceptably worse.
  • sweetening for the enterprise - discusses the different flavors of SLC, eMLC and MLC and the competing management schemes which transform unreliable flash memory chips into reliable enterprise storage.
  • Flash SSD Data Reliability and Lifetime (pdf) - written by Imation - starting from a description of floating gates and going all the way up to the architecture of a flash SSD this paper includes good descriptions of data failure modes, including:- erase failure, (erase) stress induced long term leakage, disturb faults, and the potential for inadequate error correction code coverage in MLC.
  • Why Raw NAND Flash with Hardware-based ECC is the Way to Go - extract - "Error rates are increasing substantially as flash manufacturers push the limits of physics. Errors can be introduced externally by heat or other radiation, during writes or reads of data, and even to data that was successfully written at a different time."
  • Increasing Flash SSD Reliability - this classic article by SiliconSystems (published in 2005) remains a good read today. Here's the original editor's intro:-

    SSDs, based on flash technology, have greatly improved in performance in recent years and now compete head to head with RAM based accelerator systems. Flash also has significant advantages in servers compared to RAM SSDs due to low power consumption. But if you think that all solid state disks which use flash are equally reliable and enduring then think again. That's a bit like saying that a Mercedes 300SL sports coupe is as tough as a Tiger tank because both were made in Germany and both are built out of metal. But as Oddball (Donald Sutherland) says in the movie Kelly's Heroes "I ain't messing with no Tigers." This article by SiliconSystems shows how their patented architecture cleverly manages the wear out mechanisms inherent in all flash media to deliver a disk lifetime that is about 4 times greater than that of other enterprise flash products and up to 100 times greater than intrinsic flash memory.
  • Consistency Groups: The Trouble with Stand-alone SSDs - by Woody Hutsell (published March 2011) discusses different approaches to maintaining data in SSDs in the event of an SSD failure. Some approaches - while simple to implement - can have a large negative impact on performance.
  • Flash Solid State Disks - Inferior Technology or Closet Superstar? - this is another classic article by BiTMICRO (published February 2004).

    This article was one of many pioneering communications from the flash SSD market to get users to think about flash in a different way. Its main message was - "A general perception in the computing industry is that only DRAM is robust enough for enterprise use. That sentiment doesn't give enough credit to flash memory."
  • Flash Memory Failure Analysis (ppt) - (published November 2007) by Purnima Vuggina, Intel - outlines common physical causes of failure in flash memory. It also describes the problems of failure analysis, and future challenges.

Most of the papers above talk about flash, because that's the new technology coming into the datacenter. But don't go away with the idea that RAM SSD arrays don't have data corruption modes too. The difference may be that some long established vendors in this part of the market have been designing products which mitigate these risk factors. But that doesn't mean that new market entrants know what they should be doing.

Even big OEMs can make elementary mistakes which cost billions of dollars in lost sales - as I described in my 2001 article Looking Back on Sun's Cache Memory Problem.

The first phase in the SSD market revolution was when users became aware of the potential benefits of SSDs and when these products reached price points many of them could afford.

The next phase will be when enterprise users move away from a technology focused market (which is what they are being offered by vendors now) towards an application-specific SSD market in which they have to choose which products work best for their own specific deployments.

Users today are faced with the dilemma of paying vastly different prices for products which are superficially similar from the capacity and IOPS point of view - but which may be vastly different in data reliability.

By "data reliability" I don't mean that the SSD has failed - but that some data within the SSD array has been altered or corrupted. (And will continue accumulating data corruptions even if you swap in new replacement drives of the same type.)

The cost of data corruption differs from one application to another and from one business to another.

Balancing risk against cost is a decision users make when they choose a supplier - even if they have not consciously analyzed the issues which matter. And choosing a more expensive supplier doesn't protect the user from being mis-sold the wrong type of product.

Many mistakes will be made by vendors and users.

For the next phase in the SSD market revolution to keep its momentum, users need guidance they can trust to help them navigate the many complex decisions which go beyond performance speedup or power saving considerations.

see also:- How Bad is - Choosing the Wrong SSD Supplier?
Surviving SSD sudden power loss
Why should you care what happens in an SSD when the power goes down?

This important design feature - which barely rates a mention in most SSD datasheets and press releases - has a strong impact on SSD data integrity and operational reliability.

This article will help you understand why some SSDs which work perfectly well in one type of application might fail in others... even when the changes in the operational environment appear to be negligible.

If you thought endurance was the end of the SSD reliability story - think again. (See the article: SSD power down architectures and characteristics.)
"While RAM can be made insensible to soft errors in many different ways (by design or by software) NVMs are also susceptible to irradiation errors... The lack of any refresh cycle of the stored information make flash memories vulnerable to data loss at each exposure to ionizing radiation even at the amounts which occur at sea level and in terrestrial environments."
...Emanuele Verrelli and Dimitris Tsoukalas, in their chapter called Radiation Hardness of Flash and Nanoparticle Memories - in the multi-author free online book Flash Memories - published in September 2011 by InTech

Editor's comments:- that's another reason you need to run a data healing process in the SSD controller task list BTW - not just to fix disturb errors.

An early citation of flash SSD healing here on the mouse site was in my interview with Fusion-io's David Flynn (Dec 2010).
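As a loose illustration of what a background "data healing" pass might look like, here is a minimal sketch. It is not how any particular SSD controller works: real controllers rely on hardware ECC inside the flash pages, whereas this toy version assumes a CRC32 checksum per block and a mirrored second copy to repair from. The names Block, write_block and scrub are hypothetical.

```python
import zlib
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    data: bytearray
    crc: int  # checksum recorded when the block was last written


def write_block(payload: bytes) -> Block:
    """Store a payload together with its CRC32 checksum."""
    return Block(bytearray(payload), zlib.crc32(payload))


def scrub(primary: List[Block], mirror: List[Block]) -> int:
    """One healing pass: verify every block and rewrite bad ones from the mirror.

    ASSUMPTION: a mirror copy exists; it stands in for real ECC-based correction.
    Returns the number of blocks repaired.
    """
    repaired = 0
    for i, block in enumerate(primary):
        if zlib.crc32(bytes(block.data)) == block.crc:
            continue  # block still matches its checksum, nothing to do
        good = mirror[i]
        if zlib.crc32(bytes(good.data)) != good.crc:
            raise RuntimeError(f"block {i}: both copies are corrupt")
        primary[i] = write_block(bytes(good.data))  # rewrite with fresh data
        repaired += 1
    return repaired
```

The point of running such a pass periodically, rather than only on reads, is that errors which accumulate silently in rarely read data get found and rewritten while they are still repairable.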

StorageSearch.com is published by ACSL