
SSD Reliability

by Zsolt Kerekes, editor - published:- June 25, 2008

See also:- SSD software / SSD controllers / high availability SSDs / rethinking DRAM
SSD Reliability - managing data integrity in mission-critical solid state storage arrays

Multi-terabyte (and now multi-petabyte) solid state storage arrays are seeping into the server environment in the same way that RAID systems did back in the early 1990s.

But just as those RAID pioneers learned that there was a lot more to making a reliable disk array than stuffing a lot of PC hard disks into a box with a fan and a power supply - so too will multi-terabyte SSD users (already on the roadmap to installations of petabyte SSDs) discover that problems which are undetectable or harmless in small SSDs can lead to serious data corruption risks when those same SSDs are scaled up without the right architecture - and sometimes even when the right architecture is in place.
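To see why scale matters, here is a rough back-of-envelope sketch. It assumes independent bit errors at a fixed uncorrected bit error rate, which is a simplification, and the rate used (`ASSUMED_RATE`) is a made-up illustrative figure rather than a number from any vendor or from the articles listed here. The function name is hypothetical too.

```python
import math

def prob_at_least_one_error(bytes_read: float, bit_error_rate: float) -> float:
    """Chance that reading `bytes_read` bytes hits at least one uncorrected
    bit error, assuming errors are independent at `bit_error_rate` per bit."""
    bits = bytes_read * 8
    # 1 - (1 - p)^n, computed as -expm1(n * log1p(-p)) to stay accurate
    # when p is tiny and n is huge.
    return -math.expm1(bits * math.log1p(-bit_error_rate))

# Hypothetical rate chosen only for illustration (an assumption).
ASSUMED_RATE = 1e-16

for label, size in [("100 GB", 100e9), ("10 TB", 10e12), ("1 PB", 1e15)]:
    chance = prob_at_least_one_error(size, ASSUMED_RATE)
    print(f"{label:>6} read once: {chance:.4%} chance of an uncorrected error")
```

Under these assumptions the same per-bit rate that is negligible for a single small drive becomes a coin-toss level risk once a full petabyte is read, which is the scaling effect described above.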

I know from the emails I get that many readers think that once they've looked at the single issue of flash endurance - they've covered the bases for enterprise SSDs. While endurance remains a challenge for each new flash SSD generation - it's only one of many dimensions in the SSD life mix. That's why (in June 2008) I started this directory of definitive technology articles to help guide readers through the reliability maze.

Users with significant storage investments need simple guidelines to help them get the best results from the different types of media they use. That's always been true in the past and will remain so in the future.

A good theoretical understanding of data failure modes is what lies behind the way that mature storage products are designed and managed. But these complex considerations can be translated into simple guides for users.

This SSD reliability collection will provide users with the theoretical justifications they need when they are faced with the difficult economic choices that come from deploying different types of SSDs (with different cost models) in different applications within their organizations.

recommended articles and papers about SSD reliability
  • Can you trust your flash SSD specs? - the product which you carefully qualified may not be identical to the one that's going into your production line, because the SSD oem has "improved" it. But the improvement makes another operating parameter - which you deeply care about - unacceptably worse.
  • Flash SSD Data Reliability and Lifetime (pdf) - written by Imation - starting from a description of floating gates and going all the way up to the architecture of a flash SSD this paper includes good descriptions of data failure modes, including:- erase failure, (erase) stress induced long term leakage, disturb faults, and the potential for inadequate error correction code coverage in MLC.
  • Why Raw NAND Flash with Hardware-based ECC is the Way to Go - extract - "Error rates are increasing substantially as flash manufacturers push the limits of physics. Errors can be introduced externally by heat or other radiation, during writes or reads of data, and even to data that was successfully written at a different time."
  • Increasing Flash SSD Reliability - this classic article (published in 2005) remains a good read today. Here's the original editor's intro:-

    SSDs, based on flash technology, have greatly improved in performance in recent years and now compete head to head with RAM based accelerator systems. Flash also has significant advantages in servers compared to RAM SSDs due to low power consumption. But if you think that all solid state disks which use flash are equally reliable and enduring then think again. That's a bit like saying that a Mercedes 300SL sports coupe is as tough as a Tiger tank because both were made in Germany and both are built out of metal. But as Oddball (Donald Sutherland) says in the movie Kelly's Heroes "I ain't messing with no Tigers." This article by SiliconSystems shows how their patented architecture cleverly manages the wear out mechanisms inherent in all flash media to deliver a disk lifetime that is about 4 times greater than that of other enterprise flash products and up to 100 times greater than intrinsic flash memory.
  • Consistency Groups: The Trouble with Stand-alone SSDs - by Woody Hutsell (published March 2011) discusses different approaches to maintaining data in SSDs in the event of an SSD failure. Some approaches - while simple to implement - can have a large negative impact on performance.
  • Flash Solid State Disks - Inferior Technology or Closet Superstar? - this is another classic article by BiTMICRO (published February 2004).

    This article was one of many pioneering communications from the flash SSD market to get users to think about flash in a different way. Its main message was - "A general perception in the computing industry is that only DRAM is robust enough for enterprise use. That sentiment doesn't give enough credit to flash memory."
  • Flash Memory Failure Analysis (ppt) - (published November 2007) by Purnima Vuggina, Intel - outlines common physical causes of failure in flash memory. It also describes the problems of failure analysis, and future challenges.

Most of the papers above talk about flash, because that's the new technology coming into the datacenter. But don't go away with the idea that RAM SSD arrays don't have data corruption modes too. The difference may be that some long established vendors in this part of the market have been designing products which mitigate these risk factors. But that doesn't mean that new market entrants know what they should be doing.

Even big oems can make elementary mistakes which cost billions of dollars of lost sales - as I described in my 2001 article Looking Back on Sun's Cache Memory Problem.
"While RAM can be made insensible to soft errors in many different ways (by design or by software) NVMs are also susceptible to irradiation errors... The lack of any refresh cycle of the stored information make flash memories vulnerable to data loss at each exposure to ionizing radiation even at the amounts which occur at sea level and in terrestrial environments."
...Emanuele Verrelli and Dimitris Tsoukalas, in their chapter called Radiation Hardness of Flash and Nanoparticle Memories - in the multi-author free online book Flash Memories - published in September 2011 by InTech

The power grid had been taken out by a falling tree, and the hurricane force wind was too fast to safely operate the turbine. The standby generator had run out of gas and the batteries of the PV array and his notebook had finally run flat. So Megabyte was writing his next SSD article using candle-light while waiting for the logs in the CHP (combined heat and power) burner to get hot enough to generate some steam. Just another regular night at the editorial office.
The first phase in the SSD market revolution was when users became aware of the potential benefits of SSDs and when these products reached price points many of them could afford.

The next phase will be when enterprise users move away from a technology focused market (which is what they are being offered by vendors now) towards an applications specific SSD market in which they have to choose which products work best for their own specific deployments.

Users today are faced with the dilemma of paying vastly different price points for products which are superficially similar from the capacity and IOPS point of view - but which may be vastly different in data reliability.

By "data reliability" I don't mean that the SSD has failed - but that some data within the SSD array has been altered or corrupted. (And will continue accumulating data corruptions even if you swap in new replacement drives of the same type.)
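As a minimal sketch of that distinction, the toy example below keeps a CRC32 checksum beside each block and later "scrubs" the store to find blocks whose contents no longer match. The function names (`store`, `scrub`) and the block layout are assumptions made for illustration - this is not how any particular SSD or array implements integrity checking. The point is only that the store still answers every read without complaint while one block has silently changed.

```python
import zlib

def store(blocks):
    """Keep each block together with the CRC32 computed when it was written."""
    return [(data, zlib.crc32(data)) for data in blocks]

def scrub(stored):
    """Return the indexes of blocks whose data no longer matches its checksum."""
    return [i for i, (data, crc) in enumerate(stored) if zlib.crc32(data) != crc]

stored = store([b"alpha", b"bravo", b"charlie"])

# Simulate silent corruption: flip one bit in block 1 but leave the
# checksum recorded at write time untouched.
data, crc = stored[1]
stored[1] = (bytes([data[0] ^ 0x01]) + data[1:], crc)

print("corrupted blocks:", scrub(stored))
```

Without a check of this kind the altered block would simply be returned to the application as if it were good data.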

The cost of data corruption varies from one application to another and from one business to another.

Balancing risk against cost is a decision users make when they choose a supplier - even if they have not consciously analyzed the issues which matter. And choosing a more expensive supplier doesn't protect the user from being mis-sold the wrong type of product.

Many mistakes will be made by vendors and users.

For the next phase in the SSD market revolution to keep its momentum, users need guidance they can trust to help them navigate the many complex decisions which go beyond performance speedup or power saving considerations.

see also:- How Bad is - Choosing the Wrong SSD Supplier?
bath tub curve is not the most useful way of thinking about PCIe SSD failures - according to a large scale study within Facebook
Editor:- June 15, 2015 - A recently published research study - Large-Scale Study of Flash Memory Failures in the Field (pdf) - which analyzed failure rates of PCIe SSDs used in Facebook's infrastructure over a 4 year period - yields some very useful insights into the user experience of large populations of enterprise flash. Among the many findings:-
  • Read disturbance errors - seem to be very well managed in the enterprise SSDs studied.

    The authors said they "did not observe a statistically significant difference in the failure rate between SSDs that have read the most amount of data versus those that have read the least amount of data."
  • Higher operational temperatures mostly led to increased failure rates, but the effect was more pronounced for SSDs which didn't use aggressive data throttling techniques - which could prevent runaway temperatures by throttling back their write performance.
  • More data written by the hosts to the SSDs over time - mostly resulted in more failures - but the authors noted that in some of the platforms studied - more data written resulted in lower failure rates.

    This was attributed to the fact some SSD software implementations work better at reducing write amplification when they are exposed to more workload patterns.
  • Unlike the classic bathtub curve failure model which applies to hard drives - SSDs can be characterized as having an early warning phase - which comes before an early failure weed out phase of the worst drives in the population and which precedes the onset of predicted endurance based wearout.

    In this aspect - a small percentage of rogue SSDs account for a disproportionately high percentage of the total data errors in the population.
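To make the "rogue drives" observation concrete, here is a small sketch of how an operator might measure it from per-drive error counters. The fleet numbers below are invented for illustration and are not taken from the Facebook study, and the function name is hypothetical.

```python
def error_share_of_worst(errors_per_drive, fraction):
    """Share of all errors contributed by the worst `fraction` of drives."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    total = sum(errors_per_drive)
    if total == 0:
        return 0.0
    ranked = sorted(errors_per_drive, reverse=True)
    # Always look at at least one drive, even in a tiny fleet.
    worst_count = max(1, round(len(ranked) * fraction))
    return sum(ranked[:worst_count]) / total

# Made-up fleet of 100 drives: most are clean, two are rogue.
fleet = [0] * 90 + [1] * 8 + [400, 600]
share = error_share_of_worst(fleet, 0.02)
print(f"worst 2% of drives account for {share:.1%} of all errors")
```

A skew like this suggests that finding and retiring the few worst drives early may do more for data integrity than treating the whole population as if it were wearing out evenly.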
The report contains plenty of raw data and graphs which can be a valuable resource for SSD designers and software writers to help them understand how they can tailor their efforts towards achieving more reliable operation. the article (pdf)
Surviving SSD sudden power loss
Why should you care what happens in an SSD when the power goes down?

This important design feature - which barely rates a mention in most SSD datasheets and press releases - has a strong impact on SSD data integrity and operational reliability.

This article will help you understand why some SSDs (which work perfectly well in one type of application) might fail in others... even when the changes in the operational environment appear to be negligible.
If you thought endurance was the end of the SSD reliability story - think again.
Published by ACSL