the SSD Reliability Papers - StorageSearch.com



SSD Reliability

by Zsolt Kerekes, editor - StorageSearch.com - published:- June 25, 2008

See also:- SSD software / SSD controllers / high availability SSDs / rethinking DRAM

SSD Reliability - managing data integrity in mission-critical solid state storage arrays

Multi-terabyte (and now multi-petabyte) solid state storage arrays are seeping into the server environment in the same way that RAID systems did back in the early 1990s.

But just as those RAID pioneers learned that there was a lot more to making a reliable disk array than stuffing a lot of PC hard disks into a box with a fan and a power supply, multi-terabyte SSD users (already on the roadmap to installations of petabyte SSDs) will discover that problems which are undetectable or harmless in small SSDs can lead to serious data corruption risks when those same SSDs are scaled up without the right architecture - and sometimes even when the right architecture is in place.

I know from the emails I get that many readers think that once they've looked at the single issue of flash endurance - they've covered the bases for enterprise SSDs. While endurance remains a challenge for each new flash SSD generation - it's only one of many dimensions in the SSD life mix. That's why (in June 2008) StorageSearch.com started this directory of definitive technology articles to help guide readers through the reliability maze.

Users with significant storage investments need simple guidelines to help them get the best results from the different types of media they use. That's always been true in the past and will remain so in the future.

A good theoretical understanding of data failure modes is what lies behind the way that mature storage products are designed and managed. But these complex considerations can be translated into simple guides for users.

This SSD reliability collection will provide users with the theoretical justifications they need when they are faced with the difficult economic choices that come from deploying different types of SSDs (with different cost models) in different applications within their organizations.

recommended articles and papers about SSD reliability

storage reliability - includes articles about reliability for all media (not just SSDs)

high availability fault tolerant enterprise SSDs - news and directory about SSDs which are factory built to provide no single point of failure.

Can you trust your flash SSD specs? - the product which you carefully qualified may not be identical to the one that's going into your production line, because the SSD oem has "improved" it. But the improvement makes another operating parameter - which you deeply care about - unacceptably worse.

Flash SSD Reliability (pdf) - explains how Texas Memory Systems have engineered reliability into their multi-terabyte flash SSDs using over-provisioning, RAID, wear-leveling etc.

SSD Myths and Legends - "write endurance" - helped the market re-evaluate one aspect of flash SSD reliability

is eMLC the true successor to SLC in enterprise flash SSD? - which so-called "enterprise MLC" tastes the sweetest? How come there are so many different and contradictory reliability claims?

Data Integrity Challenges in flash SSD Design - article by SandForce explains how enterprise SSD designers can reduce the risk of "silent errors."

what happens in SSDs when power goes down? - surveys different designs of power management architectures and discusses their consequences.

Flash SSD Data Reliability and Lifetime (pdf) - written by Imation - starting from a description of floating gates and going all the way up to the architecture of a flash SSD this paper includes good descriptions of data failure modes, including:- erase failure, (erase) stress induced long term leakage, disturb faults, and the potential for inadequate error correction code coverage in MLC.

Noise Damping Techniques for PATA SSDs in Military Systems (pdf) - written by SiliconSystems - useful reminders for PCB designers about good design techniques for electromagnetic compatibility to reduce spurious data corruption in PATA SSDs.

"NAND Flash Solid State Storage for the Enterprise - an in-depth Look at Reliability." (pdf) - published as part of SNIA's SSD initiative - this article is co-authored by:- Jonathan Thatcher (Fusion-io), Tom Coughlin (Coughlin Associates), Jim Handy (Objective Analysis) and Neal Ekker (Texas Memory Systems).

The article contains the best integrated explanation I've seen of the design trade-offs for error correction schemes and how they affect bit error rates compared to the raw uncorrected results. It goes on to explain the importance of the SSD controller and memory architecture (dispersing data among many chips) and how these can improve data integrity by managing read disturb errors. It also discusses wear-leveling and write amplification which have been well covered elsewhere.
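As a rough illustration of that trade-off between raw and corrected error rates, here is a minimal sketch which is not taken from the SNIA paper. It assumes bit errors are independent (real flash errors are often correlated) and that the code corrects up to a fixed number of bits per codeword. The function name and the example figures are hypothetical.

```python
from math import exp, lgamma, log, log1p


def codeword_failure_probability(raw_ber, codeword_bits, correctable_bits):
    """Probability that a codeword holds more bit errors than the ECC can fix.

    Assumes independent bit errors at rate raw_ber (a simplifying assumption).
    The binomial tail is summed in log space so large codewords do not overflow.
    """
    if not 0.0 < raw_ber < 1.0:
        raise ValueError("raw_ber must be strictly between 0 and 1")
    total = 0.0
    for k in range(correctable_bits + 1, codeword_bits + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k)
        log_term = (
            lgamma(codeword_bits + 1)
            - lgamma(k + 1)
            - lgamma(codeword_bits - k + 1)
            + k * log(raw_ber)
            + (codeword_bits - k) * log1p(-raw_ber)
        )
        total += exp(log_term)
    return total


# Hypothetical figures: 4096-bit codewords, raw bit error rate of 1e-4.
weak = codeword_failure_probability(1e-4, 4096, 4)
strong = codeword_failure_probability(1e-4, 4096, 8)
print(f"4-bit correction: {weak:.3e}   8-bit correction: {strong:.3e}")
```

With the same raw error rate, a stronger code leaves fewer uncorrectable codewords, at the cost of more parity bits and decoder complexity. That is the kind of design trade-off the article discusses.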

Why Raw NAND Flash with Hardware-based ECC is the Way to Go - extract - "Error rates are increasing substantially as flash manufacturers push the limits of physics. Errors can be introduced externally by heat or other radiation, during writes or reads of data, and even to data that was successfully written at a different time."

Why Consumers Can Expect More Flaky Flash SSDs! - Why is it already so bad? Why will it get even worse?

Increasing Flash SSD Reliability - this classic article by SiliconSystems (published in 2005) remains a good read today. Here's the original editor's intro:-

SSDs, based on flash technology, have greatly improved in performance in recent years and now compete head to head with RAM based accelerator systems. Flash also has significant advantages in servers compared to RAM SSDs due to low power consumption. But if you think that all solid state disks which use flash are equally reliable and enduring then think again. That's a bit like saying that a Mercedes 300SL sports coupe is as tough as a Tiger tank because both were made in Germany and both are built out of metal. But as Oddball (Donald Sutherland) says in the movie Kelly's Heroes "I ain't messing with no Tigers." This article by SiliconSystems shows how their patented architecture cleverly manages the wear out mechanisms inherent in all flash media to deliver a disk lifetime that is about 4 times greater than that of other enterprise flash products and up to 100 times greater than intrinsic flash memory.

Consistency Groups: The Trouble with Stand-alone SSDs - by Woody Hutsell (published March 2011) discusses different approaches to maintaining data in SSDs in the event of an SSD failure. Some approaches - while simple to implement - can have a large negative impact on performance.

Flash Solid State Disks - Inferior Technology or Closet Superstar? - this is another classic article by BiTMICRO (published February 2004).

This article was one of many pioneering communications from the flash SSD market to get users to think about flash in a different way. Its main message was - "A general perception in the computing industry is that only DRAM is robust enough for enterprise use. That sentiment doesn't give enough credit to flash memory."

NAND Flash Memories for Spacecraft (doc) - written by Phil White, President of ECC Technologies - is based on a document originally written for NASA and JPL, who have employed these techniques in 2 missions which are already operating in space.

Signal Processing and the evolution of NAND flash memory (pdf) - written by Anobit's chief scientist Naftali Sommer (published December 2010) describes the role of DSP technology in improving the integrity of logic states read from flash cells.

Flash Memory Failure Analysis (ppt) - (published November 2007) by Purnima Vuggina (Intel) - outlines common physical causes of failure in flash memory. It also describes the problems of failure analysis, and future challenges.

how fast can your SSD run backwards? - 11 Key Symmetries in SSD design will help you understand what lies behind a lot of comparative anomalies in SSD behavior

intrinsic temperature related data rot in nand flash - (published July 2012) and written by WD's Director, Business Development, Eli Tiomkin discusses the physical mechanisms which lead to charge changes and data corruption in SSDs stored at different industrial temperatures.

FITs, reliability and abstraction levels in modeling SSDs - by Zsolt Kerekes, editor of StorageSearch.com (published June 2012) was written in response to readers who were worrying about the wrong aspects of SSD reliability.
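For readers unfamiliar with the unit in that last title: a FIT is conventionally one failure per billion (10^9) device-hours. The minimal sketch below is not taken from the article. It only shows how a FIT rate scales with population size, assuming a constant failure rate. The function names and the figure of 1,000 FITs are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365        # 8,760 hours, ignoring leap years
DEVICE_HOURS_PER_FIT = 1e9       # 1 FIT = 1 failure per 10^9 device-hours


def fit_to_mtbf_hours(fit_rate):
    """Mean time between failures (hours) for one device at a constant FIT rate."""
    return DEVICE_HOURS_PER_FIT / fit_rate


def expected_failures_per_year(fit_rate, device_count):
    """Expected failures per year across a population of identical devices."""
    return fit_rate * device_count * HOURS_PER_YEAR / DEVICE_HOURS_PER_FIT


# Hypothetical example: a device rated at 1,000 FITs.
print(fit_to_mtbf_hours(1000))                  # 1,000,000 hours for a single device
print(expected_failures_per_year(1000, 10000))  # about 87.6 failures a year in a 10,000 device array
```

A rate which sounds negligible for one drive still turns into a steady stream of events once thousands of devices sit in the same array.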

Most of the papers above talk about flash, because that's the new technology coming into the datacenter. But don't go away with the idea that RAM SSD arrays don't have data corruption modes too. The difference may be that some long established vendors in this part of the market have been designing products which mitigate these risk factors. But that doesn't mean that new market entrants know what they should be doing.

Even big oems can make elementary mistakes which cost billions of dollars of lost sales - as I described in my 2001 article Looking Back on Sun's Cache Memory Problem.

"While RAM can be made insensible to soft errors in many different ways (by design or by software) NVMs are also susceptible to irradiation errors... The lack of any refresh cycle of the stored information make flash memories vulnerable to data loss at each exposure to ionizing radiation even at the amounts which occur at sea level and in terrestrial environments."

...Emanuele Verrelli and Dimitris Tsoukalas, in their chapter called Radiation Hardness of Flash and Nanoparticle Memories - in the multi-author free online book Flash Memories - published in September 2011 by InTech


The power grid had been taken out by a falling tree, and the hurricane force wind was too fast to safely operate the turbine. The standby generator had run out of gas and the batteries of the PV array and his notebook had finally run flat. So Megabyte was writing his next SSD article using candle-light while waiting for the logs in the CHP (combined heat and power) burner to get hot enough to generate some steam. Just another regular night at the editorial office.

The first phase in the SSD market revolution was when users became aware of the potential benefits of SSDs and when these products reached price points many of them could afford.

The next phase will be when enterprise users move away from a technology focused market (which is what they are being offered by vendors now) towards an application-specific SSD market in which they have to choose which products work best for their own specific deployments.

Users today are faced with the dilemma of paying vastly different price points for products which are superficially similar from the capacity and IOPS point of view - but which may be vastly different in data reliability.

By "data reliability" I don't mean that the SSD has failed - but that some data within the SSD array has been altered or corrupted. (And will continue accumulating data corruptions even if you swap in new replacement drives of the same type.)

The cost of data corruption varies from one application to another and from one type of business to another.

Balancing risk against cost is a decision users make when they choose a supplier - even if they have not consciously analyzed the issues which matter. And choosing a more expensive supplier doesn't protect the user from being mis-sold the wrong type of product.

Many mistakes will be made by vendors and users.

For the next phase in the SSD market revolution to keep its momentum, users need guidance they can trust to help them navigate the many complex decisions which go beyond performance speedup or power saving considerations.

see also:- How Bad is - Choosing the Wrong SSD Supplier?

bath tub curve is not the most useful way of thinking about PCIe SSD failures - according to a large scale study within Facebook

Editor:- June 15, 2015 - A recently published research study - Large-Scale Study of Flash Memory Failures in the Field (pdf) - which analyzed failure rates of PCIe SSDs used in Facebook's infrastructure over a 4 year period - yields some very useful insights into the user experience of large populations of enterprise flash. Among the many findings:-

Read disturbance errors - seem to be very well managed in the enterprise SSDs studied.

The authors said they "did not observe a statistically significant difference in the failure rate between SSDs that have read the most amount of data versus those that have read the least amount of data."

Higher operational temperatures mostly led to increased failure rates, but the effect was more pronounced for SSDs which didn't use aggressive data throttling techniques - which can prevent runaway temperatures by throttling back write performance.

More data written by the hosts to the SSDs over time - mostly resulted in more failures - but the authors noted that in some of the platforms studied - more data written resulted in lower failure rates.

This was attributed to the fact some SSD software implementations work better at reducing write amplification when they are exposed to more workload patterns.

Unlike the classic bathtub curve failure model which applies to hard drives - SSDs can be characterized as having an early warning phase - which comes before an early failure weed out phase of the worst drives in the population and which precedes the onset of predicted endurance based wearout.

In this respect - a small percentage of rogue SSDs account for a disproportionately high percentage of the total data errors in the population.
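One simple way to check for that kind of concentration in your own fleet is to rank drives by error count and see what share of all errors comes from the worst few. The sketch below is only an illustration. The function name is hypothetical and the error counts are invented for the example, not taken from the Facebook study.

```python
def error_share_of_worst(error_counts, worst_fraction):
    """Share of all errors contributed by the worst `worst_fraction` of drives."""
    if not 0.0 < worst_fraction <= 1.0:
        raise ValueError("worst_fraction must be in (0, 1]")
    total = sum(error_counts)
    if total == 0:
        return 0.0
    ranked = sorted(error_counts, reverse=True)
    # Always look at at least one drive, even in a tiny population.
    n_worst = max(1, round(len(ranked) * worst_fraction))
    return sum(ranked[:n_worst]) / total


# Invented per-drive error counts for a population of 100 SSDs:
# 90 clean drives, 8 with a single error, and 2 rogue drives.
counts = [0] * 90 + [1] * 8 + [500, 1200]
share = error_share_of_worst(counts, 0.02)
print(f"worst 2% of drives account for {share:.1%} of all errors")
```

If a figure like this comes out high, swapping or quarantining the few rogue drives is likely to do more for data integrity than any fleet-wide change.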


The report contains plenty of raw data and graphs which can be a valuable resource for SSD designers and software writers to help them understand how they can tailor their efforts towards achieving more reliable operation. ...read the article (pdf)

Surviving SSD sudden power loss

Why should you care what happens in an SSD when the power goes down?

This important design feature - which barely rates a mention in most SSD datasheets and press releases - has a strong impact on SSD data integrity and operational reliability.

This article will help you understand why some SSDs (which work perfectly well in one type of application) might fail in others... even when the changes in the operational environment appear to be negligible.


If you thought endurance was the end of the SSD reliability story - think again. ...read the article


StorageSearch.com is published by ACSL