published:- June 25, 2008, updated:- May 10, 2013 (new external
SSD Reliability - managing data integrity in
mission-critical solid state storage arrays
Solid state storage arrays are seeping into the server environment in the
same way that
RAID systems did back in
the early 1990s.
But just as those RAID pioneers learned that there was a lot more to
making a reliable disk array than stuffing a lot of PC
hard disks into a box
with a fan and a power supply - so too will multi-terabyte SSD users (already
on the roadmap to installations of
discover that problems which are undetectable or do no harm in small SSDs can
lead to serious data corruption risks when those same SSDs are scaled up
without the right
and sometimes with it in place too.
I know from the emails I get that
many readers think that once they've looked at the single issue of
- they've covered the bases for enterprise SSDs. While endurance remains
a challenge for each new flash SSD generation - it's only a single one of many
dimensions in the SSD life mix. That's why (in June
StorageSearch.com started this directory of definitive technology articles to
help guide readers through the reliability maze.
significant storage investments need simple guidelines to help them get the
best results from the different types of media they use. That's always been true
in the past and will remain so in the future.
A good theoretical understanding of data failure modes is what lies
behind the way that mature storage products are designed and managed. But these
complex considerations can be translated into simple guides for users.
SSD reliability collection will provide users with the theoretical
justifications they need when they are faced with the difficult economic choices
that come from deploying different types of SSDs (with different cost models)
in different applications within their organizations.
|recommended articles and
papers about SSD reliability|
- Can you trust your
flash SSD specs? - the product which you carefully qualified may not be
identical to the one that's going into your production line, because the SSD oem
has "improved" it. But the improvement makes another operating
parameter - which you deeply care about - unacceptably worse.
SSD Data Reliability and Lifetime (pdf) - written by
Imation - starting
from a description of floating gates and going all the way up to the
architecture of a flash SSD this paper includes good descriptions of data
failure modes, including:- erase failure, (erase) stress induced long term
leakage, disturb faults, and the potential for inadequate error correction
code coverage in MLC.
Raw NAND Flash with Hardware-based ECC is the Way to Go - extract - "Error
rates are increasing substantially as flash manufacturers push the limits of
physics. Errors can be introduced externally by heat or other radiation, during
writes or reads of data, and even to data that was successfully written at a
Flash SSD Reliability - this classic article by (published in
2005) remains a good read today. Here's the original editor's intro:-
based on flash technology, have greatly improved in performance in recent years
and now compete head to head with RAM based accelerator systems. Flash also has
significant advantages in servers compared to RAM SSDs due to low power
consumption. But if you think that all solid state disks which use flash are
equally reliable and enduring then think again. That's a bit like saying that
a Mercedes 300SL sports coupe is as tough as a Tiger tank because both were
made in Germany and both are built out of metal. But as Oddball (Donald
Sutherland) says in the movie
Kelly's Heroes "I ain't messing with no Tigers." This article by
shows how their patented architecture cleverly manages the wear out mechanisms
inherent in all flash media to deliver a disk lifetime that is about 4 times
greater than that of other enterprise flash products and up to 100 times greater than
intrinsic flash memory.
Groups: The Trouble with Stand-alone SSDs - by Woody Hutsell (published
March 2011) discusses different approaches to maintaining data in SSDs in the
event of an SSD failure. Some approaches - while simple to implement - can have
a large negative impact on performance.
- Flash Solid
State Disks - Inferior Technology or Closet Superstar? - this is another
classic article by BiTMICRO
(published February 2004).
This article was one of many pioneering
communications from the flash SSD market to get users to think about flash in a
different way. Its main message was - "A general perception in the
computing industry is that only DRAM is robust enough for enterprise use. That
sentiment doesn't give enough credit to flash memory."
Memory Failure Analysis (ppt) - (published November 2007) by Purnima Vuggina
Intel - outlines
common physical causes of failure in flash memory. It also describes the
problems of failure analysis, and future challenges.
Most of the papers above talk about flash, because that's the new
technology coming into the datacenter. But don't go away with the idea that
RAM SSD arrays don't
have data corruption modes too. The difference may be that some long established
vendors in this part of the market have been designing products which mitigate
these risk factors. But that doesn't mean to say that new market entrants know
what they should be doing.
Even big oems can make elementary
mistakes which cost billions of dollars of lost sales - as I described in my
Back on Sun's Cache Memory Problem.
|the SSD Reliability
|The power grid had been taken out by a
falling tree, and the hurricane force wind was too fast to safely operate the
turbine. The standby generator had run out of gas and the batteries of the PV
array and his notebook had finally run flat. So Megabyte was writing his
next SSD article using candle-light while waiting for the logs in the CHP
burner to get hot enough to generate some steam. Just another regular night at
the SSDmouse office.|
|The first phase in the
SSD market revolution was when users became aware of the potential
SSDs and when these products reached price points many of them could afford.|
The next phase will be when enterprise users move away from a technology
focused market (which is what they are being offered by vendors now) towards
an applications specific SSD market in which they have to choose which
products work best for their own
Users today are faced with the dilemma of paying
vastly different price
points for products which are superficially similar from the capacity and
IOPS point of view - but which may be vastly different in data reliability.
By "data reliability" I don't mean that the SSD has
failed - but that some data within the SSD array has been altered or corrupted.
(And will continue accumulating data corruptions even if you swap in new
replacement drives of the same type.)
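To picture why corruption that is negligible in a small SSD becomes a real risk in a large array, here is a rough back-of-envelope sketch in Python. Everything numeric in it is an assumption chosen for illustration: the uncorrectable bit error rate (`ASSUMED_UBER`), the two capacities and the "read the full capacity once a day for a year" workload are not taken from any vendor datasheet or from the articles listed on this page. It also assumes errors are independent and occur at a constant rate, which real flash does not guarantee.

```python
import math


def prob_at_least_one_error(uber: float, bytes_read: float) -> float:
    """Chance of one or more uncorrectable bit errors while reading bytes_read.

    Assumes independent bit errors at a constant rate `uber` (errors per bit read).
    """
    bits = bytes_read * 8
    # 1 - (1 - uber)**bits, computed in a numerically stable way
    return -math.expm1(bits * math.log1p(-uber))


ASSUMED_UBER = 1e-16          # illustrative assumption, not a measured figure
FULL_READS_PER_YEAR = 365     # assumed workload: one full-capacity read per day

for label, capacity_bytes in [("64 GB SSD", 64e9), ("100 TB array", 100e12)]:
    p = prob_at_least_one_error(ASSUMED_UBER, capacity_bytes * FULL_READS_PER_YEAR)
    print(f"{label}: {p:.1%} chance of at least one uncorrectable error per year")
```

With these assumed numbers the single small drive sees an error only rarely, while the array built from the same type of drive is almost certain to see one. Swapping in replacement drives with the same error rate does not change the result, because the exposure comes from the volume of data, not from any one device.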
The cost of data corruption is
different for different applications and in different business applications.
Balancing risk against cost is a decision users make when they choose
a supplier - even if they have not consciously analyzed the issues which matter.
And choosing a more expensive supplier doesn't protect the user from being
mis-sold the wrong type of product.
Many mistakes will be made by
vendors and users.
For the next phase in the SSD market revolution to
keep its momentum, users need guidance they can trust to help them navigate the
many complex decisions which are beyond performance speedup or power saving
How Bad is - Choosing
the Wrong SSD Supplier?
sudden power loss|
|Why should you care
what happens in an SSD when the power goes down? |
This important design
feature - which barely rates a mention in most SSD datasheets and press releases
- has a strong impact on
SSD data integrity
This article will help you understand why some
SSDs which work perfectly well in one type of application might fail in
others... even when the changes in the operational environment appear to be
|"While RAM can be
made insensible to soft errors in many different ways (by design or by
software) NVMs are also susceptible to irradiation errors... The lack of any
refresh cycle of the stored information make flash memories vulnerable to data
loss at each exposure to ionizing radiation even at the amounts which occur at
sea level and in terrestrial environments."|
Verrelli and Dimitris
Tsoukalas, in their chapter called Radiation Hardness of Flash and
Nanoparticle Memories - in the multi-author free online book Flash Memories
- published in September 2011 by InTech
Editor's comments:- that's another reason you need to run
a data healing process in the SSD controller task list BTW - not just to fix
An early citation of flash SSD healing here on the
mouse site was in my
with Fusion-io's David Flynn (Dec 2010).
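The "data healing process" mentioned in the editor's comments can be pictured as a background scrub: read every stored block, check it against some integrity code, and rewrite any copy that has quietly gone bad. The Python sketch below is a toy model under stated assumptions, not a description of how any real SSD controller works. The `MirroredStore` class, its two in-memory replicas and the use of CRC32 are all hypothetical stand-ins; real controllers typically rely on hardware ECC and their own redundancy schemes.

```python
import zlib


class MirroredStore:
    """Toy two-copy block store with one CRC32 per block (hypothetical, illustration only)."""

    def __init__(self):
        self.copies = ([], [])   # two replicas of every block
        self.crcs = []           # checksum recorded at write time

    def write(self, data: bytes) -> int:
        """Store the block in both replicas and return its index."""
        for replica in self.copies:
            replica.append(bytearray(data))
        self.crcs.append(zlib.crc32(data))
        return len(self.crcs) - 1

    def scrub(self) -> int:
        """Verify every copy of every block; repair bad copies from a good one.

        Returns the number of copies repaired on this pass.
        """
        repaired = 0
        for i, crc in enumerate(self.crcs):
            good = next((r[i] for r in self.copies if zlib.crc32(r[i]) == crc), None)
            if good is None:
                raise IOError(f"block {i}: no intact copy left")
            for replica in self.copies:
                if zlib.crc32(replica[i]) != crc:
                    replica[i] = bytearray(good)   # heal from the intact copy
                    repaired += 1
        return repaired


store = MirroredStore()
idx = store.write(b"mission-critical record")
store.copies[0][idx][0] ^= 0x01   # simulate a silent single-bit flip in one replica
print(store.scrub())              # number of copies repaired on this pass
```

The point of running the scrub as a regular background task is that a corrupted copy gets repaired while an intact copy still exists, before a second fault (from radiation, disturb effects or anything else) hits the other replica.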