by
Zsolt Kerekes,
editor - StorageSearch.com
- published:- June 25, 2008
See also:-
SSD software /
SSD controllers /
high availability
SSDs /
rethinking DRAM
SSD Reliability - managing
data integrity in mission-critical solid state storage arrays
Multi-terabyte (and now
multi-petabyte)
solid state storage arrays are seeping into the server
environment in the
same way that
RAID systems did back in
the early 1990s.
But just as those RAID pioneers learned that there was a lot more to
making a reliable disk array than stuffing a lot of PC
hard disks into a box
with a fan and a power supply - so too will multi-terabyte SSD users (already
on the roadmap to installations of
Petabyte SSDs)
discover that problems which are undetectable or do no harm in small SSDs can
lead to serious data corruption risks when those same SSDs are scaled up
without the right
architecture
and sometimes even when it is in place.
I know from the emails I get
that many readers think that once they've looked at the single issue of
flash endurance
- they've covered the bases for enterprise SSDs. While endurance remains
a challenge for each new flash SSD generation - it's only one of many
dimensions in the SSD life mix. That's why (in June
2008)
StorageSearch.com started this directory of definitive technology articles to
help guide readers through the reliability maze.
Users with
significant storage investments need simple guidelines to help them get the
best results from the different types of media they use. That's always been true
in the past and will remain so in the future.
A good theoretical understanding of data failure modes is what lies
behind the way that mature storage products are designed and managed. But these
complex considerations can be translated into simple guides for users.
This
SSD reliability collection will provide users with the theoretical
justifications they need when they are faced with the difficult economic choices
that come from deploying different types of SSDs (with different cost models)
in different applications within their organizations.
recommended
articles and papers about SSD reliability
- Can you trust your
flash SSD specs? - the product which you carefully qualified may not be
identical to the one that's going into your production line, because the SSD oem
has "improved" it. But the improvement makes another operating
parameter - which you deeply care about - unacceptably worse.
- Flash
SSD Data Reliability and Lifetime (pdf) - written by
Imation - starting
from a description of floating gates and going all the way up to the
architecture of a flash SSD this paper includes good descriptions of data
failure modes, including:- erase failure, (erase) stress induced long term
leakage, disturb faults, and the potential for inadequate error correction
code coverage in MLC.
- Why
Raw NAND Flash with Hardware-based ECC is the Way to Go - extract - "Error
rates are increasing substantially as flash manufacturers push the limits of
physics. Errors can be introduced externally by heat or other radiation, during
writes or reads of data, and even to data that was successfully written at a
different time."
- Increasing
Flash SSD Reliability - this classic article by SiliconSystems (published in
2005) remains a good read today. Here's the original editor's intro:-
SSDs,
based on flash technology, have greatly improved in performance in recent years
and now compete head to head with RAM based accelerator systems. Flash also has
significant advantages in servers compared to RAM SSDs due to low power
consumption. But if you think that all solid state disks which use flash are
equally reliable and enduring then think again. That's a bit like saying that
a Mercedes 300SL sports coupe is as tough as a Tiger tank because both were
made in Germany and both are built out of metal. But as Oddball (Donald
Sutherland) says in the movie
Kelly's
Heroes "I ain't messing with no Tigers." This article by
SiliconSystems,
shows how their patented architecture cleverly manages the wear out mechanisms
inherent in all flash media to deliver a disk lifetime that is about 4 times
greater than that of other enterprise flash products and up to 100 times greater than
intrinsic flash memory.
- Consistency
Groups: The Trouble with Stand-alone SSDs - by Woody Hutsell (published
March 2011) discusses different approaches to maintaining data in SSDs in the
event of an SSD failure. Some approaches - while simple to implement - can have
a large negative impact on performance.
- Flash Solid
State Disks - Inferior Technology or Closet Superstar? - this is another
classic article by BiTMICRO
(published February 2004).
This article was one of many pioneering
communications from the flash SSD market to get users to think about flash in a
different way. Its main message was - "A general perception in the
computing industry is that only DRAM is robust enough for enterprise use. That
sentiment doesn't give enough credit to flash memory."
- Flash
Memory Failure Analysis (ppt) - (published November 2007) by Purnima Vuggina,
Intel - outlines
common physical causes of failure in flash memory. It also describes the
problems of failure analysis, and future challenges.
Most of the papers above talk about flash, because that's the new
technology coming into the datacenter. But don't go away with the idea that
RAM SSD arrays don't
have data corruption modes too. The difference may be that some long established
vendors in this part of the market have been designing products which mitigate
these risk factors. But that doesn't mean to say that new market entrants know
what they should be doing.
Even big oems can make elementary
mistakes which cost billions of dollars of lost sales - as I described in my
2001 article
Looking
Back on Sun's Cache Memory Problem.
"While RAM can
be made insensible to soft errors in many different ways (by design or by
software) NVMs are also susceptible to irradiation errors... The lack of any
refresh cycle of the stored information make flash memories vulnerable to data
loss at each exposure to ionizing radiation even at the amounts which occur at
sea level and in terrestrial environments."
...Emanuele
Verrelli and Dimitris
Tsoukalas, in their chapter called Radiation Hardness of Flash and
Nanoparticle Memories - in the multi-author free online book Flash Memories
- published in September 2011 by InTech
 |
The power grid had been taken out by a
falling tree, and the hurricane force wind was too fast to safely operate the
turbine. The standby generator had run out of gas and the batteries of the PV
array and his notebook had finally run flat. So
Megabyte was writing
his next SSD article using candle-light while waiting for the logs in the CHP
(combined heat and power) burner to get hot enough to generate some steam. Just
another regular night at the editorial office.
The first phase in the
SSD market revolution was when users became aware of the potential
benefits of
SSDs and when these products reached price points many of them could afford.
The next phase will be when enterprise users move away from a technology
focused market (which is what they are being offered by vendors now) towards
an applications
specific SSD market in which they have to choose which products work
best for their own
specific deployments.
Users today are faced with the dilemma of paying
vastly different price
points for products which are superficially similar from the capacity and
IOPS point of view - but which may be vastly different in data reliability.
By poor "data reliability" I don't mean that the SSD has
failed - but that some data within the SSD array has been altered or corrupted.
(And the array will continue accumulating data corruptions even if you swap in new
replacement drives of the same type.)
The cost of data corruption
varies from one application, and from one business, to another.
Balancing risk against cost is a decision users make when they choose
a supplier - even if they have not consciously analyzed the issues which matter.
And choosing a more expensive supplier doesn't protect the user from being
mis-sold the wrong type of product.
Many mistakes will be made by
vendors and users.
For the next phase in the SSD market revolution to
keep its momentum, users need guidance they can trust to help them navigate the
many complex decisions which go beyond performance speedup or power saving
considerations.
see also:-
How Bad is - Choosing
the Wrong SSD Supplier?
bath tub curve is not
the most useful way of thinking about PCIe SSD failures - according to a
large scale study within Facebook
Editor:- June 15, 2015 - A recently published
research study -
Large-Scale
Study of Flash Memory Failures in the Field (pdf) - which analyzed
failure rates of PCIe
SSDs used in Facebook's infrastructure over a 4 year period - yields some
very useful insights into the user experience of large populations of
enterprise flash. Among the many findings:-
- Read disturbance errors - seem to be very well managed in the enterprise SSDs
studied.
The authors said they "did not observe a statistically
significant difference in the failure rate between SSDs that have read the
most amount of data versus those that have read the least amount of data."
- Higher operational temperatures mostly led to increased failure rates,
but the effect was more pronounced for SSDs which didn't use aggressive data
throttling techniques - which can prevent runaway temperatures by
throttling back write performance.
- More data written by the hosts to the SSDs over time - mostly resulted in
more failures - but the authors noted that in some of the platforms studied -
more data written resulted in lower failure rates.
This was
attributed to the fact that some SSD software implementations work better at
reducing write amplification when they are exposed to more workload patterns.
- Unlike the classic bathtub curve failure model which applies to hard drives
- SSDs can be characterized as having an early warning phase - which comes
before an early failure weed out phase of the worst drives in the population
and which precedes the onset of predicted endurance based wearout.
In
this aspect - a small percentage of rogue SSDs account for a disproportionately
high percentage of the total data errors in the population.
The report contains plenty of raw data and graphs
which can be a valuable resource for SSD designers and software writers to
help them understand how they can tailor their efforts towards achieving more
reliable operation. ...read
the article (pdf)
Surviving SSD
sudden power loss
Why should you care
what happens in an SSD when the power goes down?
This important design
feature - which barely rates a mention in most SSD datasheets and press releases
- has a strong impact on
SSD data integrity
and operational
reliability.
This article will help you understand why some
SSDs (which work perfectly well in one type of application) might fail in
others... even when the changes in the operational environment appear to be
negligible.