FITs, reliability and abstraction levels in modeling SSDs

and why data architecture supersedes component based analysis

by Zsolt Kerekes, editor - June 20, 2012
A reader contacted me recently to say he was worried about the viability and reliability of large arrays of SSDs as used in large enterprises.

His email included detailed calculations about FITs (failures in time) relating to specialized components in the SSD power management circuits.
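For readers who haven't worked with the unit: 1 FIT is one expected failure per billion device-hours, and for components in series (where any single failure takes down the module) the FITs simply add. A minimal sketch - the component FIT values below are invented for illustration, not taken from any datasheet:

```python
# FIT (failures in time) = expected failures per 10^9 device-hours.
# For components in series, FIT rates add; MTBF is the reciprocal rate.
# The component FIT values here are assumptions for illustration only.

def series_fit(component_fits):
    """Total FIT of a module whose components are all single points of failure."""
    return sum(component_fits)

def mtbf_hours(fit):
    """Convert a total FIT rate to mean time between failures in hours."""
    return 1e9 / fit

# hypothetical SSD module: controller, DRAM, power management IC, flash package
fits = [50, 20, 10, 120]
total = series_fit(fits)
print(total, mtbf_hours(total))  # 200 FIT -> 5,000,000 hours MTBF
```

This is the kind of bottom-up arithmetic my reader was doing - correct at the module level, but (as discussed below) not the whole story at the system level.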

It was clear that he knew a lot more than me about reliability at the electronic component and module level - but I felt in my bones that his extrapolated conclusions were wrong. What was the best way for me to deal with that?

After an interactive email dialog which I won't repeat here because it would fill up too many pages - everything was happily resolved.

I think he was worrying too much because he was extrapolating from a view of SSDs which was not at the appropriate level of abstraction to answer his system related concerns. And that's something I've noticed before in reader questions - although in other contexts.

One of the fascinating things for me when I talk to people who really know enterprise SSD design (like company founders or presidents) is how they don't spend long staying on the initial subject of whatever it was we started talking about.

One minute we're in the silicon, then we're fixing a data integrity problem with the host interface, then we see a possible bottleneck in a hardware controller, then we've solved that by doing something in the software stack or splitting it into another piece of specialized silicon. Then another problem is fixed by how these new SSDs can talk across different servers. And what's the best way of looking at the data? - blocks or files or apps? What's the best way to make the SSD legacy neutral? What's the best way to amplify the potential of SSDs by introducing some new APIs?

True enterprise SSD architects are happy bouncing around at different levels of abstraction. And even though each of these is complicated enough by itself - the best way to fix technology problems is to not spend too much time staring at the same place on the mental whiteboard - but hop across and invent another solution in a different place. The market is buying those solutions now - so it's worth creating them.

That's what makes it hard to predict what will happen next in the SSD market. The recent history of this market has demonstrated many times already that a technological dead end (as predicted by academic researchers) - or something which an analyst says won't happen for a long time (including me and me again) is announced as already working and sampling in next week's press release.

That's why we enjoy spending so much time reading about this crazy SSD market.

Going back to where I started - FITs at the SSD component level versus fault tolerance at the SSD system level - I realized the disagreement was due to the different perspectives of looking at an SSD as an electronic component compared to an SSD as a data systems device.

This is a summary of what I told my reader - who was concerned about SSD FITs at the scale of hundreds or thousands of SSDs.

The most unreliable thing in most SSDs is the flash memory - which in modern devices can start out with 10% defective cells and would transition to unusability within weeks were it not for some level of reliability architecture.

As long as the reliability architecture can activate redundant or hot standby elements faster than failures occur, many different FIT rates at the single module level can be economic - using different designs.

What I mean by that is that you can achieve the same high availability of data at the enterprise SSD level by using a variety of different approaches:-
  • a small array of expensive and intrinsically more reliable SSDs - with a simple small controller architecture HA wrapper, or
  • a larger array of cheaper, intrinsically less reliable SSDs - wrapped in a more sophisticated fault tolerant controller architecture, or
  • a spectrum of solutions in between the above 2
The above solutions will have different characteristics with respect to performance, size, cost, electrical power - etc - because of their intrinsic components. But you can design fault tolerant SSD arrays for enterprise apps in a myriad of different ways - irrespective of the MTBF of the SSDs inside the array - so long as you can recover, migrate, and transfer the data to enough working elements fast enough.
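To make that concrete - here's a back-of-envelope sketch using the classic single-parity MTTDL (mean time to data loss) approximation. All the MTBF and rebuild-time numbers below are invented for illustration; the point is that a larger population of cheaper, less reliable SSDs with fast automatic rebuild can land at the same array-level figure as a small population of premium SSDs:

```python
# Array-level data-loss rate depends on rebuild (repair) speed as much as
# on per-SSD MTBF. Classic single-parity approximation:
#     MTTDL ~ MTBF^2 / (N * (N-1) * MTTR)
# All numbers here are illustrative assumptions, not vendor data.

def mttdl_single_parity(mtbf_h, n_drives, mttr_h):
    """Mean time to data loss (hours) for an N-drive, single-parity array."""
    return mtbf_h**2 / (n_drives * (n_drives - 1) * mttr_h)

# small array of expensive, intrinsically reliable SSDs, slow manual repair
reliable = mttdl_single_parity(mtbf_h=2_000_000, n_drives=4, mttr_h=24)

# bigger array of cheaper SSDs, but a fast automatic rebuild onto hot spares
cheap = mttdl_single_parity(mtbf_h=500_000, n_drives=10, mttr_h=0.2)

print(f"{reliable:.3g} h vs {cheap:.3g} h")  # both come out at ~1.4e10 hours
```

Two very different component-level designs, the same system-level availability - which is why component FIT analysis alone can't settle the array question.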

That's in stark contrast to the analysis for applications which only have a single SSD - for which the component based MTBF analysis methods are valid.

Having said that - if you look inside the best industrial / mil / enterprise designs, each SSD drive is actually a complex storage system - more complex than most HDD RAID boxes.

By this time I had got to know my reader better - his company Enpirion is in the SSD ecosystem as a supplier of PowerSoC DC to DC converters to SSD oems - and he sent me a pdf which shows some of the SSDs which use these components from his company. That's interesting if you like seeing photos of what's inside SSDs.

I asked - what prompted him to contact me?

He said it was something I had previously said - "The power management system is actually the most important part of the SSD which governs reliability. But many digital systems designers don't give it the scrutiny it deserves."

It often happens that readers - when blogging or emailing - pick out better quotes from my articles than I do myself when cross linking them. And then I quietly change my own links - to learn from my readers where the true value really was.


the bathtub curve is not the most useful way of thinking about PCIe SSD failures - according to a large scale study within Facebook
Editor:- June 15, 2015 - A recently published research study - Large-Scale Study of Flash Memory Failures in the Field (pdf) - which analyzed failure rates of PCIe SSDs used in Facebook's infrastructure over a 4 year period - yields some very useful insights into the user experience of large populations of enterprise flash.

Unlike the classic bathtub curve failure model which applies to hard drives - SSDs can be characterized as having an early warning phase - which comes before an early failure weed out phase of the worst drives in the population - and which precedes the onset of predicted endurance based wearout.

Notably - a small percentage of rogue SSDs accounts for a disproportionately high percentage of the total data errors in the population.
The report contains plenty of raw data and graphs which can be a valuable resource for SSD designers and software writers to help them understand how they can tailor their efforts towards achieving more reliable operation. the article (pdf)
re MLC

how does NV become V?

NV + 0.4 DWPD @ 85C = V
Editor:- October 24, 2014 - Even a modest amount of drive writes per day can render modern day MLC flash incapable of retaining data for long in the unpowered state - depending on the temperature in the rack where those writes took place. This effectively means that the flash inside the SSD is no longer "non volatile".

The physics behind this is revealed in a blog by Virtium - a company which operates in the industrial market - and which does a lot of work characterizing memories for use in SSDs and other memory systems. They can leverage that knowledge for customers by adjusting controller and firmware characteristics to optimize the memory's life and data integrity - particularly if it is known in advance what proportion of time the embedded SSD is likely to be operating at particular temperatures.

Virtium's paper - temperature considerations in SSDs (pdf) includes some stark graphs and observations about data retention - which you should be aware of - even if you're not in the industrial market.

Virtium's paper says - "This shows the dramatic effects that temperature has on data retention for given workloads.

"For the same 750 full drive writes (0.4 DWPD drive writes per day for 5 years), SSDs operated and stored at 85C will only have 2 days of data retention, whereas those drives at 40C will have 1 year and those at room temperature 25C will exhibit characteristics of nearly 8 years of data retention." the article (pdf)


Size matters in SSD controller architecture

In the small architecture model - the controller designer does the best job he can to optimize the performance and reliability of the individual SSD.

That's all that can be done - because it's sold as a single unit and has to work on its own.

When another designer comes along and puts a bunch of these COTS SSDs into an array - these selfsame small architecture SSDs become a mere component inside someone else's next level up controller software.