|FITs, reliability and abstraction levels in modeling SSDs - and why data architecture supersedes component based analysis|
by Zsolt Kerekes, editor - June 20, 2012
|A reader contacted me recently to say he
was worried about the viability and reliability of
large arrays of SSDs.
His email included detailed calculations about FITs
(failures in time) related to specialized components in the SSDs.
It was clear that he knew a lot more (than me) about
reliability at the
electronic component and module level - but I felt in my bones that his
extrapolated conclusions were wrong. What was the best way for me to deal with
that?
After an interactive email dialog which I won't repeat here -
because it would fill up too many pages - everything was happily resolved.
I think he was worrying too much because he was extrapolating from a
view of SSDs which was not at the appropriate level of modeling SSD
behavior for supplying the right answer to his system related concerns. And
that's something I've noticed before in reader questions - although in other
contexts.
One of the fascinating things for me when I talk to people
who really know enterprise SSD design (like company founders or presidents) is
how they don't spend long staying on the initial subject of whatever it was we
started talking about.
One minute we're in the silicon, then we're
fixing a data integrity problem with the host interface, then we see a possible
bottleneck in a hardware controller, then we've solved that by doing something
in the software stack or splitting it into another piece of specialized silicon.
Then another problem is fixed by how these new SSDs can talk across different
servers. And what's the best way of looking at the data? - blocks or files or
apps? What's the best way to make the SSD legacy neutral? What's the best way to
amplify the potential of SSDs by introducing some new APIs?
Enterprise SSD architects are happy bouncing around at different levels of
abstraction. And even though each of these is complicated enough by itself - the
best way to fix technology problems is not to spend too much time staring at
the same place on the mental whiteboard - but to hop across and invent another
solution in a different place. The market is buying those solutions now - so
it's worth creating them.
That's what makes it hard to predict what
will happen next in the SSD market. The recent history of this market has
demonstrated many times already that a technological
dead end (as
predicted by academic researchers) - or something which an
analyst says won't
happen for a long time (including
me and me
again) is announced
as already working and sampling in next week's press release.
That's why we enjoy spending so much time reading about this crazy SSD market.
But back to where I started - with FITs at the SSD component level versus fault
tolerance at the SSD system level. I realized this disagreement was due to the different
perspectives of looking at an SSD as an electronic component compared to an SSD
as a data systems device.
This is a summary of what I told my
reader - who was concerned about SSD FITs at the scale of hundreds or
thousands of SSDs.
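FIT figures scale linearly with population size, which is why a component-level number that looks harmless for one drive can look alarming at datacenter scale. A minimal sketch of the standard conversions - the 1000 FIT figure below is a hypothetical example, not a number from this article:

```python
# FIT (failures in time) = expected failures per 10^9 device-hours,
# for a device with a constant failure rate.

HOURS_PER_YEAR = 24 * 365  # 8760

def mtbf_hours(fit: float) -> float:
    """MTBF implied by a constant failure rate expressed in FITs."""
    return 1e9 / fit

def expected_failures_per_year(fit: float, population: int) -> float:
    """Expected failures per year across a population of identical devices."""
    return fit * population * HOURS_PER_YEAR / 1e9

fit = 1000  # hypothetical 1000 FIT device
print(mtbf_hours(fit))                        # 1000000.0 hours MTBF
print(expected_failures_per_year(fit, 1))     # 0.00876 failures/year for 1 SSD
print(expected_failures_per_year(fit, 1000))  # 8.76 failures/year for 1000 SSDs
```

The same device whose failure rate is negligible in a single-SSD application produces several failures a year in a fleet of a thousand - which is exactly why the system-level recovery architecture, not the component FIT, becomes the dominant concern.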
The most unreliable thing in most SSDs is the
flash memory - which in
modern devices can start out with 10% defective cells and would transition to
unusability within weeks if it wasn't for some level of reliability
management. As long as the reliability architecture can activate redundant or
hot standby elements faster than failures occur, there are many different
levels of FIT at the single module level that can be economic - using different
internal architectures.
What I mean by that is that you can achieve the same high
availability of data at the enterprise SSD level by using a variety of
approaches:
- a small array of expensive and intrinsically more reliable SSDs - with a
simple small controller architecture HA wrapper, or
- a bigger array of cheaper, intrinsically less reliable SSDs - with more of
the fault tolerance handled at the array level, or
- a spectrum of solutions in between the above 2.
These solutions will have different characteristics with respect to performance, size,
cost, electrical power - etc - because of their intrinsic components. But you
can design fault
tolerant SSD arrays for enterprise apps in a myriad of different ways -
irrespective of the MTBF of the SSDs inside the array - so long as you can
recover, migrate, and transfer the data to enough working elements fast enough.
That's in stark contrast to the analysis for applications which only have a single
SSD - for which the component based MTBF analysis methods are valid.
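The point that array-level availability depends more on recovery speed than on component MTBF can be illustrated with a crude single-parity estimate: data is lost only if a second drive fails while the first is still being rebuilt. All the drive counts, MTBF and repair-time figures below are hypothetical, chosen only to show the shape of the trade-off:

```python
def mttdl_hours(n_drives: int, mtbf_h: float, mttr_h: float) -> float:
    """Approximate mean time to data loss for a single-parity group:
    loss requires a second failure during the repair window of the first.
    Classic approximation: MTBF^2 / (n * (n-1) * MTTR)."""
    return mtbf_h ** 2 / (n_drives * (n_drives - 1) * mttr_h)

# Config A: a small array of expensive, very reliable SSDs
# with a simple (slow, 24 hour rebuild) HA wrapper.
a = mttdl_hours(n_drives=4, mtbf_h=2_000_000, mttr_h=24)

# Config B: a bigger array of cheaper, less reliable SSDs
# whose architecture can recover the data in 2 hours.
b = mttdl_hours(n_drives=8, mtbf_h=1_000_000, mttr_h=2)

print(f"A: {a:.3g} h, B: {b:.3g} h")  # same order of magnitude
```

Despite drives with half the MTBF and twice the population, config B lands in the same availability ballpark as config A - fast recovery is doing the work that raw component reliability does in the other design.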
Having said that - if you look inside the design of the best SSDs,
each SSD drive is actually a complex storage system - which is more complex
than most HDD RAID boxes.
By this time I had got to know my reader better - his company
Enpirion is in the SSD ecosystem as a
supplier of PowerSoC DC to DC converters to SSD OEMs - and he sent me a
pdf which shows
some of the SSDs which use these components from his company. That's
interesting if you like seeing photos of what's inside SSDs.
I asked -
what prompted him to contact me?
He said it was something I had
previously said - "The power management system is actually the most
important part of the SSD which governs reliability. But many digital systems
designers don't give it the scrutiny it deserves."
It often happens that readers - when blogging or emailing - pick out better
quotes from my
articles than I do myself when cross linking them. And then I quietly
change my own links to learn from my readers where the true value really was.
|sudden power loss|
|Why should you care
what happens in an SSD when the power goes down? |
This important design
feature - which barely rates a mention in most SSD datasheets and press releases
- has a strong impact on
SSD data integrity.
This article will help you understand why some
SSDs (which work perfectly well in one type of application) might fail in
others... even when the changes in the operational environment appear to be
insignificant.
|All it takes is one broken link.|
|The bathtub curve is not the most useful way of thinking about PCIe SSD failures -
according to a large scale study within Facebook|
|Editor:- June 15, 2015 - A recently published
research study -
A Large-Scale Study of Flash Memory Failures in the Field (pdf) - which analyzed
failure rates of PCIe
SSDs used in Facebook's infrastructure over a 4 year period - yields some
very useful insights into the user experience of large populations of
enterprise flash. |
Unlike the classic bathtub curve failure model which
characterizes hard drives - SSDs
can be characterized as having an early warning phase - which comes before an
early failure weed out phase of the worst drives in the population and which
precedes the onset of predicted endurance based wearout. Another notable
aspect - a small percentage of rogue SSDs account for a disproportionately high
percentage of the total data errors in the population.
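The "rogue drives" observation is easy to check against any fleet's telemetry: rank drives by error count and see what share of total errors the worst few contribute. A toy sketch with made-up error counts (not data from the Facebook study):

```python
def top_share(errors: list[int], top_fraction: float = 0.1) -> float:
    """Fraction of all errors contributed by the worst top_fraction of drives."""
    ranked = sorted(errors, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)

# Hypothetical fleet of 100 drives: most log a handful of errors,
# a few rogue drives log vastly more.
fleet = [50] * 90 + [100_000] * 10
print(f"top 10% of drives -> {top_share(fleet):.1%} of all errors")
```

In a fleet with that kind of skew, the worst 10% of drives account for well over 99% of the errors - which is why weeding out (or throttling) a small rogue population moves the fleet-wide numbers far more than improving the average drive does.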
|The report contains plenty of raw data and graphs
which can be a valuable resource for SSD designers and software writers to
help them understand how they can tailor their efforts towards achieving more
reliable operation. ...read
the article (pdf)|
When does NV become V?
NV + 0.4 DWPD @ 85C = V
|Editor:- October 24, 2014 - Even a modest
amount of drive writes per day can render
modern day MLC flash
incapable of retaining data for long in the unpowered state - depending on the
temperature in the rack where those writes took place. This effectively means
that the flash inside the SSD is no longer "non volatile".|
The physics behind this is revealed in a blog by Virtium - a company
which operates in the industrial market - and which does a lot of work
characterizing memories for use in SSDs and other memory systems. They can
leverage that knowledge for customers by adjusting controller and firmware
characteristics to optimize the memory's life and
data integrity -
particularly if it is known in advance what proportion of time the embedded SSD
is likely to be operating at particular temperatures.
Their paper on data retention
considerations in SSDs (pdf) includes some stark graphs and observations
about data retention - which you should be aware of - even if you're not in the
industrial market.
Virtium's paper says - "This shows the dramatic effects that
temperature has on data retention for given workloads.
For the
same 750 full drive writes (0.4 DWPD -
drive writes per day - for 5
years), SSDs operated and stored at 85C will only have 2 days of data
retention, whereas those drives at 40C will have 1 year and those at room
temperature 25C will exhibit characteristics of nearly 8 years of data
retention." ...read
the article (pdf)
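The temperature sensitivity described in that quote is usually modeled with the Arrhenius equation. The activation energy used below (1.1 eV, a commonly quoted figure for charge loss in NAND) is an assumption of mine, not a number from the Virtium paper - but it reproduces the right order of magnitude between the quoted 25C and 85C retention figures:

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_af(t_use_c: float, t_stress_c: float, ea_ev: float = 1.1) -> float:
    """Acceleration factor: how much faster charge loss proceeds at
    t_stress_c than at t_use_c, per the Arrhenius model."""
    t_use = t_use_c + 273.15      # convert Celsius to Kelvin
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / K_BOLTZMANN_EV) * (1 / t_use - 1 / t_stress))

af = arrhenius_af(25, 85)
print(f"85C ages flash ~{af:.0f}x faster than 25C")  # roughly 1300x
# ~8 years at 25C divided by this factor lands near the quoted 2-day figure:
print(f"8 years at 25C -> ~{8 * 365 / af:.1f} days equivalent at 85C")
```

With a ~1300x acceleration factor, the quoted "nearly 8 years at 25C" and "2 days at 85C" are mutually consistent - the steepness of the exponential is why a modest rise in rack temperature matters so much.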
|In the small architecture
model - the controller designer does the best job he can to optimize the
performance and reliability of the individual SSD. |
That's all that
can be done - because it's sold as a single unit and has to work on its own.
When another designer comes along and puts a bunch of these COTS SSDs
into an array then these selfsame small architecture SSDs become a mere
component inside someone else's next level up controller software.
|Size matters in
SSD controller architecture|