FITs, reliability and abstraction levels in modeling SSDs

and why data architecture supersedes component-based analysis


by Zsolt Kerekes, editor - June 20, 2012
A reader contacted me recently to say he was worried about the viability and reliability of large arrays of SSDs as used in large enterprises.

His email included detailed calculations about FITs (failures in time) relating to specialized components in the SSD power management circuits.
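For readers new to the unit: 1 FIT is 1 failure per billion (10^9) device-hours, and in a simple series reliability model - where any single component failure fails the SSD - component FIT rates just add. A minimal sketch in Python, with invented FIT values purely for illustration:

    # FIT = failures per 10^9 device-hours. In a series reliability model
    # (any component failure fails the SSD) component FIT rates simply add.
    # All FIT values below are invented for illustration.
    component_fits = {
        "flash array": 400,
        "controller": 50,
        "DRAM buffer": 30,
        "power management": 20,
    }

    total_fit = sum(component_fits.values())  # failures per 1e9 device-hours
    mtbf_hours = 1e9 / total_fit              # implied device-level MTBF

    print(f"total: {total_fit} FITs -> MTBF {mtbf_hours:,.0f} hours "
          f"(~{mtbf_hours / 8760:.0f} years)")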

It was clear that he knew a lot (more than me) about reliability at the electronic component and module level - but I felt in my bones that his extrapolated conclusions were wrong. What was the best way for me to deal with that?

After an interactive email dialog which I won't repeat here because it would fill up too many pages - everything was happily resolved.

I think he was worrying too much because he was extrapolating from a view of SSDs which wasn't modeling their behavior at the right level of abstraction to answer his system related concerns. And that's something I've noticed before in reader questions - although in other contexts.

One of the fascinating things for me when I talk to people who really know enterprise SSD design (like company founders or presidents) is how they don't spend long staying on the initial subject of whatever it was we started talking about.

One minute we're in the silicon, then we're fixing a data integrity problem with the host interface, then we see a possible bottleneck in a hardware controller, then we've solved that by doing something in the software stack or splitting it into another piece of specialized silicon. Then another problem is fixed by how these new SSDs can talk across different servers. And what's the best way of looking at the data? - blocks or files or apps? What's the best way to make the SSD legacy neutral? What's the best way to amplify the potential of SSDs by introducing some new APIs?

True enterprise SSD architects are happy bouncing around at different levels of abstraction. And even though each of these is complicated enough by itself - the best way to fix technology problems is to not spend too much time staring at the same place on the mental whiteboard - but hop across and invent another solution in a different place. The market is buying those solutions now - so it's worth creating them.

That's what makes it hard to predict what will happen next in the SSD market. The recent history of this market has demonstrated many times already that a technological dead end (as predicted by academic researchers) - or something which an analyst says won't happen for a long time (including me and me again) is announced as already working and sampling in next week's press release.

That's why we enjoy spending so much time reading about this crazy SSD market.

Going back to where I started - FITs at the SSD component level versus fault tolerance at the SSD system level - I realized the disagreement came from the different perspectives of looking at an SSD as an electronic component versus looking at an SSD as a data systems device.

This is a summary of what I told my reader - who was concerned about SSD FITs at the scale of hundreds or thousands of SSDs.

The most unreliable thing in most SSDs is the flash memory - which in modern devices can start out with 10% defective cells and would transition to unusability within weeks were it not for some level of reliability architecture.
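As a toy illustration of what that reliability architecture does at the lowest level - grossly simplified, since a real controller also layers ECC and wear leveling on top, and the class and numbers here are invented:

    # Toy model: an SSD stays usable for as long as its controller can remap
    # failing physical blocks onto spares faster than they wear out.
    class ToyFTL:
        """Minimal flash translation layer: logical -> physical block map."""
        def __init__(self, user_blocks: int, spare_blocks: int):
            self.map = {lba: lba for lba in range(user_blocks)}
            self.spares = list(range(user_blocks, user_blocks + spare_blocks))

        def on_block_failure(self, lba: int) -> bool:
            """Remap a failed block to a spare. False = spares exhausted."""
            if not self.spares:
                return False  # no spares left - the device now fails visibly
            self.map[lba] = self.spares.pop()
            return True

    ftl = ToyFTL(user_blocks=1000, spare_blocks=100)  # 10% over-provisioning
    print(ftl.on_block_failure(42))  # True - the failure is masked from the host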

As long as the reliability architecture can activate redundant or hot standby elements faster than failures occur, many different FIT rates at the single module level can be economic - using different designs.

What I mean by that is that you can achieve the same high availability of data at the enterprise SSD level by using a variety of different approaches:
  • a small array of expensive and intrinsically more reliable SSDs - with a simple, small controller architecture HA wrapper, or
  • a large array of cheap and intrinsically less reliable SSDs - with a complex, big controller architecture HA wrapper, or
  • a spectrum of solutions in between the above 2
The above solutions will have different characteristics with respect to performance, size, cost, electrical power - etc - because of their intrinsic components. But you can design fault tolerant SSD arrays for enterprise apps in a myriad of different ways - irrespective of the MTBF of the SSDs inside the array - so long as you can recover, migrate, and transfer the data to enough working elements fast enough (as the rough sketch below suggests).
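To put rough numbers on that, here's a back-of-envelope sketch using the classic single-parity Markov approximation MTTDL ≈ MTBF² / (N × (N−1) × MTTR). Every figure is invented for illustration - the only point is that fast recovery can compensate for lower per-device MTBF:

    # Mean time to data loss (MTTDL) for an array which tolerates one SSD
    # failure, using the approximation MTTDL ~= MTBF**2 / (N * (N-1) * MTTR).
    # All numbers are invented for illustration.
    def mttdl_single_parity(n_drives: int, mtbf_h: float, mttr_h: float) -> float:
        return mtbf_h ** 2 / (n_drives * (n_drives - 1) * mttr_h)

    # A few expensive, intrinsically reliable SSDs; day-long rebuild:
    premium = mttdl_single_parity(n_drives=4, mtbf_h=2_000_000, mttr_h=24)

    # Many cheap SSDs; hot spares cut recovery to minutes:
    commodity = mttdl_single_parity(n_drives=16, mtbf_h=500_000, mttr_h=0.1)

    print(f"premium array:   {premium:,.0f} hours to data loss")
    print(f"commodity array: {commodity:,.0f} hours to data loss")
    # Both land in the same ~1e10 hour ballpark despite a 4x gap in MTBF.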

That's in stark contrast to the analysis for applications which only have a single SSD - for which the component based MTBF analysis methods are valid.

Having said that - if you look inside the best industrial / mil / enterprise designs, each SSD drive is actually a complex storage system in its own right - more complex than most HDD RAID boxes.

By this time I had got to know my reader better - his company, Enpirion, is in the SSD ecosystem as a supplier of PowerSoC DC-DC converters to SSD OEMs - and he sent me a pdf which shows some of the SSDs which use his company's components. That's interesting if you like seeing photos of what's inside SSDs.

I asked - what prompted him to contact me?

He said it was something I had previously said - "The power management system is actually the most important part of the SSD which governs reliability. But many digital systems designers don't give it the scrutiny it deserves."

It often happens that readers - when blogging or emailing - pick out better quotes from my articles than I do myself when cross-linking them. And then I quietly change my own links, learning from my readers where the true value really was.

more SSD articles

the SSD Heresies - Why can't SSD's true believers agree on a single vision for the future of solid state storage?

SSD utilization ratios and the software event horizon - How will there be enough production capacity of flash memory to replace all enterprise hard drives?

Efficiency as internecine SSD competitive advantage - why do some SSD arrays use twice as many chips to deliver identical products?
"To avoid obsolescence in military systems, the design team must ensure that the die will perform at extreme temperatures and conditions. Therefore data from external silicon manufacturers isn't assumed to be dependable and instead parts are diligently characterized in sufficient quantities over a wide temperature range."
Michael Flatley, Product Application Manager, Microsemi in his blog - Solve obsolescence problems before they start (September 2013)
Surviving SSD sudden power loss
Why should you care what happens in an SSD when the power goes down?

This important design feature - which barely rates a mention in most SSD datasheets and press releases - has a strong impact on SSD data integrity and operational reliability.

This article will help you understand why some SSDs (which work perfectly well in one type of application) might fail in others... even when the changes in the operational environment appear to be negligible. If you thought endurance was the end of the SSD reliability story - think again. ...read the article - SSD power down architectures and characteristics
Cyclic Redundancy Check (CRC), which provides "end-to-end" protection, can only identify that an error has occurred. It cannot correct it, but it does prevent "silent data corruption."
Data Integrity Challenges in flash SSD Design
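A tiny illustration of that distinction, using Python's standard zlib.crc32 (the block contents are invented):

    import zlib

    block = b"payload written by the host"
    crc_at_write = zlib.crc32(block)

    corrupted = b"payload wrItten by the host"    # one character flipped in flight
    assert zlib.crc32(corrupted) != crc_at_write  # the error is detected...
    # ...but the CRC says nothing about which byte to repair: detection only.
    # The value is that the error gets reported instead of silently returned.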

In the small architecture model - the controller designer does the best job he can to optimize the performance and reliability of the individual SSD.

That's all that can be done, because it's sold as a single unit and has to work on its own.

When another designer comes along and puts a bunch of these COTS SSDs into an array, then these selfsame small architecture SSDs become mere components inside someone else's next-level-up controller software.
Size matters in SSD controller architecture

Another problem can arise when the original flash chip manufacturer changes something in their process - which doesn't affect the parameters they are testing for - but does change the way the devices look from the SSD data integrity point of view.
bad block management in flash SSDs