FITs, reliability and
abstraction levels in modeling SSDs
and why data architecture
supersedes component based analysis
by Zsolt Kerekes,
editor - June 20, 2012
A reader contacted me recently to say he
was worried about the viability and
reliability of
large arrays of SSDs
as used in large
enterprises.
His email included detailed calculations about FITs
(failures in time) related to specialized components in the
SSD power
management circuits.
It was clear that he knew a lot more than me
about
reliability at the
electronic component and module level - but I felt in my bones that his
extrapolated conclusions were wrong. What was the best way for me to deal with
that?
After an interactive email dialog which I won't repeat here
because it would fill up too many pages - everything was happily resolved.
I think he was worrying too much because he was extrapolating from a
component level view of SSDs, which was not the appropriate level of modeling
SSD behavior for answering his system related concerns. And
that's something I've noticed before in reader questions - although in other
contexts.
One of the fascinating things for me when I talk to people
who really know enterprise SSD design (like company founders or presidents) is
how they don't spend long staying on the initial subject of whatever it was we
started talking about.
One minute we're in the silicon, then we're
fixing a data integrity problem with the host interface, then we see a possible
bottleneck in a hardware controller, then we've solved that by doing something
in the software stack or splitting it into another piece of specialized silicon.
Then another problem is fixed by how these new SSDs can talk across different
servers. And what's the best way of looking at the data? - blocks or files or
apps? What's the best way to make the SSD legacy neutral? What's the best way to
amplify the potential of SSDs by introducing some new APIs?
True
enterprise SSD architects are happy bouncing around at different levels of
abstraction. And even though each of these is complicated enough by itself - the
best way to fix technology problems is to not spend too much time staring at
the same place on the mental whiteboard - but hop across and invent another
solution in a different place. The market is buying those solutions now - so
it's worth creating them.
That's what makes it hard to predict what
will happen next in the SSD market. The recent history of this market has
demonstrated many times already that a technological
dead end (as
predicted by academic researchers) - or something which an
analyst says won't
happen for a long time (including
me and me
again) is announced
as already working and sampling in next week's press release.
That's
why we enjoy spending so much time reading about this crazy SSD market.
Going
back to where I started - with FITs at the SSD component level versus fault
tolerance at the SSD system level - I realized our disagreement was due to the
different perspective of looking at an SSD as an electronic component compared
to an SSD as a data systems device.
This is a summary of what I told my
reader - who was concerned about SSD FITs at the scale of hundreds or
thousands of SSDs.
The most unreliable thing in most SSDs is the
flash memory, which in
modern devices can start out with 10% defective cells and would transition to
unusability within weeks if it weren't for some level of reliability
architecture.
As long as the reliability architecture can activate redundant or
hot standby elements faster than failures occur, many different
levels of FIT at the single module level can be economic using different
designs.
What I mean by that is that you can achieve the same high
availability of data at the SSD enterprise level by using a variety of
different approaches:-
- a large array of inexpensive and not very reliable SSDs - with a suitable
cloud / large controller architecture high availability wrapper, or
- a small array of expensive and intrinsically more reliable SSDs - with a
simple small controller architecture HA wrapper, or
- a spectrum of solutions in between the above 2
The above
solutions will have different characteristics with respect to performance, size,
cost, electrical power - etc because of their intrinsic components - but you
can design fault tolerant SSD arrays for enterprise apps in a myriad of
different ways - irrespective of the MTBF of the SSDs inside the array - so long
as you can recover, migrate, and transfer the data to enough working elements
fast enough.
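To make that trade-off concrete, here is a minimal Python sketch of the arithmetic. It is not taken from my reader's calculations. It assumes independent drives with a constant failure rate, treats the year as a series of back-to-back rebuild windows, and counts data as lost only when more drives fail inside one window than the wrapper can tolerate. The function names, drive counts, FIT values and rebuild times are all hypothetical, chosen only for illustration. The one hard fact in it is the unit: 1 FIT is one failure per billion device-hours.

```python
import math

HOURS_PER_YEAR = 8760.0


def fit_to_rate(fit):
    """Convert FIT (failures per 1e9 device-hours) to failures per hour."""
    return fit / 1e9


def window_loss_probability(n_drives, fit, rebuild_hours, tolerated):
    """Chance that more than `tolerated` drives fail inside one rebuild window.

    Assumes independent drives with a constant failure rate (illustrative only).
    """
    # Per-drive chance of failing during the window: 1 - exp(-rate * hours).
    p = -math.expm1(-fit_to_rate(fit) * rebuild_hours)
    # Sum the binomial tail directly to avoid rounding error near 1.0.
    return sum(
        math.comb(n_drives, k) * p**k * (1.0 - p) ** (n_drives - k)
        for k in range(tolerated + 1, n_drives + 1)
    )


def annual_loss_probability(n_drives, fit, rebuild_hours, tolerated):
    """Approximate yearly data-loss chance over back-to-back rebuild windows."""
    windows = HOURS_PER_YEAR / rebuild_hours
    p_loss = window_loss_probability(n_drives, fit, rebuild_hours, tolerated)
    # 1 - (1 - p_loss) ** windows, written to stay accurate for tiny p_loss.
    return -math.expm1(windows * math.log1p(-p_loss))


# Hypothetical configurations, not measurements of any real product.
cheap_large = annual_loss_probability(
    n_drives=48, fit=5000, rebuild_hours=2.0, tolerated=3)
reliable_small = annual_loss_probability(
    n_drives=8, fit=500, rebuild_hours=2.0, tolerated=1)
slow_rebuild = annual_loss_probability(
    n_drives=48, fit=5000, rebuild_hours=24.0, tolerated=3)

print(f"large array, cheap SSDs, fast rebuild: {cheap_large:.3e}")
print(f"small array, reliable SSDs:            {reliable_small:.3e}")
print(f"large array, cheap SSDs, slow rebuild: {slow_rebuild:.3e}")
```

In this simplified model the per-drive FIT is only one input. The depth of redundancy and the rebuild time matter just as much, and stretching the rebuild window on the same hardware makes the result worse. A real design would also have to account for correlated failures, which this sketch ignores.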
That's in stark contrast to the analysis for
applications which only have a single SSD - for which the component based MTBF
analysis methods are valid.
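For the single SSD case, the component based method is usually a sum of FITs. The sketch below is a rough illustration of that, assuming every component is in series (any one failure kills the drive) and failure rates are constant. The component names and FIT values are hypothetical placeholders, not figures from my reader or from any datasheet.

```python
import math


def series_fit(component_fits):
    """Total FIT of a series system: the component FITs simply add."""
    return sum(component_fits.values())


def mtbf_hours(total_fit):
    """MTBF in hours for a constant failure rate given in FIT."""
    return 1e9 / total_fit


def survival_probability(total_fit, hours):
    """Chance of running `hours` without failure at a constant failure rate."""
    return math.exp(-hours * total_fit / 1e9)


# Hypothetical FIT budget for one SSD (illustrative values only).
ssd_components = {
    "controller": 200.0,
    "dc_dc_converter": 50.0,
    "dram": 100.0,
    "flash_packages": 650.0,
}

total = series_fit(ssd_components)
print(f"total FIT: {total:.0f}")
print(f"MTBF:      {mtbf_hours(total):,.0f} hours")
print(f"5 year survival: {survival_probability(total, 5 * 8760):.3f}")
```

With nothing to fail over to, every component's FIT counts directly against the drive, which is why this style of analysis fits the single SSD case and not the array.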
Having said that - if you look inside the
best industrial
/ mil / enterprise designs, each SSD drive is actually a complex storage
system - which is more complex than most HDD RAID boxes.
By this time I had got to know my reader better - his company
Enpirion is in the SSD ecosystem as a
supplier of PowerSoC DC to DC converters to SSD OEMs - and he sent me a
pdf which shows
some of the SSDs which use these components from his company. That's
interesting if you like seeing photos of what's inside SSDs.
I asked -
what prompted him to contact me?
He said it was something I had
previously said - "The power management system is actually the most
important part of the SSD which governs reliability. But many digital systems
designers don't give it the scrutiny it deserves."
It often happens that readers, when blogging or emailing, pick out better
quotes from my
articles than I do myself when cross linking them. And then I quietly
change my own links to learn from my readers where the true value really was.
see also:- the
SSD Heresies - Why
can't SSD's true believers agree on a single vision for the future of solid
state storage?