about using MLC flash in enterprise SSDs - aka "eMLC" - has moved
on to a new level. The argument is no longer - can MLC be made to work
reliably? Or how
writes are good enough? It's - whose way of doing - so called -
enterprise MLC tastes best?"|
S / e / brand x / T / M /LC - flash wars
in the enterprise SSD market |
Wars - you can't afford
the risk of a bad
enterprise MLC SSD taste test.
MLC flash in enterprise SSDs - past,
present and future
you're unclear about the differences between MLC and SLC - see
use of flash SSDs in enterprise server acceleration has been hotly and seriously
debated here in the pages of StorageSearch.com since about 2004. The notes
below summarize what those past technical issues were and how things look
In 2004 the typical
the flash memory used inside a 2.5" SLC flash SSD was 100K write cycles.
Today the typical flash memory inside a 2.5" MLC SSD is rated at 3,000
cycles (30x worse). And yet in the same period (2004 to 2011) the
R/W speeds of the fastest
2.5" SATA SSDs have improved 15x putting even more
pressure on already strained endurance. The detail of the debates surrounding
enterprise flash SSD has changed over the years. But all the arguments
revolve around the question of - is this SSD going to be reliable enough in my
Back in 2003 all enterprise acceleration SSDs
were RAM SSDs.
case study by BitMicro - showed that a single 3.5" flash SSD could
provide useful speedup in a 25,000 user server - compared to hard disk based
RAID. But flash SSD makers
weren't active in the enterprise market in those days. They sold to oems and
systems designers. It wasn't economic for flash SSD oems to
educate end users
about SSDs, understand the complexities of user applications and configure the
hot spots in a user server - simply to sell a single SSD.
In 2006 -
SSD makers started shipping small form factor flash SSDs in volume aimed at the
In 2007 - many
SSD makers started shipping rackmount products and small form factor SSDs
specifically into the server acceleration market. Those early SSDs were SLC -
which typically had endurance of 100,000 write cycles per block. By selecting
memory chips and processes - some SSD makers claimed their SLC SSDs could last
In 2008 - there
was a temptation for systems integrators to deploy low cost consumer MLC flash
SSDs in enterprise applications. But MLC endurance was 10x worse than
SLC - and consumer SSD controllers couldn't manage MLC reliably in high IOPS
environments. Some customers found out the hard way when their flash arrays
burned out. The conventional wisdom at the time was - don't use consumer MLC
SSDs in caching / accelerator environments.
In 2009 - a new
wave of SSD companies including
SandForce made waves
in the market with fast high IOPS enterprise SSDs which used consumer grade
MLC flash inside. They said - what made the difference - was the intelligent
management of flash risks inside their architectures.
In 2010 -
some leading flash memory chipmakers started marketing so called "enterprise
grade" MLC flash. This was a formal productizing of high endurance
MLC - achieved by factory processes - to achieve similar ends which some SSD
makers had been doing since 2004 with SLC - that is to say selecting the
best of breed flash to cream off batches with 10x better than average
In 2011 - you can find at least 3 different types
of flash memory (SLC, consumer grade MLC and enterprise grade MLC) inside fast
enterprise SSDs such as PCIe
SSDs. And the situation could get even more confusing in future with x3 MLC
and other nv memory types possibly appearing in enterprise 2.5" SSDs in
the next year or so too.
The argument is shifting from which type of
flash memory is best? - to whose SSD controller and flash management scheme do
you believe is best?
That makes it harder to evaluate competing
products and make decisions which are safe without
paying more than
you need to.
For more on what's happening today - see these
in Enterprise SSD arrays
looking at the risks posed by a new
generation of MLC Nand Flash SSDs.
classic article - by Zsolt Kerekes, editor, June
|The original purpose of my
article was to show that you needn't worry about wear-out if you use "best
of breed" flash
SSDs with write-endurance on the order of 1 million cycles and above. |
it was first published (in
all flash SSDs in traditional
hard disk form factors
But in the year following publication many
leading SSD oems
STEC ) have also
introduced MLC products too.
To confuse things even more - in June
2008 - Silicon Motion
announced a new family of flash
SSD controllers which
enable oems to mix and match MLC and SLC chips in the same drive - creating in
MLC doubles the capacity of flash memory by interpreting 4 digital
states in the signal stored in a single cell - instead of the traditional
(binary) 2 digital states.
This technique has been commercialized and
proven over many years in hundreds of millions of cell phones and MP3 / iPod
music players - where the theoretical consequence of data corruption (if
anything went wrong with this risky "new" storage technology) was no
more serious than an inaudible sub millisecond sound blip or invisible pixel
SSD market MLC yields much
lower cost storage than SLC with read / write speeds which are nearly as fast
as the best SLC devices.
The manufacturers of first generation "hard
disk replacement" MLC flash SSDs have responsibly classified them as aimed
at the "notebook
market" and by subtle wording differentiated them from their more
pricey "enterprise" products. In the low duty cycle world of a
notebook these MLC SSDs should give a good operating life - typically similar to
the hard disks they replace. (Most SSD marketers would claim their MTBFs are
even better than HDDs).
But there's no way to tell the difference
between SLC and MLC SSDs externally (apart from the model numbers). Put them in
a rackmount system in a datacenter with fast processors which can pump them
continuously close to the maximum speed and what happens?
|It's a simple matter to plug new data for MLCs
into the calculation I did for the worst case wear-out process for flash SSDs -
which I called the Rogue Data Recorder.|
Instead of the 64GB example
I used then, I'll assume the MLC SSD has 128GB capacity. MLC SSDs have
more capacity than SLC. And more capacity means longer operating life - before
cells wear out.
I'll still use the 80M bytes / sec sustained write
speed - because the fastest MLC products (in Feb 2008) can already do that.
(Meanwhile the fastest SLC products have moved up in the world and are about 50%
The next factor is where we hit the big problem... Instead of
a write endurance rating of 2 million cycles (for the best SLC) - I can only use
a figure of 10,000 for MLC. MLC has a much lower rating due to the complex
interaction of discriminating multiple logic levels reliably coupled with the
intrinsic failure mechanism of wear-out.
Plugging these numbers in the
same calculation gives an estimated MLC flash SSD operating life (at max write
throughput) which is 6 months! (instead of 51 years for a 64GB SLC
That's not good enough for a data driven enterprise. There
isn't a wide enough safety margin.
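The arithmetic behind those two figures can be sketched in a few lines. This is a minimal reconstruction, not the original Rogue Data Recorder calculation itself: it assumes decimal gigabytes and megabytes, perfect wear leveling, no write amplification and nonstop writing at the sustained rate. The function name `worst_case_life_years` is made up for this illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
GB = 10**9  # decimal gigabyte (assumption)
MB = 10**6  # decimal megabyte (assumption)


def worst_case_life_years(capacity_bytes, endurance_cycles, write_bytes_per_sec):
    """Years until every cell has used its rated cycles.

    Assumes perfect wear leveling and continuous writing at the
    sustained rate, with no write amplification.
    """
    total_writable_bytes = capacity_bytes * endurance_cycles
    return total_writable_bytes / write_bytes_per_sec / SECONDS_PER_YEAR


# Figures taken from the article text.
slc_years = worst_case_life_years(64 * GB, 2_000_000, 80 * MB)
mlc_years = worst_case_life_years(128 * GB, 10_000, 80 * MB)

print(f"64GB SLC: {slc_years:.0f} years")
print(f"128GB MLC: {mlc_years * 12:.0f} months")
```

Under these assumptions the sketch gives about 51 years for the 64GB SLC case and about 6 months for the 128GB MLC case. Doubling the capacity helps, but it is swamped by the 200x drop in rated endurance.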
Proponents of MLC might say - can't
you batch select MLC chips for better write endurance in the same way that some
oems do for SLC wear out? - Couldn't that give a figure that is 10x better?
There's not enough data to give a definitive answer - but I suspect
the answer would be no!
The reason is that you would be selecting for
the mutual inclusion of a single chip being inside 2 different probability
curves for what are already secondary characteristics. (Like looking for the
ideal man in
Sex and the
City.) Even in the unlikely event that you could find some devices with the
magic properties to do this - the yield would be small - pushing the cost up
and eliminating the main reason for using MLC.
That's where I thought
this "SLC versus MLC in enterprise SSDs" discussion would end. But
then another factor appeared out of the blue.
Sam Anderson at
EasyCo pointed out to me
that one side effect of their patent pending Managed
Flash Technology is that their software "effectively erases erase
blocks 10 to 100 times less frequently than drives doing traditional
random writes" because it writes address blocks monotonically.
MFT was originally designed to give much faster system IOPS in flash SSD
arrays by using patent pending write algorithms which manage arrays of standard
SSDs in a way which reduces the probability of successive writes to an address
block which is already busy in a time consuming erase/write cycle.
new (to me) attribute of MFT opens up the possibility of yet another generation
of high speed rackmount SSDs with new price points which could be 50x lower
RAM SSDs while being
only 3x slower overall in typical applications.
Some of the papers
listed in the footnotes below cover topics such as Data Retention (which
gets worse for blocks which have been more frequently erased), and Disturbances
(caused by adjacent R/W operations) - all of which are much more significant
issues for MLC compared to SLC.
I can't give
a definitive answer to the question - Are MLC SSDs Ever Safe in Enterprise
With the current state of technology in 2008 - it depends on the
application and the consequences of data corruption.
I wouldn't risk it
if I were a bank - but I might not mind if my own bank risked it and changed
some pluses to minuses...
Seriously though I hope this article has
shown that there are serious risks inherent in using MLC flash SSDs if they
are not applied correctly.
Some of these risks can be managed by
choosing an SSD array supplier who has qualified and tested their racks with
products from a single known source (because every make of MLC flash SSD has
its own unique failure profile).
I know that despite my warnings - MLC
flash SSDs will get used in some enterprise apps - because the cost
difference (compared to other options) is very attractive.
In my view
using an MLC flash SSD array for an enterprise application without at least
using the (claimed) wear-out mitigating effects of a technology like EasyCo's
MFT is like jumping out of a plane without a parachute.
And even with
a parachute - strange things may still happen to wannabe MLC SSD enterprise
pioneers on the way down.
More Articles About Flash SSD Data Integrity
Can you trust your flash
Flash Solid State Disk Reliability
SSD Myths and
Legends - "write endurance"
Challenges in flash SSD Design
CompactFlash Really Created Equal? (pdf)
Flash Disk Reliability
Begins at the IC Level (pdf)
vs. MLC: An Analysis of Flash Memory (pdf)
Inconvenient Truths of NAND Flash Memory (pdf)
State Disk Write Endurance in Database Environments
Unveiling XLC Flash SSD
Technology - spoof article on x4 MLC
|Yes you can! - swiftly
sort the Enterprise SSD buckets|
|If you're trying to
create your first short list of vendors to talk to about how to speed up your
enterprise apps using SSDs - you realize now - with a sinking feeling in your
gut - that maybe delaying the decision for the past several years wasn't such a
good idea after all.|
Because the range of technologies and design
approaches is now so bewildering that you envy your peers in other (richer)
companies who started down the SSD track when the range of solutions was so
Your problem today isn't just that vendors don't seem
to agree about where the best place is to put the SSD or what memory should be
inside it (something I've written about in the
problem is that even when you try to narrow down SSDs to a single interface -
the competing SSD vendors tell a very different story about what their
products will do for you and how much they will
cost. And this
confusing picture isn't simply down to
SSD jargon - which
is bad enough - but you're getting the hang of it. There's something tangibly
different lurking behind those shadowy SSD vendor promises - but you can't
quite put your finger on what it is.
Is there a simple methodology
which - starting from the very first press release you see on the web -
reliably helps you classify all enterprise SSD products - to create
2 distinct groups.
- the SSDs you're not interested in
without the risk that
you may miss out the best choice for your situation - and without having to read
hundreds of articles and reviews?
- the SSDs that might be worth a closer look
||Spellabyte was preparing a new
magic brew for enterprise flash.
|nice vs naughty flash
The arguments about flash in enterprise SSD accelerators
have changed since this trend started in
First you learned about SLC (good flash).
Then you learned
about MLC (naughty flash when it played in the enterprise - but good enough for
the short attention span of consumers).
Then naughty MLC SSDs learned
how to be good. (When strictly managed.)
But thanks to genetic
alteration some naughty MLC has been bred to be much nicer than others.
(Even when the strict controller isn't looking.) This (extra-good) MLC is
always preceded by an "e" to show it's better. (Like email. OK email
vs the pony express -
Postman - kind of mail
which is derogatively called snailmail.)
But other people say you
don't need the expensive "e" in eMLC - because their controllers
empathize better with native naughty flash. (They don't approve of flash
eugenics and they really do care about street bred naughty flash cells being
sent to bad block jail too soon.)
And a new type of naughty flash
which wants to be in with the gang on the enterprise SSD block is TLC (alias
Is your head ready to explode yet?
It's going to get
even more complicated.
Best forget the technical explanations, click
on the ads with the nicest pictures and think of it all as SSD magic.
do you need to allow space for that uncouth MLC flash in your nice clean
It's much cheaper - even when you take into
account the effort of cleaning it up and re-training it than the other kinds
of memory. Even though you still need SLC (good) and RAM (positively angelic)
for ultimate performance.
Comparisons of SLC, MLC and eMLC (pdf)|
|This white paper - by C.C. Wu - Director -
compares the data integrity of several generations and brands of SLC
flash. and confirms what some SSD makers had been telling me since the early
2000s (as reported in my
article) - which is to say that SLC in reality has often been 5x to
20x better than specified in the original memory chip makers'
The same is not true for MLC - however - where the margins
are closer to 2x (consumer MLC) and 3x (eMLC). This means that
experience of what works with real world SLC
SSD controllers and
flash management - cannot be carried through safely to MLC.
presents greater risks of data loss when
suddenly turned off - which in turn requires better architecture to
automatically recover and heal the data.
talks to SSD leaders... re flash in enterprise SSDs|
CEO - re MLC in banks.
Over 80% of the SSDs that
Fusion-io has sold in the last couple of years have been MLC rather than SLC -
and David Flynn
thinks that they probably have a bigger base of enterprise MLC SSDs which has
been operating longer in customer sites (up to 3 years) than any other company.
|Texas Memory Systems
- re MLC and RAM SSDs.|
said current consumer grade MLC nand flash has endurance on the order of 3,000
write cycles. ... And the company's burn-in process (done for QA as part of
manufacturing) would use up 10% of the endurance life before the SSD even
reached the customer!
In many bank applications RAM SSDs are actually
cheaper than flash - because of the small size of the data. ...read the article
enterprise MLC flash?|
In July 2010 - a reader (Rob Mantia)
asked - I was wondering what your opinion is on the decision of some SSD
manufacturers to switch to
MLC flash from SLC flash for their enterprise SSDs and if you think eMLC is
as great as they make it sound (less cost, just as reliable) or if you think
Here's what I said.
The view expressed in the original text
of my 2008 article
Are MLC SSDs Ever
Safe in Enterprise Apps? hasn't changed.
users of flash SSDs have to segment their applications for flash into 2 types -
SLC or MLC (and that "MLC" includes eMLC) depending on the mission
criticality and costs associated with the risk of data corruption.
eMLC mitigates just 1 problem (endurance) of the 4 major risk factors associated
with MLC which are significantly worse for MLC than SLC.
The other 3
intrinsic risk factors are
noise immunity - due to much smaller signal change associated with each
- data integrity - due to physical variations across the chips (MLC
poses more problems for R/W-ability even from the outset in a new chip)
So as per my original article...
- temperature sensitivity - if you subject MLC to extreme temperature
fluctuations you may irrecoverably lose data which the ECC cannot bring back.
That's why MLC SSDs aren't used in
military or industrial
MLC is OK for server apps like video streaming (no big deal if
a few pixels change color).
MLC is risky for storing financial data - like derivatives
models and trades.
|doesn't write amplitude
control make MLC safe?|
|In June 2010 - a reader asked if the
comments in the article - Are MLC SSDs Ever Safe in Enterprise Apps? - were
still valid - given that a few years had elapsed since it was written.|
WD Solid State
Storage - which reduce
write amplitude -
fix the problem of low MLC endurance?
Here's what I said.
- but that's only one of the problems with MLC which was identified in
this article. And this has to be reevaluated with each new flash memory
generation - because the difference in intrinsic
between SLC and MLC gets worse with smaller geometries.
What has got better is the strength of the error correction schemes
which hide the magnitude of raw media defects in MLC.
A lot depends on your environment - because temperature cycling leads to charge leakage - and
there isn't much tolerance in MLC cells. That's another reason that all
industrial temperature SSDs are SLC. (No ECC scheme can fix a device which
has redistributed too much charge.)
The issue of EMC compatibility
(discussed in the original article) remains in my mind an intrinsic difference
which no one else in the industry seems to be worrying about. If you don't have
a noisy power rail or ground rail in your app then the EMC may not be
If you have time - a good test would be to do continuous overwriting
of your SSD with randomly changing data - and each time you fill the disk read
back the whole disk and compute a data checksum. Run this for several weeks or
months to qualify a new SSD (or HDD) for a mission critical app.
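The overwrite and read-back test described above could be sketched like this. It is a minimal illustration, not a qualification tool: it writes to an ordinary file on the drive under test, and the names `overwrite_and_verify` and `qualify` are invented for the example. One caveat worth stating loudly - the read-back may be served from the operating system's cache rather than the flash itself, so a serious run would need direct I/O or a cache flush between the write and the read.

```python
import hashlib
import os


def overwrite_and_verify(path, total_bytes, chunk_bytes=1 << 20):
    """Fill `path` with random data, read it back, compare SHA-256 digests."""
    written = hashlib.sha256()
    with open(path, "wb") as f:
        remaining = total_bytes
        while remaining > 0:
            chunk = os.urandom(min(chunk_bytes, remaining))
            f.write(chunk)
            written.update(chunk)
            remaining -= len(chunk)
        f.flush()
        os.fsync(f.fileno())  # push the data out of Python and OS write buffers

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            read_back.update(chunk)
    return written.hexdigest() == read_back.hexdigest()


def qualify(path, total_bytes, passes):
    """Run repeated fill-and-verify passes; return how many passes failed."""
    failures = 0
    for _ in range(passes):
        if not overwrite_and_verify(path, total_bytes):
            failures += 1
    return failures
```

To use it you would point `path` at a file on the SSD being qualified, set `total_bytes` close to the free capacity, and leave `qualify` running for as many passes as your schedule allows. Any non-zero result is a reason to stop and investigate.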
about EMC compatibility etc in the original article text below...
SSDs More Susceptible to Power Rail Disturbance?|
|As someone who in a past career designed
analog data acquisition products and systems which got right down below the
thermal noise and who cared about the shape and material of PCB tracks I want
to air another concern about the (in)/advisability of using MLC Nand flash
in datacenter applications where there's a lot of power rail disturbance.|
MLC devices have been used in commercial products since 2003 - the products they
have been in (phones and portable music players) have been battery operated
environments where (inside the casing) the environment's overall power rail and
compatibility has been controlled and managed by the system designers who
know enough about these things. And as I say elsewhere in this article - the
consequences of misread data in these applications are trivial.
could say almost the same about the environment for a MLC flash SSD inside a
notebook PC. It's a known, testable environment. Although the user can plug
modules in - they're rarely a high energy disturbance product. The designers
would have tested it with a range of plug-ins, and they've sold millions of
similar notebooks before. There will be few surprises.
An array of
SSDs in a datacenter cabinet is not such a quiet place.
plenty of fast processors all around. Above you - below you. The SSD designer
does not control that space. Every installation is unique.
which you may not be aware of - is that inside an MLC flash chip are
effectively:- a 2 bit analog to digital converter (ADC) and a 2 bit DAC. Between
each of the 4 logic levels there is also an indeterminate band where the signal
should never be. Power line disturbances are 3x more likely to result in a false
read for MLC than SLC, but the overall error comparison gets worse. There's
also a bigger intrinsic risk (for MLC than in SLC) of an error creeping in with
the initial write charge. SSD designers deal with this by surrounding blocks
of MLC flash data with heavier error detection and correction codes than they
would normally use for SLC.
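A toy model shows where the shrinking margin comes from. The sketch below assumes an idealized cell with evenly spaced levels across a normalized voltage window and read thresholds midway between them - real chips are not this tidy, and the function names are made up for the example. It illustrates the size of the margin only. It does not model error probability.

```python
def read_thresholds(num_levels, window=1.0):
    """Thresholds midway between evenly spaced levels across the window."""
    step = window / (num_levels - 1)
    return [step * (i + 0.5) for i in range(num_levels - 1)]


def noise_margin(num_levels, window=1.0):
    """Largest disturbance a stored level tolerates before crossing a threshold."""
    return window / (num_levels - 1) / 2


def sense(voltage, num_levels, window=1.0):
    """Index of the level that a sensed voltage decodes to."""
    return sum(voltage > t for t in read_thresholds(num_levels, window))


slc_margin = noise_margin(2)  # 2 levels: one gap spans the whole window
mlc_margin = noise_margin(4)  # 4 levels: three gaps share the same window

# The same disturbance applied to a stored SLC level 0 and a stored MLC level 1.
disturbance = 0.2
slc_read = sense(0.0 + disturbance, 2)
mlc_read = sense(1.0 / 3 + disturbance, 4)

print("SLC stored 0, read", slc_read)
print("MLC stored 1, read", mlc_read)
```

In this model the MLC margin is one third of the SLC margin, and a disturbance of 20% of the window leaves the SLC read intact while pushing the MLC cell into the neighboring level.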
I found a good detailed discussion of ECC
potential problems in this Denali
Memory ECC: A curiosity for
decades, now essential for MLC NAND flash from which the quote below comes.
the voltage levels closer together for MLC flash the devices are again more
susceptible to disturbs and transient occurrences, causing the generation of
errors which then have to be detected and corrected. If that is not enough for
the chip maker, it poses an even larger problem for the system designer, in that
there is more of a variety of technologies employed among competing flash chip
designs than DRAM makers, for example, would ever dream of."
related discussion about what EMC (not the storage company) can mean for
signal integrity going into a flash SSD see the white paper -
Damping Techniques for PATA SSDs in Military-Embedded Systems (pdf) by
|Flash SSDs are complex systems with a lot
of stuff going on inside.|
Like cars (which use the internal combustion
engine) all flash SSDs from all manufacturers are not the same.
if they have the same capacity and interfaces.
There are many
different process and media management technologies inside a flash SSD
which oems deal with (or not) in their own proprietary ways. These are just some
of the consequences:-
- best to worst wear leveling algorithms can vary product life by a factor of
3 to 1. (That's not too bad. Some so called "SSDs" - which are
actually dumb flash storage bolted to a disk interface - don't have wear
leveling and should not be used in servers at all.)
- best to worst SLC endurance can vary by 30 to 1.
- SLC to MLC endurance can vary from 10 to 1, up to 300 to 1
Buying flash SSDs for enterprise applications should be regarded as an important
qualifying process. Just as you wouldn't buy a traditional
RAID system without
knowing what type of hard disks were inside it, or without knowing something
about the experience of the vendor in enterprise apps - so too you shouldn't
buy flash SSDs without asking about the factors discussed in this article.
- intrinsic electrical noise susceptibility between SLC and MLC is hard to
quantify - but probably on the order of 10 to 1. Although hidden by wrap
and error detection and correction - the possibility of uncorrectable errors
is still greater in MLC - which is unproven in enterprise environments.
risk for users is that many oems who designed SSD architectures for the notebook
market - will try to capture business in the enterprise market - with the same
(or similar) products without dealing with the datacenter's need for better
resilience and data reliability.
And, sadly, I know from my own inbox
that some SSD marketers don't know how much they don't know about their own
market and how much more advanced their competitors are in the field of
|SLC MLC Hybrid SSDs|
|In June 2008 -
announced a new family of flash SSD controllers which enable oems to mix and
match MLC and SLC chips in the same drive. |
The controller can analyze
the incoming files from the host and intelligently move frequently accessed data
to SLC NAND and non-frequently accessed data to MLC NAND. With this innovative
hybrid architecture, the SSD system cost is significantly reduced to a level
comparable to a pure MLC-based SSD, while endurance is significantly enhanced
and comparable to a pure SLC-based SSD.
However, the intrinsically
higher susceptibility of MLC flash to electrical disturbances remains a risk
factor in such hybrid devices.
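The general idea of hot and cold data placement can be sketched as follows. This is a toy policy invented for illustration and is not a description of how any vendor's controller works: it simply counts accesses per logical block and treats the most frequently accessed blocks as candidates for SLC. The class and method names are hypothetical.

```python
from collections import Counter


class HybridPlacement:
    """Toy placement policy: the most frequently accessed blocks go to SLC."""

    def __init__(self, slc_capacity_blocks):
        self.slc_capacity_blocks = slc_capacity_blocks
        self.access_counts = Counter()

    def record_access(self, logical_block):
        """Note one access to a logical block."""
        self.access_counts[logical_block] += 1

    def slc_blocks(self):
        """Blocks currently ranked hot enough to live in SLC."""
        hottest = self.access_counts.most_common(self.slc_capacity_blocks)
        return {block for block, _ in hottest}

    def tier_for(self, logical_block):
        """Return "SLC" for hot blocks and "MLC" for everything else."""
        return "SLC" if logical_block in self.slc_blocks() else "MLC"


policy = HybridPlacement(slc_capacity_blocks=2)
for block in [7, 7, 7, 7, 7, 9, 9, 9, 3]:
    policy.record_access(block)

print(policy.tier_for(7), policy.tier_for(9), policy.tier_for(3))
```

A real controller would also have to decide when to migrate data between the two tiers, and each migration costs extra writes - which is part of why the endurance claims for hybrids deserve the same scrutiny as any other.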
|Looking at MLC Writes in
|The clearest description I've seen
explaining the mechanics of MLC flash writes and the problems presented for an
SSD controller is in a white paper by
Flash Controller & Firmware (which I originally linked to - but is no longer
viewable on their site). From which this quote is taken.|
levels might change due to external conditions such as extreme heat or
magnetism. While the cell itself is not damaged permanently, the bit value might
have changed and a read error might occur. Some more recent flashes have the
capability of recognizing the systematic change in behavior or in the voltage
level so that not the difference to a starting reference voltage, but the
inability to differentiate the relative difference to other voltage levels
produces the read errors."
Editor again... Hyperstone's
description of Incremental Step Pulse Programming reminded me of the Intel 2816
(an early commercially available EEPROM and a forerunner of flash).
I had an early
sample in about 1981. Programming was done by shaped pulses. The Intel
suggested circuit for doing this didn't work, but it was easy to modify.
Locations were programmed interactively using bursts until you read back the
value you'd written - and then wrote some more bursts for safety. If you wrote
too many pulses that could zap the device. Or it might mean the location was
unusable. Later generations of flash memory hid these details from view. But
once you've seen what happens - past the cloaking effect of a flash memory
controller - you appreciate the delicate balances involved in making a working
flash storage drive.
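The pulse, read back and repeat procedure described above can be sketched as a simple loop. This is a simulation with invented numbers, not the Intel circuit and not Incremental Step Pulse Programming as implemented in any real chip (which steps the pulse voltage rather than repeating a fixed pulse). `SimulatedCell`, its charge units and its damage limit are all hypothetical.

```python
class SimulatedCell:
    """Toy cell whose stored charge rises by a fixed amount per pulse."""

    def __init__(self, charge_per_pulse=1, damage_limit=50):
        self.charge = 0  # arbitrary integer charge units
        self.pulses = 0
        self.charge_per_pulse = charge_per_pulse
        self.damage_limit = damage_limit

    def pulse(self):
        """Apply one programming pulse; too many pulses zap the cell."""
        self.pulses += 1
        if self.pulses > self.damage_limit:
            raise RuntimeError("cell over-programmed")
        self.charge += self.charge_per_pulse

    def reads_as_programmed(self, threshold=10):
        """Read back: has enough charge accumulated to read as programmed?"""
        return self.charge >= threshold


def program_with_verify(cell, max_pulses=30, safety_pulses=2):
    """Pulse until the cell reads back correctly, then add a few more for margin.

    Returns the total pulses used, or None if the cell never verified
    (in which case the location would be treated as unusable).
    """
    for _ in range(max_pulses):
        cell.pulse()
        if cell.reads_as_programmed():
            for _ in range(safety_pulses):
                cell.pulse()
            return cell.pulses
    return None


good_cell_pulses = program_with_verify(SimulatedCell())
dead_cell_pulses = program_with_verify(SimulatedCell(charge_per_pulse=0))
```

The cap on `max_pulses` is the important design choice: without it a weak location would be pulsed until the device was damaged, which is exactly the failure described above.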