"The SSD market isn't a democracy. All SSDs are not created equal.
Not even when they have exactly the same memory chips inside."

principles of bad block management in flash SSDs

by Zsolt Kerekes, editor - November 26, 2010
This is a nontechnical introduction to the thinking behind bad block management in flash SSDs - which is just one of the many vital functions performed by an SSD controller.

A lot of reader emails I get show this concept is not widely understood - even by those who are experienced with hard disk and other storage technologies.

I've learned about this by talking to people in the industry. The exact details and algorithms used are proprietary secrets and sometimes covered by patents. But the principles are the same in all SSDs.

In flash devices 2% to 10% of blocks may be error prone or unusable when the device is new.

After that, data in "good blocks" can still be corrupted by charge leakage, disturbance from writes in adjacent parts of the chip, wear-out, and variability in the tolerances of the R/W process in MLC SSDs.

Living with these realities and producing reliable storage devices is part of the black magic of the SSD controller - which uses architecture, data integrity, endurance management and other tricks to ensure reliability.

The explanation below is based on an email I sent to a reader in November 2010.

Controllers remap every time they write to a block - because they try to even out the total writes done on any physical block.

When they get unacceptable errors from a block it's assigned to a dead pool.
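
The two ideas above (remap on every write, and retire blocks which misbehave) can be pictured with a toy model. This is only a minimal sketch and not any vendor's algorithm: the class name `ToyController`, the error threshold and the "least worn free block" policy are all hypothetical choices made for illustration.

```python
class ToyController:
    """Toy model of remap-on-write wear levelling with a dead pool.

    Every name and threshold here is an illustrative assumption,
    not a description of a real SSD controller.
    """

    def __init__(self, physical_blocks, max_errors=3):
        self.erase_counts = {b: 0 for b in range(physical_blocks)}
        self.error_counts = {b: 0 for b in range(physical_blocks)}
        self.free = set(range(physical_blocks))  # blocks available for writing
        self.dead_pool = set()                   # blocks marked "do not use"
        self.mapping = {}                        # logical block -> physical block
        self.max_errors = max_errors

    def write(self, logical):
        """Write a logical block to the least-worn free physical block."""
        if not self.free:
            raise RuntimeError("no spare blocks left - the drive has failed")
        target = min(self.free, key=lambda b: (self.erase_counts[b], b))
        self.free.remove(target)
        old = self.mapping.get(logical)
        self.mapping[logical] = target
        self.erase_counts[target] += 1
        if old is not None:
            self.free.add(old)  # the previous location becomes reusable
        return target

    def report_error(self, physical):
        """Count an unacceptable error; retire the block at the threshold."""
        self.error_counts[physical] += 1
        if self.error_counts[physical] >= self.max_errors:
            self.retire(physical)

    def retire(self, physical):
        """Move a block to the dead pool and relocate any data it held."""
        self.dead_pool.add(physical)
        self.free.discard(physical)
        for logical, mapped in list(self.mapping.items()):
            if mapped == physical:
                # A real drive would first recover the data (for example
                # with its error correction); that step is omitted here.
                del self.mapping[logical]
                self.write(logical)
```

Writing the same logical block repeatedly walks it across all the physical blocks instead of wearing out one location, and a block which keeps reporting errors ends up in `dead_pool` and is never chosen again.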

For every type of flash chip and each process stepping and each manufacturer - the SSD designer needs to know the percentage of dead blocks which they are likely to get during the life of the SSD. (Typically using a design life of 5 years.)

Successfully working around these defects also depends on the strength of error coding - and how the blocks are mapped on the solid state disk.

Using a RAID approach and a population of thousands of flash chips in a rackmount SSD like those made by Violin - gives a higher percentage of blocks which can fail and still leave the SSD usable - because data is striped across blocks.

On the other hand - in consumer SSDs with fewer chips and lower capacity - the striping options are more limited.

The design process results in a bad block budget - for example 4% to 10% - of dead blocks which the SSD can find and yet still operate. Bad blocks are mapped as "do not use". And known good blocks substituted instead. This budget (which is due to media defects) is in addition to the budget which is calculated for attrition of blocks due to wear-out.

The percentage of bad blocks which can be accommodated is a product marketing decision. The spare blocks come from over-provisioning inside the SSD - using capacity which is invisible to the host.

If the number of bad blocks exceeds the budget for any reason - the SSD fails.
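
The budget idea can be written down as simple arithmetic. The sketch below is an assumption-laden illustration, not a real design rule: the class `BlockBudget`, its field names and the example percentages are all hypothetical, and it simply adds the media defect budget to the separate wear-out budget to get the pool of spare blocks hidden from the host.

```python
from dataclasses import dataclass


@dataclass
class BlockBudget:
    """Hypothetical bad block budget for one SSD design."""

    total_blocks: int         # physical blocks in the flash array
    defect_budget_pct: int    # percent reserved for media defects (e.g. 4 to 10)
    wearout_budget_pct: int   # separate percent reserved for wear-out attrition

    @property
    def spare_blocks(self) -> int:
        # Spare capacity comes from over-provisioning; integer maths
        # avoids floating point rounding surprises.
        pct = self.defect_budget_pct + self.wearout_budget_pct
        return self.total_blocks * pct // 100

    @property
    def host_visible_blocks(self) -> int:
        return self.total_blocks - self.spare_blocks

    def is_alive(self, dead_blocks: int) -> bool:
        # The drive survives only while dead blocks stay within budget.
        return dead_blocks <= self.spare_blocks


budget = BlockBudget(total_blocks=10_000, defect_budget_pct=7, wearout_budget_pct=3)
print(budget.spare_blocks, budget.host_visible_blocks)
print(budget.is_alive(1_000), budget.is_alive(1_001))
```

With these made-up numbers the drive hides 1,000 of its 10,000 blocks from the host, and the 1,001st dead block is the one which kills it.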

In the SSD market one of the reasons that some SSDs may have failed early was that SSD designers - who knew too little about what they were doing - used flash chips from sources other than those qualified by the controller manufacturer. That threw away the built-in safety margin. Another problem can arise when the original flash chip manufacturer changes something in their process - which doesn't affect the parameters they are testing for - but does change the way the devices look from the data integrity point of view. That too can tip the balance outside the margins designed into the controller.

Another risk of SSD failures comes from virgin SSD designers who don't know enough about the variance of parameters in the flash chip population. If they choose the bad block budget numbers based on too small a sample - and don't allow enough margin - the controller runs out of spare blocks to assign and dies.
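
To see why a small sample is dangerous, here is one hedged way to picture "allowing enough margin". The rule used (mean plus a number of standard deviations) is my own illustrative assumption and not a method taken from any SSD maker; the function name, the `sigmas` parameter and the sample figures are all hypothetical.

```python
import statistics


def budget_with_margin(observed_bad_fractions, sigmas=3.0):
    """Suggest a bad block budget from measured per-chip bad block fractions.

    Assumed rule of thumb for illustration only: take the sample mean and
    add `sigmas` sample standard deviations as safety margin.
    Needs at least two measurements.
    """
    mean = statistics.mean(observed_bad_fractions)
    spread = statistics.stdev(observed_bad_fractions)
    return mean + sigmas * spread


# Hypothetical measurements from a very small sample of chips.
small_sample = [0.02, 0.03, 0.04]
print(f"suggested budget: {budget_with_margin(small_sample):.1%}")
```

A designer who budgets only for the average seen in a handful of chips has no protection against the worse chips which will turn up in volume production. A bigger sample gives a better estimate of the spread, and the margin is what absorbs it.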

SSDs are only as good as the people who design them and make them. There can be orders of magnitude difference in operational outcomes - even when different SSD makers are using exactly the same memory chips.

References

Most of what I know about this topic comes from dialogs with SSD companies over a period of many years (2003 to 2013). Special thanks to many individuals in these companies:-

Adtron, M-Systems, SandForce, STEC, Texas Memory Systems, Violin Memory and WD Solid State Storage

For those who want to read more about bad blocks in flash SSDs - try these articles.

A detailed overview of flash management techniques (pdf) - gives an overview of flash media management and how good data integrity is the result of many different overlapping processes.

Increasing Flash SSD Reliability - although this article is mainly about endurance - it gives a good insight into how block quality checking and remapping occur as part of the continuous work done by the SSD controller.

Bad Block Management in NAND Flash Memories (pdf) - gives you some idea of the internal support in flash chips for data integrity. This is the lowest level in a data integrity hierarchy which is mostly managed by the SSD controller.
Surviving SSD sudden power loss
Why should you care what happens in an SSD when the power goes down?

This important design feature - which barely rates a mention in most SSD datasheets and press releases - has a strong impact on SSD data integrity and operational reliability.

This article will help you understand why some SSDs which work perfectly well in one type of application might fail in others - even when the changes in the operational environment appear to be negligible.

If you thought endurance was the end of the SSD reliability story - think again.
nice and naughty flash - SLC, MLC, eMLC & TLC in enterprise SSDs
adaptive R/W flash management IP (including DSP) for SSDs
Data Integrity Challenges in flash SSD Design
flash SSD capacity - the iceberg syndrome
Surviving SSD sudden power loss
SSD's past phantom demons
SSD reliability papers
SSD jargon
TMS optimizes SSD architecture to cope with flash plane failure
Editor:- May 26, 2011 - a new slant on SSD reliability architectures is revealed today by Texas Memory Systems who explained how their patented Variable Stripe RAID technology is used in their recently launched PCIe SSD card - the RamSan-70.

TMS does a 1 month burn-in of flash memory prior to shipment. (One of the reasons cited for its use of SLC rather than MLC BTW.) Through its QA processes the company has acquired real-world failure data for several generations of flash memory and used this to model and characterize the failure modes which occur in high IOPS SSDs.

Most enterprise SSDs use a simple type of classic RAID which groups flash media into "stripes" containing equal numbers of chips. RAID technology can reconstruct data from a failed Flash chip. Typically, when a chip or part of a chip fails, the RAID algorithm uses a spare chip as a virtual replacement for the broken chip. But once the SSD is out of spare chips, it needs to be replaced.

VSR technology allows the number of chips to vary among stripes, so bad chips can simply be bypassed using a smaller stripe size. Additionally, VSR provides greater stripe size granularity, so a stripe could exclude a small part of a chip rather than having to exclude an entire chip if only part of it failed (a "plane error"). With VSR technology, TMS says its SSD products will continue operating longer in the installed base.

Dan Scheel, President of Texas Memory Systems explained why their technology increases reliability.

"...Consider a hypothetical SSD made up of 25 individual flash chips. If a plane failure occurs that disables 1/8 of one chip, a traditional RAID system would remove a full 4% of the raw Flash capacity. TMS VSR technology bypasses the failure and only reduces the raw flash capacity by 0.5%, an 8x improvement. TMS tests show that plane failures are the 2nd most common kind of flash device failures, so it is very important to be able to handle them without wasting working flash."

Editor's comments:- by wasting less capacity than simpler RAID solutions - more usable capacity remains available for traditional bad block management.
This extra capacity comes from the over-provisioning budget - a figure which varies according to each SSD design (as discussed in my recent flash iceberg syndrome article) but is 30% for TMS.
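
The arithmetic in the quoted example can be checked in a few lines. This sketch only restates the numbers from the quote (25 chips, a plane failure disabling 1/8 of one chip). The function name and its parameters are hypothetical, and it models nothing about how VSR actually works internally.

```python
from fractions import Fraction


def capacity_lost(chips, failed_fraction_of_chip, whole_chip_granularity):
    """Fraction of raw flash capacity removed after one partial chip failure.

    whole_chip_granularity=True models a scheme which must drop the whole
    chip; False models a scheme which can drop only the failed part.
    """
    per_chip = Fraction(1, chips)
    if whole_chip_granularity:
        return per_chip
    return per_chip * failed_fraction_of_chip


classic = capacity_lost(25, Fraction(1, 8), whole_chip_granularity=True)
finer = capacity_lost(25, Fraction(1, 8), whole_chip_granularity=False)

print(float(classic * 100))  # percent lost when the whole chip is removed: 4.0
print(float(finer * 100))    # percent lost when only the failed part is removed: 0.5
print(classic / finer)       # ratio between the two: 8
```

Losing one chip in 25 is 4% of raw capacity, while losing 1/8 of that chip is 0.5%, which is where the 8x figure in the quote comes from.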
a book - Inside NAND Flash
Editor:- November 17, 2010 - Forward Insights (an SSD analyst company) is one of the contributors to a new book called - Inside NAND Flash Memories.

The publishers say that SSD designers must understand flash technology in order to exploit its benefits and counter its weaknesses. The new book is a comprehensive guide to the NAND world - from circuit design (analog and digital) to reliability.
SSD Data Recovery Concepts
It's hard enough understanding the design of any single SSD. And there are so many different designs in the market.

Have you ever wondered what it looks like at the other end of the SSD supply chain - when a user has a damaged SSD which contains priceless data with no usable backup?
If so - this article, written by Jeremy Brock, President of A+ Perfect Computers and one of a rare new breed of SSD recovery experts, will give you some idea.

StorageSearch.com is published by ACSL