the SSD reliability news page
machine learning vs manual tuning
Editor:- July 29, 2016 - Nearly
every SSD in the market today, from the smallest SSD on a chip to the
bewildering array of rackmount systems, can be viewed as a choice of how to
select and mix the raw ingredients of SSD IP and integrate them into products
which (for better or worse) match up to and satisfy user needs. How these
decisions are made depends on the DNA of the product marketers, the technology
teams, familiarity with and ease of access to some technologies rather than others,
business pressures and timing, the willingness to take risks, and sometimes plain luck.
But all products - no matter how complex they appear - can
be analyzed as a specific set of choices made from the architecture and IP
selections which are possible.
In many articles in the past I've shown
you how - whether you're looking at the design of SSDs or systems - there are
rarely more than 2, 3 or 6 raw available decisions which determine each piece
of the jigsaw. And I know from the feedback I get from SSD specifiers and
architects that these simple classifications can be useful in helping to compare
different products and even in choosing which competitive approaches are similar
enough to make comparisons worthwhile.
But when you get down into the
details of implementation at each layer in the product design - every one of
these dimensional options which go into the permutations blender to shape the
total product identity - can itself be complex and multilayered.
Take the example of the raw magic tuning numbers which set the R/W, program,
erase and threshold voltages, shaping and timing parameters inside a flash memory.
The question of how much and when has been at the heart of what makes some SSDs
better than others ever since flash was first used in SSDs.
Flash designers have spent their whole careers measuring and modeling how these
choices interact with the flash cell and can be tweaked to improve speed, power
consumption and reliability. You can get a flavor of this in my article about
adaptive R/W and DSP ECC IP.
In a conversation with
NVMdurance's CEO - Pearse Coyle -
earlier this year (April 2016) almost the first thing I did was try to relate
and place the work they were doing within the simple frameworks I'd written
about before. So I asked him how similar it was to something which I
wrote a long article about in
April 2012 - when
SMART announced a
range of SandForce
driven SSDs which had 5x higher endurance - while using exactly the
same industry standard controller - but with magic tuning numbers which they had
learned from analyzing the adaptive settings of their own adaptive controller designs.
Pearse said - yes - he knew that work. And what NVMdurance was
doing was the same type of thing.
He said that some leading companies
which had the flash talent had done similar things in their proprietary SSDs.
Pearse told me that as the complexity of flash increased -
with more layers and TLC - it was becoming harder for designers to manually
(using human expertise) guarantee they were choosing the optimum magic
numbers - because there were now so many variables involved.
He said that what was different about NVMdurance was that they were delivering the
magic numbers based on characterizing a sample of typically 100 devices and then
performing machine based simulations to see which numbers would work best -
while also using a multi-stage life cycle model - which was designed to apply
different tuning after each fractional amount of the expected endurance had been
used up.
As far as he knew from his conversations with memory companies -
no-one else had made the same kinds of investments in this machine intensive
modeling - and that was the key difference - because NVMdurance had a proven
process for delivering good tuning numbers over a variety of memory generations.
I hoped at the time that someone would write a paper saying
more about it. Now Tom Coughlin has done that.
His paper - Machine learning enables longer life high capacity SSDs (pdf) - published this week -
describes the background principles and operation of NVMdurance's pathfinder and
plotter software tools and shows you how NVMdurance has tackled this complex
tuning problem to deliver software IP which can give endurance
results similar to those of adaptive R/W controllers - but which doesn't
need such expensive processors or such complex run-time firmware. ...read
the article (pdf)
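As a rough sketch of the multi-stage life cycle idea described above: a controller swaps in a different pre-computed register set once a fraction of the rated endurance has been consumed. All stage boundaries, parameter names and values below are invented for illustration - the real tuning sets are proprietary.

```python
# Hypothetical sketch of multi-stage flash tuning: the controller selects a
# pre-computed register set based on the fraction of rated endurance consumed.
# Stage boundaries and parameter values are invented, not vendor data.

STAGES = [
    # (endurance fraction at which this stage ends, {parameter: value})
    (0.25, {"program_voltage": 18.0, "program_pulse_us": 20, "erase_voltage": 14.0}),
    (0.60, {"program_voltage": 18.6, "program_pulse_us": 24, "erase_voltage": 14.5}),
    (1.00, {"program_voltage": 19.4, "program_pulse_us": 30, "erase_voltage": 15.2}),
]

def tuning_for(pe_cycles_used: int, rated_endurance: int) -> dict:
    """Return the tuning register set for the current wear stage."""
    frac = pe_cycles_used / rated_endurance
    for boundary, params in STAGES:
        if frac <= boundary:
            return params
    return STAGES[-1][1]  # past rated endurance: keep the final stage settings

# A fresh device gets the gentlest settings; a worn one gets stronger pulses.
print(tuning_for(500, 3000))   # early life stage
print(tuning_for(2500, 3000))  # late life stage
```

The point of the offline characterization step is deciding where those boundaries sit and what goes in each register set - the run-time logic itself stays this simple, which is why no adaptive processor is needed in the drive.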
can memory do more?
Editor:- June 17, 2016 - in a new
blog on StorageSearch.com -
I ask - where are we heading with memory intensive systems and software?
The marketing noise coming from the DIMM wars market (flash as RAM and Optane
etc) obscures some important underlying strategic and philosophical questions
about the future of SSD.
When all storage is memory - are there still
design techniques which can push the boundaries of what we assume memory can do?
Can we think of software as a heat pump to manage the entropy
of memory arrays? (The nature of the memory - not just the heat of its data.)
Should we be asking more from memory systems? ...read the blog
It's not worth paying more for SLC reliability in PCIe SSDs says
Google field study
Editor:- February 26, 2016 - A 6 year field study of
PCIe SSDs used by
Google (spanning millions of drive days and chips from 4 different flash
vendors) concluded that SLC drives were not more reliable than MLC.
An important conclusion re RAS is the importance of being able to map
out bad chips within the SSD architecture. This is because somewhere between
2% and 7% of enterprise PCIe SSDs (depending on where they were used) developed
at least one bad chip during the first 4 years in the field - which without such
remapping would have necessitated replacing the failed SSD.
The source is -
Flash Reliability in Production: The Expected and the Unexpected (pdf) - by Bianca
Schroeder (University of Toronto), Raghav Lagisetty and
Arif Merchant (Google).
This is just one of a set of papers
presented February 22 - 25, 2016 at the
14th USENIX Conference on
File and Storage Technologies.
Editor's comments:- For more
like this see the news
archive - June 2015 which had a story about a large scale study of PCIe
SSD failures within Facebook.
Mirabilis discusses role of
deployment level simulation to optimize reliability delivered by SSD
controller design tweaks
Editor:- August 16, 2015 - "A
diligent system designer can extend the life of an SSD by up to 60% by proper
control of over-provisioning, thus reducing TCO" says Deepak Shankar,
Mirabilis Design, in his paper -
the Lifetime of SSD Controllers (pdf) - which discusses the role of
application and deployment level simulations to explore the impact of
changing brews in controller design.
See also:- overprovisioning articles 2003 to 2015
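Deepak Shankar's "up to 60%" figure can be sanity-checked with a toy model: lifetime scales with the NAND write budget divided by amplified host writes, and write amplification falls as over-provisioning rises. The WAF-per-OP table and workload numbers below are hypothetical illustrations, not Mirabilis data.

```python
# Toy model: how over-provisioning (OP) stretches SSD lifetime by reducing
# write amplification (WAF). The WAF values per OP level are hypothetical;
# real figures depend on workload and controller. Usable capacity is held
# fixed here for simplicity.

WAF_BY_OP = {0.07: 3.5, 0.15: 2.8, 0.28: 2.2, 0.50: 1.8}  # hypothetical

def lifetime_days(usable_gb, pe_cycles, host_gb_per_day, waf):
    total_writable_gb = usable_gb * pe_cycles   # raw NAND write budget
    nand_gb_per_day = host_gb_per_day * waf     # host writes, amplified
    return total_writable_gb / nand_gb_per_day

base = lifetime_days(400, 3000, 2000, WAF_BY_OP[0.07])
more_op = lifetime_days(400, 3000, 2000, WAF_BY_OP[0.28])
print(f"lifetime gain from extra OP: {more_op / base - 1:.0%}")  # 59% here
```

With these made-up WAF numbers the gain works out to 59% - the same order as the claim, which is the only point of the sketch.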
the bath tub curve is not the most useful way of thinking about
PCIe SSD failures - according to a large scale study within Facebook
June 15, 2015 - A recently published research study -
A Large-Scale Study of Flash Memory Failures in the Field (pdf) - which analyzed
failure rates of PCIe
SSDs used in Facebook's infrastructure over a 4 year period - yields some
very useful insights into the user experience of large populations of SSDs.
Among the many findings:-
- Read disturbance errors seem to be very well managed in the enterprise SSDs
studied. The authors said they "did not observe a statistically
significant difference in the failure rate between SSDs that have read the
most amount of data versus those that have read the least amount of data."
- Higher operational temperatures mostly led to increased failure rates,
but the effect was more pronounced for SSDs which didn't use aggressive
throttling techniques - which can prevent runaway temperatures by cutting
back write performance.
- More data written by the hosts to the SSDs over time mostly resulted in
more failures - but the authors noted that in some of the platforms studied
more data written resulted in lower failure rates. This was
attributed to the fact that some SSD software implementations work better at
reducing write amplification when they are exposed to more workload patterns.
- Unlike the classic bathtub curve failure model which applies to hard drives
- SSDs can be characterized as having an early warning phase - which comes
before an early failure weed-out phase of the worst drives in the population
and which precedes the onset of predicted endurance based wearout.
In this aspect - a small percentage of rogue SSDs accounts for a disproportionately
high percentage of the total data errors in the population.
The report contains plenty of raw data and graphs which can be a valuable resource
for SSD designers and software writers to help them understand how they can
tailor their efforts towards achieving more reliable operation. ...read
the article (pdf)
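The "rogue drive" observation can be illustrated with a small sketch - what share of all errors comes from the worst few drives in a fleet. The per-drive error counts below are invented for illustration, not Facebook's data.

```python
# Sketch of the "rogue drive" effect: a small share of SSDs contributes
# most of the fleet's errors. Error counts below are hypothetical.

def top_share(errors, top_fraction=0.1):
    """Fraction of all errors contributed by the worst top_fraction of drives."""
    ranked = sorted(errors, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)

fleet = [5000, 3000, 40, 30, 20, 10, 8, 5, 2, 1]  # hypothetical per-drive errors
print(f"top 10% of drives caused {top_share(fleet):.0%} of errors")
```

A metric like this is useful operationally: if error concentration is high, proactively retiring the few rogue drives removes most of the error burden from the population.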
HDD failure rates analyzed by model
Editor:- May 27,
2015 - The reliability of
hard drives in a
cloud related business
(online backup) is revealed in a new report -
Hard Drive Reliability Stats by Backblaze -
which includes results for over 42,000 drives analyzed across 21 drive models.
The failure distribution in the recent quarter is model
and age specific rather than manufacturer specific - which is to say that you
can't say that Seagate is always better or worse than Western Digital. The table
also gives you insights into drive improvements for this type of application.
Failure rates in the quarter were:-
- up to 1 year old - worst model - 13%
- 2 - 3 years - worst model - 27%
- 3 - 4 years - worst model - 3%
- 5 years - worst model - 32%
The data seems to fit in with the
bath tub curve model - with high infant mortality, high failures at the end and
best reliability in the in-between periods. ...read the article
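Reports like Backblaze's typically express drive reliability as an annualized failure rate (AFR): failures divided by accumulated drive-years of operation. A minimal sketch, with hypothetical counts:

```python
# Sketch of the annualized failure rate (AFR) metric used in drive
# reliability reports. The fleet numbers below are hypothetical.

def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """Failures per accumulated drive-year of operation."""
    drive_years = drive_days / 365.0
    return failures / drive_years

# e.g. 65 failures across 2,000 drives each running a full quarter (~91 days)
afr = annualized_failure_rate(65, 2000 * 91)
print(f"AFR = {afr:.1%}")
```

Counting drive-days rather than drives matters because fleets churn mid-quarter; a drive installed halfway through the period contributes only half the exposure.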
high availability enterprise SSD arrays
January 26, 2012 - due to the growing number of oems in the high availability
rackmount SSD market
StorageSearch.com has published a new directory focusing on
HA enterprise SSD arrays.
The new directory will make it easier for users to locate
specialist HA SSD vendors, related news and articles.
Pushing data reliability up hard drive hill
July 4, 2011 - Why didn't
hard drives get more reliable?
Enterprise users are still replacing hard drives according to cycles that
haven't changed much since RAID
became common in the 1990s. So why didn't HDD makers do something to make their
drives last longer?
Error correction code inventor Phil White -
founder of ECC
Technologies - has recently published an article
/ blog in which he describes the 25 years of rejections he's had from
leading HDD makers - and the reasons they said they didn't want to use his
patented algorithms - which he says could increase data integrity and the life
of hard drives (and maybe SSDs too). It makes interesting reading for any other
wannabe inventors out there too. ...read
Phil White's article
But I think another reason for past
rejections might simply have been market economics.
The reliability of HDDs relative to their cost improved so much throughout that period - and at
the same time data capacity needs grew - so maybe the user value proposition
didn't make sense.
If you (a RAID user) find that all your 5 year old
drives are still working (instead of being replaced) - how much is that really
worth? By now those 5 year old drives might only represent 3% to
10% of the new storage capacity you need anyway. (The reliability
value proposition is different outside the service engineer frequented zone - but I
don't want to get side-tracked into that here.)
Looking ahead at the future of the HDD market my own
view is that whatever the industry does with respect to reliability won't tip
the balance against SSDs
in the enterprise.
The best bet for the future of hard drive
makers is in consumer products where fashion ranks higher up the reason to
buy list than longevity. Most people I know replace their notebook pcs, tvs
and phones not because the old ones have stopped working - but because the new
ones have lifestyle features which make them more desirable.
optimizing SSD architecture to cope with flash plane errors
May 26, 2011 - a new slant on SSD
architectures is revealed today by Texas Memory Systems
who explained how their patented Variable Stripe RAID (VSR) technology is used in
their recently launched PCIe SSD card.
TMS does a 1 month burn-in of flash memory prior to shipment. (One of the
reasons cited for its use
of SLC rather than MLC.)
Through its QA processes the company has acquired real-world failure data
for several generations of flash
memory and used this to model and characterize the failure modes which
occur in high IOPs SSDs.
Most enterprise SSDs use a simple type of
classic RAID which groups
flash media into "stripes" containing equal numbers of chips. RAID
technology can reconstruct data from a failed Flash chip. Typically, when a chip
or part of a chip fails, the RAID algorithm uses a spare chip as a virtual
replacement for the broken chip. But once the SSD is out of spare chips, it
needs to be replaced.
VSR technology allows the number of chips to
vary among stripes, so bad chips can simply be bypassed using a smaller stripe
size. Additionally, VSR provides greater stripe size granularity, so a stripe
could exclude a small part of a chip - a "plane error" - rather than having to
exclude an entire chip when only part of it has failed. With VSR
technology, TMS says its SSD products will continue operating longer in the field.
Dan Scheel, President of Texas Memory Systems, explained why their
technology increases reliability.
"...Consider a hypothetical
SSD made up of 25 individual flash chips. If a plane failure occurs that
disables 1/8 of one chip, a traditional RAID system would remove a full 4% of
the raw Flash capacity. TMS VSR technology bypasses the failure and only reduces
the raw flash capacity by 0.5%, an 8x improvement. TMS tests show that plane
failures are the 2nd most common kind of flash device failures, so it is very
important to be able to handle them without wasting working flash."
Editor's comments:- by wasting less capacity than simpler RAID solutions - more
usable capacity remains available for traditional flash
management. This extra capacity comes from the over provisioning budget -
a figure which varies according to each SSD design (as discussed in my recent
flash iceberg syndrome article) but
is 30% for TMS.
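Dan Scheel's arithmetic can be checked directly:

```python
# Checking the arithmetic in Dan Scheel's example: 25 flash chips, with a
# plane failure disabling 1/8 of one chip.

chips = 25
plane_fraction = 1 / 8        # fraction of one chip lost to a plane failure

traditional_loss = 1 / chips              # whole chip mapped out
vsr_loss = plane_fraction / chips         # only the failed plane mapped out

print(f"traditional RAID loses {traditional_loss:.1%} of raw capacity")
print(f"VSR loses {vsr_loss:.1%} ({traditional_loss / vsr_loss:.0f}x better)")
```

The numbers come out as quoted: 4.0% versus 0.5%, an 8x improvement - which is simply the plane fraction (1/8) carried through.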
what happens in SSDs when power goes down? - and why you should care
Editor:- February 24, 2011 - StorageSearch.com today published
a new article -
SSD power is
going down! - which surveys power down management design factors in SSDs.
Why should you care what happens in an SSD when the power goes down?
This important design feature - which barely rates a mention in
most SSD datasheets and press releases - is really important in determining SSD
data integrity and operational reliability. This article will help you
understand why some SSDs which work perfectly well in one type of application
might fail in others... even when the changes in the operational environment
appear to be negligible. If you thought endurance
was the end of the SSD
reliability story - think again. ...read the article
Business opportunities from Intel's imperfect bridge chips
February 9, 2011 - Intel
Knowingly Sells Faulty Chipsets. Are they Crazy? - is a new article on PCWorld.com which discusses how Intel
is dealing with the issue of a bridge chip with known defects in some of its products.
I rarely read that publication because my interests are enterprise storage and
SSDs - but the author Keir Thomas
had linked to StorageSearch.com from another recent article he wrote -
SSDs are Doomed (at Least for Now) - which showed up in my web stats.
When I started my storage
reliability directory in 2006 - I knew that large storage vendors would ship
flaky SSDs and hard
drives - but I assumed that would be due to the unwitting and creeping use of
inadequate design and testing methodologies
- rather than deliberate business decisions.
A characteristic of this Intel chip is that if oems populate all the
RAM slots which it "supports"
- the speed drops down to unattractive levels.
But that's not bad
news for everyone. Adrian Proctor,
VP of Marketing at Viking
told me last month it means there's a growing population of DIMM slots on
motherboards which can't be used for RAM - but could be used instead to save
space and power by installing their
SSDs to replace HDDs as boot drives. Other companies make
1 inch and smaller SSDs too.
comparing SSD and HDD failure rates in retail
December 10, 2010 - the failure rates for SSDs and
hard drives in the
retail channel are compared in a recent article which is part of a regular
feature on the French website HARDWARE.FR.
Although many consumer SSD
designs have been flaky - the apparent similarities suggested in the
French report should not be taken to be typical of SSDs as a whole.
On the contrary - a much bigger difference in field
reliability is suggested by the business models of
industrial SSD makers and
server SSD makers for whom better reliability is
part of the value proposition - and by anecdotal reports which I've had from
many data recovery companies.
10,000x more reliable than RAID?
Editor:- August 26,
2010 - Amplidata
claims that its BitSpread
technology is 10,000x more reliable than current
RAID based technologies
and requires 3x less storage.
Is another new way of fixing
reliability problems in hard disk
arrays worth the effort just as we approach the end of the
hard disk market's life?
- I doubt it. See why in -
this way to the petabyte SSD.
how to make "SSD reliability" believable - marketing
Editor:- July 29, 2010 - StorageSearch.com today published
a new article -
the cultivation and
nurturing of "reliability" in a 2.5" SSD brand.
Reliability is an
important factor in many applications which use
SSDs. But can you trust an
SSD brand just because it claims to be reliable?
As we've seen in
recent years - in the rush for the
SSD market bubble -
many design teams which previously had little or no experience of SSDs were
tasked with designing such products - and the result has been successive waves
of flaky SSDs and
SSDs whose specifications
couldn't be relied on to remain stable and which in many products quickly
degraded in customer sites.
As part of an education series for SSD
product marketers - this new case study describes how one company - which didn't
have the conventional background to start off with - managed to equate their
brand of SSD with reliability in the minds of designers in the embedded systems market.
Anobit aims at SandForce SSD SoC slots
June 15, 2010 - Anobit
announced it is sampling
SSDs based on its patented Memory
Signal Processing (MSP) technology which provide a 20x improvement in operational
life for MLC SSDs in high IOPS server environments.
Using proprietary algorithms that compensate for the physical limitations of NAND
flash, Anobit's MSP technology extends standard MLC endurance
from approximately 3K read/write cycles to over 50K cycles - to make MLC
technology suitable for high-duty cycle applications. This guarantees drive
write endurance of 10 full disk writes per day, for 5 years, or 7,300TB
for a 400GB drive, with fully random data (worst-case conditions).
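The endurance claim is easy to check:

```python
# Checking Anobit's endurance arithmetic: 10 full drive writes per day
# on a 400GB drive, sustained for 5 years.

drive_gb = 400
writes_per_day = 10
years = 5

total_tb = drive_gb * writes_per_day * 365 * years / 1000
print(f"total written: {total_tb:,.0f} TB")  # 7,300 TB, as claimed
```

That is 4TB of host writes per day, every day, for 5 years - which is why the "fully random data, worst-case" qualifier matters: any compressible or cacheable workload would stress the flash less.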
First-generation Anobit Genesis SSDs deliver 20,000 IOPS random
write and 30,000 IOPS random read, with 180MB/s sustained write and 220MB/s
sustained read.
Anobit says that some of the world's largest NAND
manufacturers, consumer electronics vendors and storage solution providers
currently utilize Anobit's MSP technology in their products.
"For too long, the high prices of SLC SSDs and
MLC SSD endurance have slowed the adoption of
flash memory storage in
the enterprise. Anobit Genesis SSDs effectively neutralize both of these
concerns," said Prof. Ehud Weinstein, Anobit CEO. "By delivering true
reliability at affordable MLC SSD prices, Anobit Genesis SSDs unlock the
full promise of solid-state enterprise storage."
Editor's comments:- superficially the endurance delivered by
Anobit's SSD controller
is better than that obtainable from
SandForce - whereas
the performance lead is the other way around. For most oems what will be more
important is that they do not need to be locked into a single technology
supplier to get adequate metrics for their MLC SSD product lines.
flash SSD integrity architectures for space-craft
April 13, 2010 - for those interested in
flash SSD data
integrity issues - Phil White, President of ECC Technologies has
released a white paper -
Memories for Spacecraft (doc).
Phil has been working with ECC for
almost 37 years and his company is developing future ECC designs to
allow systems architects to develop
NAND flash memories that
are highly reliable
and fault-tolerant even if the NAND flash chips themselves are not so reliable.
NASA is using ECC Tek's designs in
multiple missions. 2 of the designs are in space at the present time and are
working perfectly. Phil White recently wrote a document for NASA and
JPL which outlines how to design NAND
Flash memories for spacecraft. The 22 page "preview" document
excludes confidential data but gives a taste of the technology available for
licensing. ...read the article
XLC promises "enterprise" hybrid x4 SSDs
April 1, 2010 - XLC
Disk announced details of a paper it will discuss later this month
at the NV
Memories Workshop (UC San Diego) called - "Paramagnetic Effects on
Trapped Charge Diffusion with Applications for x4 Data Integrity."
The company says its findings could have applications in the enterprise storage
market by solving the data integrity problems in x4 MLC SSDs within a new class
of hybrid storage drives. ...read the article
New Integrity Tool for Old Tape Archives
January 18, 2010 - Crossroads
Systems announced details of ArchiveVerify - a new monitoring option
that safeguards the future readability of data
backed up on tape.
"In our experience, the Achilles' heel of a data recovery
strategy is often the uncertainty of the data's readability, and this single
point of failure can render the entire restore process useless," adds
Bernd Krieger, Managing Director at Crossroads Europe.
Editor's comments:- Crossroads was originally a specialist in
the SAN router business.
In recent years it has done a lot of work in the area of tape archive data integrity.
I've read lots of their whitepapers which describe their research and products
addressing data integrity. Although there has been a historic trend for users
to migrate away from
tape to disk backup - many super users of huge
tape libraries (with the
biggest archives) will be the last to migrate away - due to logistics and cost.
It's those kinds of users who can benefit most from automated tools or services
which increase the data integrity they achieve and cut down media waste and cost.
New article - Data Integrity Challenges in flash SSD Design
October 12, 2009 - StorageSearch.com
today published a new article called - Data Integrity
Challenges in flash SSD Design - written by Kent Smith, Senior
Director, Product Marketing, SandForce.
Since bursting onto the SSD scene
in April 2009,
SandForce has achieved remarkably
high reader popularity.
How did a company whose business is designing SSD controllers
achieve this? - especially when the direct market for its products today
numbers less than 1,000 oems.
The answer is - that if you want to know
what the future of 2.5"
enterprise SATA SSDs might look like - you have to look at the
leading technology cores that will affect this market. Even if you're not
planning to use SandForce based products yourself - you can't afford to ignore
them - because they are setting the agenda in this market.
Reliability is the
next new thing
for SSD designers and users to start worrying about. A common theme you will
hear from all fast SSD
companies is that the faster you make an SSD go - the more effort you
have to put into understanding and engineering data integrity to eliminate the
risk of "silent errors." ...read the article
Real World Reliability in High Performance Storage
August 20, 2009 - Density Dynamics
published a whitepaper called - Real
World Reliability in High Performance Storage (pdf).
It compares real world failure rates for
flash SSDs with
predicted MTBF and endurance
data and suggests that the big discrepancies reported by users are due to the
nature of their workloads. In this respect it suggests
RAM SSDs are better in write intensive
apps - even taking into account the MTBFs of batteries and UPS like components.
It also cites my own article -
RAM Cache Ratios
in flash SSDs.
Why Consumers Can Expect More Flaky Flash SSDs!
August 10, 2009 - a
new article published
today on StorageSearch.com
explains why the consumer flash SSD quality problem is not going to get
better any time soon.
You know what I mean. Product recalls,
firmware upgrades, performance downgrades and bad behavior which users did
not anticipate from reading glowing magazine product reviews. And that's if
they can get hold of the new products in the first place.
I predicted this unreliability scenario many years ago. And you have to get used to it.
The new article explains why it's happening and gives some suggested
workarounds for navigating in a world of imperfect flash SSD products.
Ramtron's F-RAM Casualty of Auto Market Crash
May 7, 2009 - Ramtron
said its revenue fell
26% in the 1st quarter of 2009 compared to the year ago period. A
sharp decline in orders from the automotive market was cited as a principal factor.
Ramtron also announced an update on a legal suit related to
in-field failures of one of its F-RAM memory products in an unspecified
application. (In July 2008 Ramtron confirmed that specific batches of product
had failed due to manufacturing
defects in one of its partners' fabs.)
Ramtron also announced
today that, over the next 2 years, it will transition the manufacturing of
products that are currently being built at Fujitsu's chip foundry located in
Iwate, Japan to its foundry at Texas Instruments in Dallas, Texas and to its
newest foundry at IBM Corp in Essex Junction, Vermont.
Why You Need Better ECC Inside the SSD
April 16, 2009 - this week SandForce
published an article on the subject of effective
error correction in flash SSDs.
I like it because it resonates well with the thinking that
led me to publish this reliability page 3 years ago.
At that time - I
was concerned with the theoretical inadequacy of error correction used inside
hard drives. (Something
which has since been confirmed in practice and reported in some of the papers
cited at the top of this page.)
SandForce's short article shows you the
consequences - in terms of uncorrectable errors - if you use "industry
standard" strength ECC. And that's part of the sales pitch for the 10-to-the-minus-something-better
error protection in their new SSD controller.
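As a sketch of the underlying math (not SandForce's actual figures): assuming independent bit errors, the probability that a codeword contains more raw errors than the ECC can correct falls off steeply with correction strength t. The codeword size and raw bit error rate below are illustrative assumptions.

```python
# Sketch of why ECC strength matters: probability that an n-bit codeword
# suffers more raw bit errors than the ECC can correct, assuming i.i.d.
# bit errors. Sector size, raw BER and t values are illustrative only.

from math import comb

def p_uncorrectable(n_bits: int, raw_ber: float, t_correctable: int) -> float:
    """P(more than t bit errors in an n-bit codeword)."""
    p_ok = sum(
        comb(n_bits, k) * raw_ber**k * (1 - raw_ber)**(n_bits - k)
        for k in range(t_correctable + 1)
    )
    return 1 - p_ok

n = 4096 + 224  # a 512-byte sector plus spare ECC bits (illustrative)
for t in (4, 8, 16):
    print(f"t={t:2d}: P(uncorrectable codeword) = {p_uncorrectable(n, 1e-4, t):.2e}")
```

Each step up in t buys several orders of magnitude - which is exactly the "10 to the minus something better" framing, and why "industry standard" strength can be inadequate as raw flash error rates worsen.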
How Good SSD Controllers Manage Flash Data Integrity
April 3, 2009 - SNIA
has published a new white paper -
"Flash Solid State Storage for the Enterprise - an in-depth Look at Reliability" -
co-authored by Jonathan Thatcher, Jim Handy (Objective
Analysis) and Neal Ekker (Texas Memory Systems).
The article contains the best integrated explanation I've seen of the design
trade-offs for error correction schemes and how they affect bit error rates
compared to the raw uncorrected results. It goes on to explain the
importance of the SSD controller and memory architecture (dispersing data
among many chips) and how these can improve data integrity by managing read
disturb errors. It also discusses wear-leveling and write amplification which
have been well covered elsewhere. ...read the article
SSD Reliability -
Understanding Data Failure Modes in Large Solid State Storage Arrays
SSD Bookmarks from Texas Memory Systems
March 16, 2009 - Texas
Memory Systems' President, Woody Hutsell - shares his SSD bookmarks
with readers of StorageSearch.com.
Readers who know the SSD industry well mostly think of TMS as a company which makes
very fast SSDs for accelerating
SAN resident applications.
But in the many discussions I've had with Woody Hutsell during the past decade -
"reliability" has also been a frequent topic in our conversations.
That's because when you manufacture products which pack more memory chips than
anyone else has ever put into a single box - all those "10 to the minus
something" numbers which relate physics to semiconductor memory
effects - add up to design problems which are far from theoretical. TMS has
been engineering solid state storage systems for decades.
So I was not surprised to see an in depth paper about reliability being one of
the articles in this
list of bookmarks.
New Tool Acts as Bouncer for Up Market Tape Joints
Colo. - February 3, 2009 - Spectra Logic has extended its Media
Lifecycle Management technology outside the library with a new reader.
The MLM Reader (approx $2,500) is a portable device
that allows customers to check tape health on any computer through
USB, without loading the
tape into a library, and
is designed to proactively identify faulty tape media before it is required for
a data restore. It tracks over 30 non-volatile statistics about data tapes,
such as export details; remaining capacity; encryption information; number of
reads and writes; date of last access; born-on date; and cleaning log. ...Spectra Logic profile
SiliconSystems Proposes New Methodology for Realistically
Predicting Flash SSD Reliability
December 15, 2008 - Gary Drossel, VP Product Planning at SiliconSystems
has written a new article - "NAND Evolution and its Effects on SSD
Useable Life".
This is probably one of the 3 most
significant articles on the subject of flash SSD
reliability which have been published in recent years. Starting with a tour of
the state of the art in the flash SSD market and technology the paper
introduces several new concepts to help systems designers understand why
current wear usage models don't give a complete picture.
- Write amplification - is a measure of the efficiency of the SSD
controller. Write amplification defines the number of writes the controller
makes to the NAND for every write from the host system.
The paper discusses
the theoretical expected lifetimes and amplification factors for several
applications and concludes that measurement of wear-out in real applications is
the best way to understand what is happening. It suggests that systems
designers can use the company's SiliconDrive (which includes real-time on-chip
endurance monitoring) as an endurance analysis design tool. By simply
plugging in SiliconDrive(s) to a new application for a day, week or month - the
percentage of wear-out can be measured - and corrective steps taken (in software
design or overprovisioning) to correct reliability problems.
- Wear-leveling efficiency - reflects the maximum deviation of the
most-worn block to the least worn block over time.
What isn't stated in the article - but is a logical inference - is that even if
your product design goal is to buy SSDs from other oems - SiliconDrives can
be used in your design process to capture information in a non invasive manner
which is difficult or impossible to collect using other instrumentation. ...read the
article (pdf), ...SiliconSystems profile
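The two metrics Drossel defines translate into trivial formulas; a sketch with hypothetical numbers:

```python
# The two metrics from the SiliconSystems paper, as simple functions.
# All numbers below are hypothetical illustrations.

def write_amplification(nand_writes_gb: float, host_writes_gb: float) -> float:
    """GB written to NAND per GB written by the host (Drossel's definition)."""
    return nand_writes_gb / host_writes_gb

def wear_leveling_deviation(block_erase_counts: list[int]) -> int:
    """Max deviation between the most-worn and least-worn block."""
    return max(block_erase_counts) - min(block_erase_counts)

print(write_amplification(nand_writes_gb=330.0, host_writes_gb=100.0))  # 3.3
print(wear_leveling_deviation([1200, 1185, 1210, 1198]))                # 25
```

Both inputs are exactly what on-drive endurance monitoring exposes - which is the paper's point: measure them in the target application rather than relying on theoretical workload models.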
iStor Unlocks High Availability Features in Installed iSCSI ASICs
IRVINE, Calif. - October 7, 2008 -
iStor Networks, Inc. has begun shipping a new version of its
software, v2.5, as a no-cost upgrade for all its iSCSI storage solutions.
This software will provide dual-controller
iS512 systems with the
ability to automatically detect malfunctions in the operational controller and
to switch to the redundant controller without loss of data, function or performance.
"This new software capitalized on the patented
capabilities of iStor's ASIC technology enabling HA capability with no
impact upon system performance before, during or after a controller failure,"
said Jim Wayda, iStor's VP of Software Development. "iStor designed its
controllers from the very beginning to deliver advanced functionality such as HA
and we are very proud that we have been able to demonstrate the investment
protection inherent in iStor's approach of implementation..."
Can You Trust Your Flash SSD's Specs?
Editor:- July 9, 2008 -
STORAGEsearch.com today published a new article which asks - Can you
trust your flash SSD specs?
The flash SSD market
opens up tremendous opportunities for systems integrators to leverage solid
state disk technology. But due to the diversity of products in the market and
lack of industry standards - it's got tremendous risks as well.
The product which you carefully qualified may not be identical to the one that's
going into your production line - for a variety of reasons... ...read the article
Preparing for the Next Phase in the SSD Market Revolution
Editor:- June 25, 2008 -STORAGEsearch.com
today called for new papers on the theme - "Understanding Data
Failure Modes in Large Solid State Storage Arrays".
Solid state storage arrays are seeping into the server environment in the same
way that RAID systems did
back in the early 1990s.
But just as those RAID pioneers learned that
there was a lot more to making a reliable disk array than stuffing a bunch of PC
hard disks into a
box with a fan and a
power supply - so too will multi-terabyte
SSD users discover that
problems which are undetectable or do no harm in small SSDs can lead to
serious data corruption risks when those same SSDs are scaled up without the
right architecture - and sometimes even with it in place.
I know from
the emails I get that many readers think that once they've looked at the
single issue of flash
endurance - they've covered the bases for enterprise SSDs. That's
why STORAGEsearch.com is planning to publish a collection of definitive
technology articles to help guide the industry through this risky transition.
The new articles will provide users with the theoretical
justifications they need when they are faced with the difficult economic choices
that come from deploying different types of SSDs (with different cost models)
in diverse applications within their organizations. ...read the article
Disk Error Correction Company Gets $22 million Funding
Santa Clara, Calif. - April 9,
2008 - Link_A_Media Devices Corp secured $22 million in Series funding.
The round, led by
AIG SunAmerica Ventures,
included 4 additional financial and corporate investors.
Link_A_Media Devices is developing a new class of
chip controller resident
data recovery solutions for
SSDs. These are
designed to exceed the performance of conventional methods deployed in
peripheral storage devices, as well as provide adaptive features that can be
used during manufacturing to improve drive yields and product margins.
Editor's comments:- MLC flash SSDs have higher intrinsic
error rates and failed devices are currently unrecoverable. It looks like Link_A_Media's
technology could improve the odds of
data recovery in
failed devices which incorporate its technology (as well as reducing data errors
while the SSD is still operational.)
Another side effect of their
technology may be better performance. The company
says its IOP
Buster architecture enables scalability within the controller to address
various segments of SSD applications seamlessly. It enables faster Read and Write operations.
Spectra Libraries will Log Tape Health Metrics
ORLANDO, FL - April 8, 2008 - Spectra Logic announced details of its
soon to be released new Media Lifecycle Management software for its tape libraries.
MLM will reduce backup failures by tracking
more than 30 pieces of information about individual LTO tapes and logging this
on the tape's built-in flash chip. Information such as: born-on date, number
of reads and writes, error rate, media quality, date of last access, application
usage, encryption information, cleaning log and remaining capacity are tracked.
MLM and BlueScale are compatible with all major
Editor's comments:- with the tape
library market already past its peak and now in its declining years, it looks
like customers will get all kinds of useful
information and services which they probably would have liked to have before.
This sounds similar in concept to the
SMART logs in hard disks.
Pillar's Petabyte Arrays are 99.999% Available
San Jose, Calif. - April 7, 2008 - Pillar Data Systems today announced
availability of the Pillar Axiom 500MC - a mission critical storage system.
The Pillar Axiom 500MC delivers up to 192GB of cache, with the ability
to scale capacity to 1.6 petabytes. The system supports both
fibre channel and
SATA disk drives.
Pillar guarantees 99.999% availability. ...Pillar profile
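For context, five nines leaves a strikingly small downtime budget - a quick back-of-envelope check:

```python
# 99.999% availability allows roughly 5 minutes of downtime per year.
availability = 0.99999
minutes_per_year = 365.25 * 24 * 60
downtime_minutes = (1 - availability) * minutes_per_year
print(round(downtime_minutes, 2))  # ~5.26 minutes per year
```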
Does Unhappy Notebook Maker Have High Rate of SSD Flash Backs?
March 19, 2008 - a report discussed in an article on CNET saying that
flash SSDs in notebooks are incurring double digit customer reject rates has
been dismissed by Dell as "untrue."
Study Enumerates Key Factors in Disk Array Failures
March 6, 2008 - a recently published paper called - Are Disks the Dominant
Contributor for Storage Failures? - reports on a 3 year study of nearly 2
million operating disks.
Among the many findings:- the
annualized failure rate in near-line systems which mostly use
SATA disks is
approximately twice as high as in systems which mostly use fibre channel disks.
But other factors such as datapath resilience, the presence or absence of redundancy, and the
reliability of the
rack system components are just as significant contributors to storage
reliability as the hard disks themselves.
Are MLC SSDs Ever Safe in Enterprise Apps?
27, 2008 - STORAGEsearch.com published a new article today called - Are
MLC SSDs Ever Safe in Enterprise Apps?
This is a follow up
article to the popular
SSD Myths and
Legends which, in early 2007, demolished the myth that flash memory
wear-out (a comfort blanket beloved by many
RAM SSD makers)
precluded the use of flash in heavy duty datacenters.
The new article looks at the risks posed by MLC Nand Flash SSDs which have recently
hatched from their breeding ground as chip modules in cellphones and morphed
into hard disk form
factors. It starts down a familiar lane but an unexpected technology twist
(which arrived in my email this morning) takes you to a startling new world of
possibilities. ...read the article
WEDC Targets Medical CompactFlash Market
Phoenix, AZ - December 19,
2007 - White Electronic Designs Corp is leveraging its defense industry
experience and expertise to develop high-reliability modules for the growing
portable medical device market.
According to the U.S. Census
Bureau, there will be an expected 40 million persons in the U.S. over the age of
65 by 2010, driving the need for portable medical devices, especially for home
use. The portable medical device market is driven by the same requirements
and expectations as the defense segment, such as high quality and reliability,
shorter development cycles, a well-defined and documented supply chain and
extended product lifecycles. Among other products WEDC designs and
manufactures one of the industry's first medical series CompactFlash cards.
Editor's comments:- WEDC has also recently
published a paper - Is
CompactFlash Really Created Equal? (pdf) - which uses the medical
instrumentation market as the backdrop for a discussion about
flash SSDs similar
to those concerns analyzed in
SSD Myths and
Legends - "write endurance" - which looked at the enterprise market.
Patent May Suit High Reliability SSD OEMs
MINNETONKA, MN - November 23, 2007 - ECC
Technologies, Inc. announces that its parallel Reed-Solomon error
correction designs and US Patent are immediately available for licensing.
The PRS encoder and decoder designs allow parallel I/O storage devices to
be designed with automatic, built-in backup (fault-tolerance). PRS applied to
flash SSDs (for
example) enables SSDs to be designed that can tolerate NAND Flash chip failures.
PRS can also be applied to Hard Disk Arrays. Potential licensees can read
about the PRS technology applied to
SSDs and to
HDDs on these
preceding links. ...ECC Technologies profile
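ECC Technologies' PRS scheme is Reed-Solomon based; as a much-simplified illustration of the underlying idea, XOR parity is the one-check-symbol special case, tolerating a single failed chip (this is my sketch, not their design):

```python
# Hedged sketch: PRS uses parallel Reed-Solomon codes; XOR parity is the
# simplest special case (one check symbol, tolerates one failed chip).
def make_parity(chips):
    """chips: list of equal-length byte strings, one per NAND chip."""
    parity = bytearray(len(chips[0]))
    for chip in chips:
        for i, b in enumerate(chip):
            parity[i] ^= b
    return bytes(parity)

def rebuild(surviving_chips, parity):
    """Recover the one missing chip from the survivors plus parity."""
    missing = bytearray(parity)
    for chip in surviving_chips:
        for i, b in enumerate(chip):
            missing[i] ^= b
    return bytes(missing)

chips = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]
p = make_parity(chips)
# Chip 1 fails; rebuild its contents from the other chips plus parity:
recovered = rebuild([chips[0], chips[2]], p)
print(recovered == chips[1])  # True
```

Full Reed-Solomon generalizes this to multiple check symbols, so several simultaneous chip failures can be tolerated.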
Editor's comments:- in the early days of a fast growing technology market most
vendors are too busy growing their revenue by selling products to customers.
But when markets get big enough or growth rates slow down - another round kicks
in - of harvesting money from those who succeeded in the market - but didn't
protect themselves properly with patents.
When I was a young engineer
several designs of mine did get patented. In one particular company I remember
being asked to leaf through some 10 year old logbooks of my predecessors to find
some prior art to help nullify a competitor's potential attack. I always
preferred doing things my own way - so I grumbled at being asked to delve into
these dusty old files. But I did find what my boss was looking for.
Panasas Solution Targets RAID Unreliability
FREMONT, CA - October
9, 2007 - Panasas, Inc. announced the Panasas Tiered Parity
Architecture which the company claims is the most significant extension to disk
array data reliability since Panasas CTO Garth Gibson's pioneering RAID
research at UC-Berkeley in 1988.
With the release of the
ActiveScale 3.2 operating environment, Panasas will offer an innovative
end-to-end Tiered-Parity architecture that addresses the primary causes of
problems and provides the industry's first end-to-end data integrity checking
capability. RAID architectures protect against disk failures by calculating and storing parity data along with
the original data.
In the past 10 years, individual disk drives have
become approximately 10x more reliable and over 250x denser than those protected
by the first generation RAID designs in the late 1980s. Unfortunately, the
number of disk media failures expected during each read over the surface of a
disk grows proportionately with the massive increase in density and has now
become the most common failure mode for RAID. A RAID disk failure can cause loss
of all the data in a volume which may be tens of terabytes or more. Recovery
of the lost data from tape
(assuming that is all backed up) can take days or even weeks.
Other storage system vendors recognize this same issue and apply RAID 6, often called
double parity RAID, to address this problem. Double parity schemes only treat
the symptom of the failure, not the cause, and they carry substantial cost
and performance penalties, which will only get worse as disk drive densities
continue to increase.
Panasas Tiered Parity architecture directly
addresses the root cause of the problem, not the symptom. Solving the storage
reliability problem caused by these new 1TB and larger disks allows Panasas to
build larger and more reliable storage systems that allow users to get more value from
their data and are less expensive for IT to support.
"The challenges with storage system reliability today have
little to do with overall disk reliability, which is what RAID was designed to
address in 1988. The issues that we see today are directly related to disk
density and require new approaches. Most secondary disk failures today are the
result of media errors, which have become 250x more likely to occur during a
RAID failed-disk rebuild over the last 10 years," said Garth Gibson, CTO of
Panasas. "Tiered Parity allows us to tackle media errors with an
architecture that can counter the effects of increasing disk density. It also
solves data path reliability challenges beyond those addressed by traditional
RAID and extends parity checking out to the client or server node. Tiered Parity
provides the only end-to-end data integrity checking capability in the industry."
Editor's comments:- the problem of data corruption in large data sets because of
obsolete technology assumptions built into hard disks, interface and RAID
products has been looming for several years. You can see articles and research
about this on the storage reliability page.
Is the solution more reliable hard drives?
better interfaces? or a smarter storage OS? Users can't wait another 5 years
for ideal solutions because the symptoms are there today when you look. The
Panasas solution sounds like a pragmatic tactical approach for some customers -
but the industry is a long way from a better storage reliability mousetrap.
Why Sun will Shine with a New Lustre
SANTA CLARA, Calif - September 12, 2007 - Sun Microsystems, Inc. today said
it will acquire the majority of Cluster File Systems, Inc.'s
intellectual property and business assets, including the Lustre File System.
Sun intends to add support for Solaris OS on Lustre and plans to
continue enhancing Lustre on Linux and Solaris OS across multi vendor hardware
...Sun Microsystems profile,
Acquired storage companies
Editor's comments:- I hadn't heard of this company before. A sure sign that they
were heading straight for the
gone away storage
companies list without any deviations en route. Here's what I picked up from
their web site. Their
product description (pdf) says - "the Lustre architecture was first
developed at Carnegie Mellon University as a research project in 1999."
The company's website started in about 2001 and they released Lustre 1.0 in
2003 - so it took a while before they had a product ready for a bigger market.
Strangely enough Solaris
support isn't listed as a strong feature in their recent
roadmap. So why does Sun
want this technology? - Well - even if you're not in the supercomputer business
- some technologies which start there eventually trickle down to the rest of us.
"Zero single points of failure" - mentioned on their home page - is a
good enough reason. As I wrote in my
7 year storage market
predictions (2005) storage
reliability is going to become a major headache in enterprise storage in the
next 5 years.
See also:- Robin
Harris's blog which explains the business background to CFS - "why
aren't they rich?"
Tapewise Enterprise Checks Tape Media Errors
UK - September 18, 2007 - Data Product Services today announces the
release of Tapewise Enterprise.
Tapewise is software that writes
data to a tape and then reads it again, tracking any errors, soft recoverable
ones or unrecoverable ones, that occur. It streams a whole tape through a drive
in this way and, with its Tape Error Map technology, produces a 3D graph
showing errors encountered along the length of a tape when data was being read.
The user can decide what an acceptable error rate is
and that boundary will be shown on the graph with any error rates above the
user-defined norm instantly visible. The software supports a large number of
tape formats: 3480; 3490; DLT; SDLT; 3590; 9840; 9940; T10000; LTOs 1, 2 and 3
and 3592. Costs start at $16,000 approx. A free 14-day evaluation copy is
available. ...Data Product
Noise Damping Techniques for PATA SSDs
August 10, 2007 - SiliconSystems today published a new white paper
called - "Noise Damping Techniques for PATA SSDs in Military-Embedded
Systems". This article looks at electronic signal integrity
issues in integrating high speed PATA SSDs. It helps electronic designers
understand how factors such as ground bounce, loading, power supply noise and
signal trace mismatches can lead to false data or even device damage. Examples
given in the tutorial style commentary include scope shots and logic analyzer
traces. ...read the article, ...SiliconSystems
profile, storage chips,
Editor's comments:- the article gives a good grounding (couldn't resist that one) in
the signal quality factors needed to get high speed
operation and is equally relevant to
hard disks. To simplify
the 20 page document:- if you connect reliable electronic modules using
unreliable signal paths - that will compromise the integrity of the data. Logic
states are virtual - but digital signals are real and can have completely
different shapes to what you expect if you don't follow basic rules.
Squeak! - Green Storage - What's Green. What's Not
Editor:- June 24,
2007 - STORAGEsearch.com today published a new article - Green Storage -
Trends and Predictions.
There's a lot of nonsense in the media
about so called "Green Storage". This article blows away the smoke
and clears the air for a better view of forward looking green data storage
technologies. Reliability gets an honorable mention. Find out what's really
green - and what's not. ...read the article
Unreliability Costs are Reason to Switch to SSDs
Aliso Viejo, Calif., May 30, 2007 - SiliconSystems, Inc. today announced the
publication of a white paper called - "Solid-State Storage is a
Cost-Effective Replacement for Hard Drives in Many Applications."
The paper cites data from Google
and Carnegie Mellon University that
indicates hard drive
field failure rates are up to 15x greater than quoted in disk
manufacturer data sheets. The white paper was developed by SiliconSystems to
educate OEMs about the numerous technical and business decisions they must
successfully navigate to select the best storage solution for their application.
...read the article (pdf)
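The gap between datasheet and field numbers is easy to put in perspective. Under the usual exponential-lifetime assumption (my illustration, not SiliconSystems' math), a datasheet MTBF converts to an annualized failure rate like this:

```python
import math

def afr_from_mtbf(mtbf_hours, hours_per_year=8760):
    # AFR = 1 - exp(-t/MTBF) for a constant failure rate (assumed model)
    return 1 - math.exp(-hours_per_year / mtbf_hours)

datasheet_afr = afr_from_mtbf(1_000_000)   # a typical quoted MTBF
print(round(datasheet_afr * 100, 2))       # ~0.87% per year
print(round(datasheet_afr * 15 * 100, 1))  # 15x worse: ~13.1% per year
```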
Debunking Misconceptions in SSD Longevity
Editor:- May 11,
2007 - BiTMICRO Networks today published a new article called - "Debunking
Misconceptions in SSD Longevity."
It cites lifetime
predictions from my own popular article -
SSD Myths and
Legends - "write endurance" and fires a warning shot aimed at
some competitors by saying "some
flash SSD makers
have even quoted higher write endurance ratings than those provided by
manufacturers of their flash memory chips."
That's certainly true - but I knew when
writing my article that endurance varies from batch to batch of flash chips
within the same semiconductor fab process. Some SSD oems
sample test and
reject chips which are at the lower end of the distribution curve. That
means their worst case numbers are better than would be the case by simply
accepting merchant quality flash chips. Although starting from a different base
of assumptions - the
article "conclude(s) that fears about the endurance limitations of
SSDs are rightfully fading away."
Seagate Drops Notebook Drives
SCOTTS VALLEY, Calif - March 12, 2007 - Seagate Technology today announced the
worldwide availability of a 7,200 RPM hard drive with free-fall
protection for beefed-up laptop durability.
The Momentus 7200.2 delivers up to 160GB of capacity and has a
SATA interface. The
hard drive is also
offered with an optional free-fall sensor to help prevent drive damage and data
loss upon impact if a laptop PC is dropped. The sensor works by detecting any
changes in acceleration equal to the force of gravity, then parking the head off
the disc to prevent contact with the platter in a free fall of as little as 8
inches. Editor's comments:- Hitachi revealed details
about its similar
ESP drop sensor
in 2005. The drop sensor approach is better than nothing, but doesn't get
around the unavoidable fact that hard disks can break when dropped. Another
approach is that of Olixir
Technologies who have marketed repackaged high performance hard drives which
can be dropped repeatedly onto
a concrete floor from 6 feet and still survive. And
solid state disks are
inherently even tougher than that because there are no internal moving parts to
crash together. That's
why they have been used in space ships, helicopters and missiles. In 2006
In-Stat predicted that
half of all mobile computers would use SSDs (instead of
hard disks) by 2013.
It's not just the ruggedness and better power consumption. A video
by Samsung demonstrates the advantages more graphically.
Hard Disk MTBF Specs Incredible - Say
February 28, 2007 - an article published today in Channel Insider - "Hard
Disk MTBF: Flap or Farce? - casts serious doubt on the inflated MTBF claims made
by all hard disk manufacturers.
Reviewing a number of recently
published reliability studies from end users - the author David
Morgenstern says "...there's a gap between the reliability
expectations of manufacturers and customers. The current MTBF model isn't
accounting accurately for how drives are handled in the field and how they
function inside systems." ...read
the article, storage
Google Reports on HDD Reliability
February 20, 2007 - Researchers at Google recently published a paper at
the recent Usenix conference about hard disk reliability and failure
prediction - based on their own experiences as a large user of hard disk drives.
The fascinating paper describes how Google measured available metrics
and status reports generated by the drives themselves and how this correlated
with actual failure patterns. One of the key insights in the report is Google's
view of how useful SMART
parameters were for predicting failures.
"Our results are
surprising, if not somewhat disappointing. Out of all failed drives, over 56% of
them have no count in any of the four strong SMART signals, namely scan errors,
reallocation count, offline reallocation, and probational count. In other words,
models based only on those signals can never predict more than half of the
failed drives... ...even when we add all remaining SMART parameters (except
temperature) we still find that over 36% of all failed drives had zero counts on
all variables." ...read
the article, Hard disk
PS - the measured data on the percentage of disks which
fail each year over a 5 year cycle under various conditions is essential
reading for disk to disk backup users.
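Working the quoted percentages through shows why the result disappointed Google's researchers - a failure predictor built on SMART counts has a hard ceiling on recall:

```python
# From the figures quoted above (Google's Usenix paper):
no_strong_smart_signal = 0.56   # failed drives with none of the 4 strong signals
no_smart_signal_at_all = 0.36   # failed drives with zero counts on all variables

# Best possible fraction of failures such a predictor could ever flag:
print(round((1 - no_strong_smart_signal) * 100))  # 44%
print(round((1 - no_smart_signal_at_all) * 100))  # 64%
```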
Agere Halves Power Consumption for Mobile HDD Interface
ALLENTOWN, Pa - February 6,
2007 - Agere Systems has begun shipping a new fully functional
90-nanometer TrueStore read channel.
The TrueStore RC1300 uses half the current required by the previous
generation of read channel chip technology in this market segment and is 25%
faster. It targets the 1.8-inch and smaller
HDD form factor that
provides critical data storage of 20 to 160 gigabytes in a wide variety of
applications. ...Agere Systems profile
STORAGEsearch.com Launches a New Strategic Directory - Storage Reliability
Editor:- June 20, 2006 - STORAGEsearch.com
today launched a new directory dedicated to the subject of "Storage
Reliability". Reliability was named as one of the 3 most important
future trends in storage in my
state of the storage market
article published last year. In that article I also predicted that
uncorrectable failures in storage systems (due to embedded design assumptions
made in earlier generations) could, if not dealt with by drive and interface
designers, pose a more serious threat to enterprise computer systems
than the Y2K bug in the late 1990s.
In addition to covering
news about what the industry is doing to improve reliability in future drives,
media and interfaces, STORAGEsearch has invited CTOs and technical directors of
leading companies to write special articles about this subject - which will
appear in the months ahead.
When most people think about storage
reliability - they think about MTBF and thermal factors. If an
individual drive isn't reliable enough - wrap it in a
RAID. If heat reduces the
life of the disks - then cool them with more fans. If a memory system or
interface is critical to an application - cocoon it with error detection and
correction codes. Those are approaches which have worked adequately for the
past few decades - but they are not good enough any more. Demands
for storage reliability are growing. Non stop applications need data that can
be trusted to be available on demand. Regulatory compliance
dictates that data should be readable not just years - but possibly decades
after it was created. Meanwhile storage components, interfaces and systems are
increasing in speed and capacity - while many of them are using error correction
thinking that comes from earlier generations when data sets were smaller. As
storage gets bigger - users face the risk of having uncorrectable errors in the
heartland of their decision making data. That's why - all over the industry -
manufacturers are starting to talk about new storage reliability initiatives. There's
also the risk that new storage technologies which get rushed to serve the needs
of the consumer market - have not in fact been tested long enough to guarantee
that they will not fail or start to corrupt data in the timeframe that
enterprise customers care about.
Wrapping arrays of consumer
disks - based on media technology that has only been proven for 2
years - in a big "enterprise" box cannot
guarantee that the data will still be readable in 5 years time. This is not a
worry for consumers. They'll throw a failed disk away or buy a new one. But if
your enterprise owns thousands of these disks (hidden by virtualization) it
could be a big headache when the crumbly nature of the storage defects start to
hit the news. This is
another of the many concerns we'll be covering in these pages.
Storage media have
failed in the past and been withdrawn because they didn't meet their original
extrapolated lifetimes. Lessons are not always learned from errors in the past
- but can be forgotten and reoccur.
Storage reliability is changing.
If you are interested - I hope you'll stay tuned to the new storage reliability
channel here on the mouse site - as we report on these exciting developments in
the months ahead.
Why Solaris will Get 128 Bit Addresses
May 1, 2006 - an article today in InformationWeek.com discusses the
Zettabyte File System - a new 128 bit addressing scheme for Solaris. The
article says that apart from the obvious advantage of being able to access more
storage, Sun is apparently thinking about building in error correction into the
new address scheme.
In my state of the storage market article published last year in STORAGEsearch.com
- storage reliability and failures were cited as one of the most important
long term problems which oems and users will have to deal with. The
cause of the problem is that storage interfaces as well as modules and
components (like disks,
optical drives etc) use
error correcting schemes which were designed for the much smaller and slower
architectures of the past. As storage systems expand - new algorithms and
correction schemes will be needed to guarantee that users don't get affected by
data failures which are uncorrectable using today's products and architectures.
It's good to see that Sun is working proactively
on one aspect of the problem. I've talked to many storage manufacturers about
the upcoming reliability problem - which could be more serious than the Y2K
threat - if not dealt with in advance. Sun is highly sensitive to data
integrity issues - problems with its own SPARC server cache memory design back in 2001 were cited at
the time by many large users as reasons for considering a switch to Intel and
PowerPC based systems.
SPARC Product Directory
Hard Disk Sector Size May Change
Calif - March 23, 2006 - IDEMA today announced the results of an
industry committee assembled to identify a new and longer sector standard for
future magnetic hard disk drives.
This Committee recommended
replacing the 30 year-old standard of 512 bytes with sectors having the ability to store
4,096 bytes. Dr. Ed Grochowski, executive director of IDEMA US, reported that
adopting a 4K byte sector length facilitates further increases in data density
for hard drives which will increase storage capacity for users while continuing
to reduce cost per gigabyte.
"Increasing areal density of newer magnetic
hard disk drives
requires a more robust error correction code, and this can be more efficiently
applied to 4,096 byte sector lengths," explained Dr. Martin Hassner from
Hitachi GST and IDEMA Committee member.
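The efficiency argument can be sketched with round numbers (the per-sector overheads below are assumptions for illustration, not IDEMA figures): eight 512-byte sectors each pay the gap/sync/ECC overhead separately, while one 4,096-byte sector pays it once.

```python
def format_efficiency(data_bytes, overhead_bytes):
    # fraction of the recorded bytes that carry user data
    return data_bytes / (data_bytes + overhead_bytes)

eff_512 = format_efficiency(512, 65)    # assumed ~65B overhead per 512B sector
eff_4k = format_efficiency(4096, 115)   # assumed ~115B for the longer codeword
print(round(eff_512 * 100, 1))  # ~88.7%
print(round(eff_4k * 100, 1))   # ~97.3%
```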
Whitepaper Measures ROI of Disk Defragmentation
CA - January 24, 2006 - Diskeeper recently sponsored IDC to
write a whitepaper called - "Defragmentation's Hidden Value for the
Enterprise". This measured the ROI of defragmentation
software in real customer sites. During the reliability test, the servers that
were defragmenting files automatically had a higher uptime (5 to 10%) than the
servers that didn't have defragmentation software automatically running.
the article (pdf), ...Diskeeper
ProStor Systems Unveils New Backup Technology
CO - November 2, 2005 - ProStor Systems made its public debut
today by introducing the firm's RDX removable disk backup technology.
The RDX removable cartridge uses the same 2.5" hard disk media platters
found in notebook computers and provides initial capacity up to 400GB
(compressed). That will increase in line with conventional hard disk
technology. But the difference is that RDX uses a new patent-pending
error correcting format, which makes the data 1,000 times more
recoverable than in a standard hard drive. ProStor says this means that
RDX-stored data will be readable even after the cartridge has been archived
and non-operating more than a decade. ...ProStor Systems profile,
Disk to disk backup,
Editor's comments:- the reliability of embedded storage modules and components such
as hard disks, tape drives and
optical disks will become
an important issue for users in the
next 7 years. Today's
products rely on inbuilt error correction algorithms which were designed over a
decade ago - when storage capacities were much smaller. All those "ten to
the minus something" numbers which you see quoted for error rates sound
good - except that when your enterprise is managing Petabytes of data, at ever
higher connection speeds, then you will start seeing uncorrectable data
failures occurring every year - inside the storage, and beyond the scope
of your RAID or other
protection scheme to correct. ProStor is one of a new generation of storage
manufacturers addressing this problem, and we'll soon publish a directory
section dedicated to storage reliability issues such as this.
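A rough sketch of why those error-rate exponents stop being comforting at Petabyte scale (the UBER figure below is an assumed, era-typical value, not ProStor's specification):

```python
def expected_unrecoverable_errors(bytes_read, uber=1e-14):
    # expected unrecoverable read errors = bits read x unrecoverable bit error rate
    # (uber=1e-14 is an assumed figure for consumer-class drives of this era)
    return bytes_read * 8 * uber

petabyte = 1e15
print(expected_unrecoverable_errors(petabyte))  # ~80 per Petabyte read
```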