
storage reliability - news & white papers

Improving 3D NAND Flash Memory Lifetime - new paper

Editor:- August 28, 2018 - A new twist using RAID ideas in SSD controllers has surfaced recently in a research paper - Improving 3D NAND Flash Memory Lifetime by Tolerating Early Retention Loss and Process Variation (pdf) by Yixin Luo and Saugata Ghose (Carnegie Mellon University), Yu Cai (SK Hynix), Erich F. Haratsch (Seagate Technology) and Onur Mutlu (ETH Zürich) - which was presented at the SIGMETRICS conference in June 2018.

The authors say that in tall 3D nand (30 layers and upwards) the raw error rate in blocks in the middle layers is significantly worse (6x) compared to the top layer. To enable more reliable and faster SSDs using 3D nand for enterprise applications they propose a new type of RAID - LI-RAID. ... read the article (pdf)
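The grouping idea can be sketched in a few lines of Python (a simplified illustration of layer interleaving, not the paper's actual LI-RAID algorithm - the chip and layer counts here are made up):

```python
# Illustrative sketch of layer-interleaved RAID grouping.
# Instead of building a RAID group from pages in the SAME layer across
# several chips (so that a weak middle layer concentrates its errors in
# one group), each group takes its pages from DIFFERENT layers - which
# spreads the unreliable middle layers across all groups.

def li_raid_groups(num_chips: int, num_layers: int):
    """Return RAID groups as lists of (chip, layer) pairs, with every
    member of a group on a different layer (round-robin interleave)."""
    assert num_chips == num_layers  # simplest round-robin case
    groups = []
    for g in range(num_layers):
        # chip i contributes its layer (g + i) mod num_layers
        group = [(chip, (g + chip) % num_layers) for chip in range(num_chips)]
        groups.append(group)
    return groups

groups = li_raid_groups(num_chips=4, num_layers=4)
for group in groups:
    layers = [layer for _, layer in group]
    assert len(set(layers)) == len(layers)  # no two pages share a layer
```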

wrapping up 40 years of memories about endurance

Editor:- July 20, 2018 - wrapping up SSD endurance (selective memories from 40 years of thinking about endurance) is my new blog on the home page of

This may be my last article on endurance. No more. Ever. I promise. (I may have said that before but this time I really mean it.) the article

reliability aspects of 100TB SAS SSDs

Editor:- March 19, 2018 - Nimbus Data Systems has made another significant advance in the development of multipetabyte energy-efficient solid state storage racks with the announcement today that it's sampling 100TB 3.5 inch SAS SSDs with unlimited DWPD.

The ExaDrive DC100 has balanced performance - 100K IOPS R/W and up to 500 MBps throughput - and consumes 0.1 watts/TB - which Nimbus says is 85% lower than competing drives used in similar array applications - such as Micron's 7.68TB 5100 SATA SSD.

ExaDrive technology and reliability?

I asked Thomas Isakovich, CEO and founder of Nimbus some questions about the new ExaDrive technology.

Editor - The 50TB models announced by your flash partners last year used planar 2D flash. Does the 100TB family use 3D flash? Knowing the answer one way or another will enable some people to make their own judgements about incremental upsides in the next year or so's roadmap. And also form a view about specification stability and reliability.

Tom Isakovich - Yes 3D flash for the ExaDrive DC.

Editor - The issue of cost per drive is an interesting one too. But the companies you were working with last year have experience in processes which can produce a high confidence reliable SSD for high value, mission critical markets (like military) in which the reliability of every single SSD is critical. So my guess would be that for integrators who have a serious interest in the ExaDrive DC100 – they will be looking at the cost of drive failures on a system population basis – and the value of fewer drives and less heat per TB is more important than the headline cost of a single failed drive.

Tom Isakovich - I have an interesting subject for you to consider on the topic of "reliability". Namely, is an SSD any less reliable than an all-flash array? I contend that it is not. In fact, an SSD is more reliable.
  • Our ExaDrive DC has flash redundancy internally, with the ability to lose about 8% of flash dies without any downtime, data loss or capacity reduction. This is analogous to RAID in a traditional all-flash array that protects against media failure. So on the notion of media redundancy, they are equally redundant.
  • The ExaDrive DC has a 2.5 million hour MTBF with no moving parts. That is about 6 times longer than the typical all-flash array (which includes many active and moving parts). All-flash arrays have integrated power supplies, active controllers, fans, and other components prone to failure.
I'm thinking more on this. But empirically, an SSD is more reliable than a System. The user can achieve desired redundancy in their overall architecture, taking this into consideration.
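As a rough sanity check on what such MTBF claims imply, the standard constant failure rate model converts MTBF into an annualized failure rate (a textbook conversion, not a Nimbus calculation):

```python
import math

def afr(mtbf_hours: float, hours_per_year: float = 8766.0) -> float:
    """Annualized failure rate under an exponential (constant rate)
    failure model: AFR = 1 - exp(-hours_per_year / MTBF)."""
    return 1.0 - math.exp(-hours_per_year / mtbf_hours)

ssd_afr = afr(2_500_000)        # the quoted 2.5 million hour MTBF
array_afr = afr(2_500_000 / 6)  # "about 6 times longer" than a typical array
print(f"SSD AFR:   {ssd_afr:.2%}")    # roughly 0.35% per year
print(f"array AFR: {array_afr:.2%}")  # roughly 2.1% per year
```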

See also:- rackmount SSDs

a Survey of Techniques for Architecting Hybrid Flash based SSDs

Editor:- December 20, 2017 - This month I received a copy of a new (to me) paper - a Survey of Techniques for Architecting SLC/MLC/TLC Hybrid Flash Memory based SSDs (27 pages pdf) - from Sparsh Mittal, Assistant Professor at Indian Institute of Technology Hyderabad who is among the co-authors of this significant reference document.

Although the primary purpose of the paper is to record the comparative design tradeoffs between different memory types in the same SSD it also looks at the tactical use of virtualized pSLC too.

There are over 60 cited references to external papers - so it's a rich source of ideas for SSD designers.

Here's just a single sentence:- "It is noteworthy that the technique of Jimenez et al. [33] converts MLC blocks into SLC when they exhaust their lifetime to benefit from the high endurance of SLC blocks. By comparison, other soft partitioning techniques perform mode-transition." the article (27 pages pdf)

the failure to make enough working memory chips

Editor:- September 7, 2017 - The biggest failure in the SSD market in recent times was the collective failure of all the leading memory companies to manage the introduction of their new 3D technologies in a way which aligned with past roadmap predictions and expectations. I discussed the causes and consequences in 2 articles on

the reliability difference in solo industrial SSDs

Editor:- July 14, 2017 - Reliability is one of the factors which got me interested in SSDs in the late 1980s; the other was raw speed - sometimes, but not always, both in the same project. And different ways of looking at reliability is one of the recurring themes which I notice in stories about the industrial SSD market.

Earlier this year I had noticed a statement in one of the customer case studies on the web site of Cactus Technologies which talked about having delivered 200,000 high reliability flash storage cards to a customer "without a reported failure". And from time to time I wondered what that really meant.

So this week I asked Steve Larrivee, VP Sales & Marketing at Cactus what was the time period behind the story?

Steve said - "The 200,000 cards were delivered over a 2 year period over 5 years ago without one reported failure."

Editor's comments:- I thought this was an impressive retrospective story and for customers with applications where the reliability of each solo SSD is critical it's a more convincing positioning statement about the design and manufacturing capabilities of the SSD creator than any forward reaching promises can be.

After our exchange of emails Steve wrote a new blog about this - Would Memory Failure Be Catastrophic to your business? - which included additional anecdotal failure rates for the same application which happened when the customer switched to a lower cost memory SSD design from a competing high quality supplier.

trust and services marketing related to enterprise SSD systems
why was it so hard to compile a simple list of military SSD companies?

hard delays from invisibly fixed soft flash array errors can break enterprise apps - says Enmotus - arguing need for better storage analytics

Editor:- June 15, 2017 - Using SSDs as its prime example - but with a warning shot towards the future adoption of NVDIMMs - a new blog - storage analytics impact performance and scaling - by Jim O'Reilly - on the Enmotus blog site - describes how soft errors can contribute to application failure due to unexpected sluggish response times even when the data is automatically repaired by SSD controllers and when the self-aware status of the SSDs is that they are all working exactly as designed.

That's the needs analysis argument for storage analytics such as the software from Enmotus which supports the company's FuzeDrive Virtual SSD.

Jim says - "Storage analytics gather data on the fly from a wide list of "virtual sensors" and is able to not only build a picture of physical storage devices and connections, but also of the compute instance performances and VLANs in the cluster. This data is continually crunched looking for aberrant behavior." the article

Editor's comments:- in my 2012 article - will SSDs end bottlenecks? - I said "Bottlenecks in the pure SSD datacenter will be much more serious than in the HDD world - because responding slowly will be equivalent to transaction failure."

And in a 2011 article - the new SSD uncertainty principle - I shared the (new to me) wisdom collected by long term reliability studies of enterprise flash done by STEC - that many flavors of flash controller management contained within them the seeds of performance crashes which would only become apparent after years of use as the data integrity algorithms escalated to progressively more retries and stronger ECC to deliver reliable data from wearing out (but still usable) flash.

So I agree with Jim O'Reilly. You do need more sophisticated datasystems analytics than whether or not an SSD has failed.

The variable quality of latency can be a source of incredibly long delays in server DRAM too.

Soft-Error Mitigation for PCM and STT-RAM

Editor:- February 21, 2017 - There's a vast body of knowledge about data integrity issues in nand flash memories. The underlying problems and fixes have been one of the underpinnings of SSD controller design. But what about newer emerging nvms such as PCM and STT-RAM?

You know that memories are real when you can read hard data about what goes wrong - because physics detests a perfect storage device.

A new paper - a Survey of Soft-Error Mitigation Techniques for Non-Volatile Memories (pdf) - by Sparsh Mittal, Assistant Professor at Indian Institute of Technology Hyderabad - describes the nature of soft error problems in these new memory types and shows why system level architectures will be needed to make them usable. Among other things:-
  • scrubbing in MLC PCM would be required in almost every cycle to keep the error rate at an acceptable level
  • read disturbance errors are expected to become the most severe bottleneck in STT-RAM scaling and performance the article (pdf)

Microsemi's rad tolerant FPGAs orbit Jupiter

Editor:- September 20, 2016 - Microsemi today announced that its radiation-tolerant FPGAs are in use on NASA's Juno Spacecraft within the space vehicle's command and control systems, and in various instruments which have now been deployed and are returning scientific data. Juno recently entered Jupiter's orbit after a 5 year journey.

See also:- Juno mission (pdf), data chips in space

relating NVMdurance's machine learning to manual tuning

Editor:- July 29, 2016 - Nearly every SSD in the market today - from the smallest SSD on a chip to the bewildering array of rackmount systems - can be viewed as a choice of how to select and mix the raw ingredients of SSD IP and integrate them into products which (for better or worse) match up to and satisfy user needs. How these decisions are made depends on the DNA of the product marketers, the technology teams, familiarity and ease of access to some technologies rather than others, business pressures and timing, the willingness to take risks, and sometimes - just luck.

But all products - no matter how complex they appear - can be analyzed as a specific set of choices made from the architecture and IP selections which are possible.

In many articles in the past I've shown you how - whether you're looking at the design of SSDs or systems - there are rarely more than 2, 3 or 6 raw available decisions which determine each piece of the jigsaw. And I know from the feedback I get from SSD specifiers and architects that these simple classifications can be useful in helping to compare different products and even in choosing which competitive approaches are similar enough to make comparisons worthwhile.

But when you get down into the details of implementation at each layer in the product design - every one of these dimensional options which go into the permutations blender to shape the total product identity - can itself be complex and multilayered.

Take the example of the raw magic tuning numbers which set the R/W program, erase and threshold voltages, and the shaping and timing parameters inside a flash memory. The question of how much and when has been at the heart of what makes some SSDs better than others ever since flash was first used in SSDs.

Some SSD designers have spent their whole careers measuring and modeling how these choices interact with the flash cell and can be tweaked to improve speed, power consumption and reliability. You can get a flavor of this in my article - adaptive R/W and DSP ECC IP.

In a conversation with NVMdurance's CEO - Pearse Coyle earlier this year (April 2016) almost the first thing I did was try to relate and place the work they were doing within the simple frameworks I'd written about before.

So I asked him how similar it was to something which I wrote a long article about in April 2012 - when SMART announced a range of SandForce driven SSDs which had 5x higher endurance - while using exactly the same industry controller - but using magic tuning numbers which they had learned from analyzing the adaptive settings from their own adaptive controller design.

Pearse said - yes - he knew that work. And what NVMdurance was doing was the same type of thing.

He said that some leading companies which had the flash talent had done similar things in their proprietary SSDs before.

Pearse told me that as the complexity of flash increased - with more layers and TLC - it was becoming harder for designers to manually (or using human expertise) guarantee they were choosing the optimum magic numbers - because there were now so many variables involved.

Pearse said that what was different about NVMdurance was that they were delivering the magic numbers based on characterising a sample of typically 100 devices and then performing machine based simulations to see which numbers would work best - while also using a multi-stage life cycle model - which was designed to use different tuning after a fractional amount of the expected endurance had been used.

As far as he knew from his conversations with memory companies - no-one else had made the same kinds of investments in this machine intensive modeling - and that was the key difference - because NVMdurance had a proven process for delivering good tuning numbers over a variety of memory generations and types.
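The multi-stage life cycle model Pearse describes can be sketched as a simple staged lookup (the parameter names, values and thresholds below are invented for illustration - the real register sets are proprietary):

```python
# Sketch of a multi-stage life cycle tuning model: swap in a different
# set of flash tuning parameters once a fraction of the rated endurance
# has been consumed. All names, values and thresholds here are invented
# for illustration - vendors' actual register sets are confidential.

STAGES = [
    # (fraction of rated P/E cycles used, tuning parameter set)
    (0.00, {"program_voltage": "low",  "program_pulses": 4}),
    (0.50, {"program_voltage": "mid",  "program_pulses": 6}),
    (0.85, {"program_voltage": "high", "program_pulses": 8}),
]

def tuning_for(pe_cycles_used: int, rated_pe_cycles: int) -> dict:
    """Pick the tuning set matching the current wear stage."""
    fraction = pe_cycles_used / rated_pe_cycles
    current = STAGES[0][1]
    for threshold, params in STAGES:
        if fraction >= threshold:
            current = params
    return current

assert tuning_for(100, 3000)["program_voltage"] == "low"   # fresh flash
assert tuning_for(2900, 3000)["program_pulses"] == 8       # worn flash
```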

I hoped at the time that someone would write a paper saying more about it. Tom Coughlin has done that.

Machine learning enables longer life high capacity SSDs (pdf) - published this week - describes the background principles and operation of NVMdurance's pathfinder and plotter software tools and shows you how NVMdurance have tackled this complex tuning problem to deliver a software delivered IP which can give endurance results similar to adaptive R/W controllers but which doesn't need such expensive processors or such complex run-time firmware. the article (pdf)

can memory do more?

Editor:- June 17, 2016 - in a new blog on - I ask - where are we heading with memory intensive systems and software?

All the marketing noise coming from the DIMM wars market (flash as RAM and Optane etc) obscures some important underlying strategic and philosophical questions about the future of SSD.

When all storage is memory - are there still design techniques which can push the boundaries of what we assume memory can do?

Can we think of software as a heat pump to manage the entropy of memory arrays? (Nature of the memory - not just the heat of its data.)

Should we be asking more from memory systems? the blog

It's not worth paying more for SLC reliability in PCIe SSDs says Google field study

Editor:- February 26, 2016 - A 6 year study of PCIe SSDs used by Google (spanning millions of drive days and chips from 4 different flash vendors) concluded that SLC drives were not more reliable than MLC.

An important conclusion re RAS is the importance of being able to map out bad chips within the SSD architecture. This is because somewhere between 2% and 7% of enterprise PCIe SSDs (depending on where they were used) developed at least one bad chip during the first 4 years in the field - which without such remapping would necessitate replacing the failed SSD.

The source is - Flash Reliability in Production: the Expected and the Unexpected (pdf) - by Bianca Schroeder (University of Toronto) and Raghav Lagisetty and Arif Merchant (Google).

This is just one of a set of papers which was presented February 22-25, 2016 at the 14th USENIX Conference on File and Storage Technologies.

Editor's comments:- For more like this see the news archive - June 2015 which had a story about a large scale study of PCIe SSD failures within Facebook.

Mirabilis discusses role of deployment level simulation to optimize reliability delivered by SSD controller design tweaks

Editor:- August 16, 2015 - "A diligent system designer can extend the life of an SSD by up to 60% by proper control of over-provisioning, thus reducing TCO" says Deepak Shankar, Mirabilis Design in his recent paper Extending the Lifetime of SSD Controllers (pdf) which discusses the role of application and deployment level simulations to explore the impact of changing brews in controller architectural cocktails.

See also:- SSD overprovisioning articles 2003 to 2015

bath tub curve is not the most useful way of thinking about PCIe SSD failures - according to a large scale study within Facebook

Editor:- June 15, 2015 - A recently published research study - Large-Scale Study of Flash Memory Failures in the Field (pdf) - which analyzed failure rates of PCIe SSDs used in Facebook's infrastructure over a 4 year period - yields some very useful insights into the user experience of large populations of enterprise flash.

Among the many findings:-
  • Read disturbance errors - seem to be very well managed in the enterprise SSDs studied.

    The authors said they "did not observe a statistically significant difference in the failure rate between SSDs that have read the most amount of data versus those that have read the least amount of data."
  • Higher operational temperatures mostly led to increased failure rates, but the effect was more pronounced for SSDs which didn't use aggressive throttling techniques - which prevent runaway temperatures by throttling back write performance.
  • More data written by the hosts to the SSDs over time - mostly resulted in more failures - but the authors noted that in some of the platforms studied - more data written resulted in lower failure rates.

    This was attributed to the fact some SSD software implementations work better at reducing write amplification when they are exposed to more workload patterns.
  • Unlike the classic bathtub curve failure model which applies to hard drives - SSDs can be characterized as having an early warning phase - which comes before an early failure weed-out phase of the worst drives in the population, and which precedes the onset of predicted endurance based wearout.

    In this aspect - a small percentage of rogue SSDs account for a disproportionately high percentage of the total data errors in the population.
The report contains plenty of raw data and graphs which can be a valuable resource for SSD designers and software writers to help them understand how they can tailor their efforts towards achieving more reliable operation. the article (pdf)

See also:- SSD Reliability

HDD failure rates analyzed by models

Editor:- May 27, 2015 - The reliability of hard drives in a cloud related business (online backup) is revealed in a new report - Hard Drive Reliability Stats by Backblaze which includes results for over 42,000 drives analyzed across 21 drive models.

The failure distribution in the recent quarter is model and age specific rather than manufacturer specific - which is to say that you can't say that Seagate is always better or worse than Western Digital. The table also gives you insights into drive improvements for this type of application. Failure rates in the quarter were:-
  • up to 1 year old - worst model - 13%
  • 2-3 years - worst model - 27%
  • 3-4 years - worst model - 3%
  • 5 years - worst model - 32%
The data seems to fit in with the bath tub curve model - with high infant mortality, high failures at the end and best reliability in the in between periods. the article

high availability enterprise SSD arrays

Editor:- January 26, 2012 - due to the growing number of oems in the high availability rackmount SSD market today published a new directory focusing on HA enterprise SSD arrays.

The new directory will make it easier for users to locate specialist HA SSD vendors, related news and articles.

Pushing data reliability up hard drive hill

Editor:- July 4, 2011 - Why didn't hard drives get more reliable? Enterprise users are still replacing hard drives according to cycles that haven't changed much since RAID became common in the 1990s. So why didn't HDD makers do something to make their drives better?

Error correction code inventor Phil White - founder of ECC Technologies has recently published a rant / blog in which he describes the 25 years of rejections he's had from leading HDD makers - and the reasons they said they didn't want to use his patented algorithm - which he says could increase data integrity and the life of hard drives (and maybe SSDs too). It makes interesting reading for any other wannabe inventors out there too. Phil White's article

But I think another reason for past rejections might simply have been market economics.

The capacity versus the cost of HDDs has improved so much throughout that period - and at the same time data capacity needs have grown - that maybe the user value proposition didn't make sense.

If you (RAID user) find that all your 5 year old drives are still working (instead of being replaced) - how much is that really worth? By now those 5 year old drives might only represent 3% to 10% of the new storage capacity you need anyway. (The reliability value proposition is different outside the service-engineer-frequented zone - but I don't want to get side-tracked into SSD market models here.)

Looking ahead at the future of the HDD market my own view is that whatever the industry does with respect to reliability won't tip the balance against SSDs in the enterprise.

The best bet for the future of hard drive makers is in consumer products where fashion ranks higher up the reason to buy list than longevity. Most people I know replace their notebook pcs, tvs and phones not because the old ones have stopped working - but because the new ones have lifestyle features which make them more desirable.

optimizing SSD architecture to cope with flash plane errors

Editor:- May 26, 2011 - a new slant on SSD reliability architectures is revealed today by Texas Memory Systems who explained how their patented Variable Stripe RAID technology is used in their recently launched PCIe SSD card - the RamSan-70.

TMS does a 1 month burn-in of flash memory prior to shipment. (One of the reasons cited for its use of SLC rather than MLC BTW.) Through its QA processes the company has acquired real-world failure data for several generations of flash memory and used this to model and characterize the failure modes which occur in high IOPS SSDs.

Most enterprise SSDs use a simple type of classic RAID which groups flash media into "stripes" containing equal numbers of chips. RAID technology can reconstruct data from a failed Flash chip. Typically, when a chip or part of a chip fails, the RAID algorithm uses a spare chip as a virtual replacement for the broken chip. But once the SSD is out of spare chips, it needs to be replaced.

VSR technology allows the number of chips to vary among stripes, so bad chips can simply be bypassed using a smaller stripe size. Additionally, VSR provides greater stripe size granularity, so a stripe could exclude a small part of a chip rather than having to exclude an entire chip if only part of it failed - "plane error". With VSR technology, TMS says its SSD products will continue operating longer in the installed base.

Dan Scheel, President of Texas Memory Systems explained why their technology increases reliability.

"...Consider a hypothetical SSD made up of 25 individual flash chips. If a plane failure occurs that disables 1/8 of one chip, a traditional RAID system would remove a full 4% of the raw Flash capacity. TMS VSR technology bypasses the failure and only reduces the raw flash capacity by 0.5%, an 8x improvement. TMS tests show that plane failures are the 2nd most common kind of flash device failures, so it is very important to be able to handle them without wasting working flash."
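Dan Scheel's arithmetic is easy to reproduce (treating a plane as 1/8 of a chip, as in his example):

```python
# Reproducing the capacity-loss comparison from the quote above:
# a 25-chip SSD suffers a plane failure (one plane = 1/8 of a chip).

chips = 25
plane_fraction = 1 / 8

# Traditional RAID: the whole chip containing the bad plane is mapped out.
traditional_loss = 1 / chips       # 4.0% of raw flash capacity

# Variable Stripe RAID: only the failed plane is bypassed.
vsr_loss = plane_fraction / chips  # 0.5% of raw flash capacity

print(f"traditional RAID loss: {traditional_loss:.1%}")              # 4.0%
print(f"VSR loss:              {vsr_loss:.1%}")                      # 0.5%
print(f"improvement:           {traditional_loss / vsr_loss:.0f}x")  # 8x
```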

Editor's comments:- by wasting less capacity than simpler RAID solutions - more usable capacity remains available for traditional bad block management. This extra capacity comes from the over-provisioning budget - which varies according to each SSD design (as discussed in my recent flash iceberg syndrome article) - and is 30% for TMS.

what happens in SSDs when power goes down? - and why you should care

Editor:- February 24, 2011 - today published a new article - SSD power is going down! - which surveys power down management design factors in SSDs.

Why should you care what happens in an SSD when the power goes down?

This important design feature - which barely rates a mention in most SSD datasheets and press releases - is really important in determining SSD data integrity and operational reliability. This article will help you understand why some SSDs which work perfectly well in one type of application might fail in others... even when the changes in the operational environment appear to be negligible. If you thought endurance was the end of the SSD reliability story - think again. the article

Business opportunities from Intel's imperfect bridge chips

Editor:- February 9, 2011 - Intel Knowingly Sells Faulty Chipsets. are they Crazy? is a new article on which discusses how Intel is dealing with the issue of a bridge chip with known defects in some SATA ports.

I rarely read that publication because my interests are enterprise storage and SSDs - but the author Keir Thomas had linked to from another recent article he wrote - Seagate: SSDs are Doomed (at Least for Now) - which showed up in my web stats.

When I started my storage reliability directory in 2006 - I knew that large storage vendors would ship flaky SSDs and hard drives - but I assumed that would be due to the unwitting and creeping use of inappropriate design and testing methodologies - rather than deliberate business decisions.

Another characteristic of this Intel chip is that if oems populate all the RAM slots which it "supports" - the speed drops down to unattractive levels.

But that's not bad news for everyone. Adrian Proctor, VP of Marketing at Viking told me last month it means there's a growing population of DIMM slots on motherboards which can't be used for RAM - but could be used instead to save space and power by installing their SATADIMM SSDs to replace HDDs as boot drives. Other companies make 1 inch and smaller SSDs too.

comparing SSD and HDD failure rates in retail

Editor:- December 10, 2010 - the failure rates for SSDs and hard drives in the retail channel are compared in a recent article which is part of a regular feature on the French website HARDWARE.FR. Because many consumer SSD designs have been flaky - the apparent similarities suggested in the French report should not be taken to be typical of SSDs as a whole.

On the contrary - a much bigger difference in field reliability is suggested by the business models of industrial SSD makers and enterprise server SSD makers for whom better reliability is part of the value proposition - and by anecdotal reports which I've had from many data recovery companies.

10,000x more reliable than RAID?

Editor:- August 26, 2010 - Amplidata claims that its BitSpread technology is 10,000x more reliable than current RAID based technologies and requires 3x less storage.

Is another new way of fixing reliability problems in hard disk arrays worth the effort just as we approach the end of the hard disk market's life? - I doubt it. See why in - this way to the petabyte SSD.

how to make "SSD reliability" believable - marketing case study

Editor:- July 29, 2010 - today published a new article - the cultivation and nurturing of "reliability" in a 2.5" SSD brand.

Reliability is an important factor in many applications which use SSDs. But can you trust an SSD brand just because it claims to be reliable?

As we've seen in recent years - in the rush for the SSD market bubble - many design teams which previously had little or no experience of SSDs were tasked with designing such products - and the result has been successive waves of flaky SSDs and SSDs whose specifications couldn't be relied on to remain stable and in many products quickly degraded in customer sites.

As part of an education series for SSD product marketers - this new case study describes how one company - which didn't have the conventional background to start off with - managed to equate their brand of SSD with reliability in the minds of designers in the embedded systems market. the article

Anobit aims at SandForce SSD SoC slots

Editor:- June 15, 2010 - Anobit announced it is sampling SSDs based on its patented Memory Signal Processing technology which provides a 20x improvement in operational life for MLC SSDs in high IOPS server environments.

Based on proprietary algorithms that compensate for the physical limitations of NAND flash, Anobit's MSP technology extends standard MLC endurance from approximately 3K read/write cycles to over 50K cycles - to make MLC technology suitable for high-duty cycle applications. This guarantees drive write endurance of 10 full disk writes per day, for 5 years, or 7,300TBs for a 400GB drive, with fully random data (worst-case conditions).
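The quoted endurance figures are internally consistent, as a quick conversion shows (assuming decimal units, 1TB = 1,000GB):

```python
def total_writes_tb(capacity_gb: float, dwpd: float, years: float) -> float:
    """Total data written over the endurance period, in TB, from
    drive writes per day (DWPD). Uses 1TB = 1,000GB, matching the
    press-release arithmetic."""
    return capacity_gb * dwpd * 365 * years / 1000

tbw = total_writes_tb(capacity_gb=400, dwpd=10, years=5)
print(f"{tbw:,.0f} TB")  # 7,300 TB - matching Anobit's quoted figure
```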

First-generation Anobit Genesis SSDs deliver 20,000 IOPS random write and 30,000 IOPS random read, with 180MB/s sustained write and 220MB/s sustained read.

Anobit says that some of the world's largest NAND manufacturers, consumer electronics vendors and storage solution providers currently utilize Anobit's MSP technology in their products.

"For too long, the high prices of SLC SSDs and concerns about MLC SSD endurance have slowed the adoption of flash memory storage in the enterprise. Anobit Genesis SSDs effectively neutralize both of these concerns," said Prof. Ehud Weinstein, Anobit CEO. "By delivering true enterprise-class SSD reliability at affordable MLC SSD prices, Anobit Genesis SSDs unlock the full promise of solid-state enterprise storage."

Editor's comments:- superficially the endurance delivered by Anobit's SSD controller is better than that obtainable from SandForce - whereas the performance lead is the other way around. For most oems what will be more important is that they do not need to be locked into a single technology supplier to get adequate metrics for their MLC SSD product lines.

flash SSD integrity architectures for space-craft

Editor:- April 13, 2010 - for those interested in flash SSD data integrity issues - Phil White, President of ECC Technologies has released a white paper - NAND Flash Memories for Spacecraft (doc).

Phil has been working with ECC for almost 37 years and his company is developing future ECC designs to allow systems architects to develop NAND flash memories that are highly reliable and fault-tolerant even if the NAND flash chips themselves are not so reliable.

NASA is using ECC Tek's designs in multiple missions. 2 of the designs are in space at the present time and are working perfectly. Phil White recently wrote a document for NASA and JPL which outlines how to design NAND Flash memories for spacecraft. The 22 page "preview" document excludes confidential data but gives a taste of the technology available for licensing. the article

XLC promises "enterprise" hybrid x4 SSDs

Editor:- April 1, 2010 - XLC Disk announced details of a paper it will discuss later this month at the NV Memories Workshop (UC San Diego) called - "Paramagnetic Effects on Trapped Charge Diffusion with Applications for x4 Data Integrity."

The company says its findings could have applications in the enterprise storage market by solving the data integrity problems in x4 MLC SSDs within a new class of hybrid storage drives. more

New Integrity Tool for Old Tape Archives

Editor:- January 18, 2010 - Crossroads Systems today announced details of ArchiveVerify - a new monitoring option for its ReadVerify Appliance that safeguards the future readability of data backed up on tape.

"In our experience, the Achilles' heel of a data recovery strategy is often the uncertainty of the data's readability, and this single point of failure can render the entire restore process useless," adds Bernd Krieger, Managing Director, at Crossroads Europe.

Editor's comments:- Crossroads was originally a specialist in the SAN router business. In recent years it has done a lot of work in the area of storage reliability. I've read lots of their whitepapers which describe their research and products addressing data integrity. Although there has been a historic trend for users to migrate away from tape to disk backup - many super users of huge tape libraries (with the biggest archives) will be the last to migrate away - due to logistics and cost. It's those kind of users who can benefit most from automated tools or services which increase the data integrity they achieve and cut down media waste and unrecoverable events.

New article - Data Integrity Challenges in flash SSD Design

Editor:- October 12, 2009 - today we published a new article called - Data Integrity Challenges in flash SSD Design - written by Kent Smith, Senior Director of Product Marketing at SandForce.

Since bursting onto the SSD scene in April 2009, SandForce has achieved remarkably high reader popularity. How did a company whose business is designing SSD controllers achieve this? - especially when the direct market for its products today numbers less than 1,000 oems.

The answer is that if you want to know what the future of 2.5" enterprise SATA SSDs might look like - you have to look at the leading technology cores that will affect this market. Even if you're not planning to use SandForce based products yourself - you can't afford to ignore them - because they are setting the agenda in this market.

Reliability is the next new thing for SSD designers and users to start worrying about. A common theme you will hear from all fast SSD companies is that the faster you make an SSD go - the more effort you have to put into understanding and engineering data integrity to eliminate the risk of "silent errors." the article

Real World Reliability in High Performance Storage

Editor:- August 20, 2009 - Density Dynamics published a whitepaper called - Real World Reliability in High Performance Storage (pdf).

It compares real world failure rates for HDDs and flash SSDs with predicted MTBF and endurance data and suggests that the big discrepancies reported by users are due to the nature of their workloads. In this respect it suggests RAM SSDs are better in heavy IOPS apps - even taking into account the MTBFs of batteries and UPS-like components.

It also cites my own article RAM Cache Ratios in flash SSDs.

Why Consumers Can Expect More Flaky Flash SSDs!

Editor:- August 10, 2009 - a new article published today explains why the consumer flash SSD quality problem is not going to get better any time soon.

You know what I mean. Product recalls, firmware upgrades, performance downgrades and bad behavior which users did not anticipate from reading glowing magazine product reviews. And that's if they can get hold of the new products in the first place.

We predicted this unreliability scenario many years ago. And you have to get used to it. The new article explains why it's happening and gives some suggested workarounds for navigating in a world of imperfect flash SSD product marketing. the article

Ramtron's F-RAM Casualty of Auto Market Crash

Editor:- May 7, 2009 - Ramtron said its revenue declined 26% in the 1st quarter of 2009 compared to the year ago period.

A sharp decline in orders from the automotive market was cited as a principal cause.

Ramtron also announced an update on a legal suit related to in-field failures of one of its F-RAM memory products in an unspecified application. (In July 2008 Ramtron confirmed that specific batches of product had failed due to manufacturing process defects in one of its partners fabs.)

Ramtron also announced today that, over the next 2 years, it will transition the manufacturing of products that are currently being built at Fujitsu's chip foundry located in Iwate, Japan to its foundry at Texas Instruments in Dallas, Texas and to its newest foundry at IBM Corp in Essex Junction, Vermont.

Why You Need Better ECC Inside the SSD

Editor:- April 16, 2009 - this week SandForce published an article on the subject of effective error correction in flash SSDs.

I like it because it resonates well with the thinking that led me to publish this reliability page 3 years ago.

At that time - I was concerned with the theoretical inadequacy of error correction used inside hard drives. (Something which has since been confirmed in practice and reported in some of the papers cited at the top of this page.)

SandForce's short article shows you the consequences - in terms of uncorrectable errors - if you use "industry standard" strength ECC. And that's part of the sales pitch for the 10-to-the-minus-something-better error protection in their new SSD controller.

How Good SSD Controllers Manage Flash Data Integrity

Editor:- April 3, 2009 - SNIA has published a new white paper - "NAND Flash Solid State Storage for the Enterprise - an in-depth Look at Reliability." (pdf)

It's co-authored by:- Jonathan Thatcher Fusion-io, Tom Coughlin Coughlin Associates, Jim Handy Objective Analysis and Neal Ekker Texas Memory Systems.

The article contains the best integrated explanation I've seen of the design trade-offs for error correction schemes and how they affect bit error rates compared to the raw uncorrected results. It goes on to explain the importance of the SSD controller and memory architecture (dispersing data among many chips) and how these can improve data integrity by managing read disturb errors. It also discusses wear-leveling and write amplification which have been well covered elsewhere. the article

See also:- SSD Reliability - Understanding Data Failure Modes in Large Solid State Storage Arrays

SSD Bookmarks from Texas Memory Systems

Editor:- March 16, 2009 - Texas Memory Systems' President, Woody Hutsell, shares his SSD Bookmarks with our readers.

Those who know the SSD industry well, mostly think of TMS as a company which makes very fast SSDs for accelerating SAN resident applications. But in the many discussions I've had with Woody Hutsell during the past decade - "reliability" has also been a frequent topic in our conversations.

That's because when you manufacture products which pack more memory chips than anyone else has ever put into a single box - all those "10 to the minus something" numbers which relate physics to semiconductor memory effects - add up to design problems which are far from theoretical. TMS has been engineering solid state storage systems for 30 years. So I was not surprised to see an in depth paper about reliability being one of the articles in this list of bookmarks.

New Tool Acts as Bouncer for Up Market Tape Joints

Boulder, Colo. - February 3, 2009 - Spectra Logic has extended its Media Lifecycle Management technology outside the library with a new reader - now shipping.

The MLM Reader (approx $2,500) is a portable device that allows customers to check tape health on any computer through USB, without loading the tape into a library, and is designed to proactively identify faulty tape media before it is required for a data restore. It tracks over 30 non-volatile statistics about data tapes, such as export details; remaining capacity; encryption information; number of reads and writes; date of last access; born-on date; and cleaning log. ...Spectra Logic profile

SiliconSystems Proposes New Methodology for Realistically Predicting Flash SSD Reliability

Editor:- December 15, 2008 - Gary Drossel, VP Product Planning at SiliconSystems has written a new article - "NAND Evolution and its Effects on SSD Useable Life."

This is probably one of the 3 most significant articles on the subject of flash SSD reliability which have been published in recent years. Starting with a tour of the state of the art in the flash SSD market and technology the paper introduces several new concepts to help systems designers understand why current wear usage models don't give a complete picture.
  • Write amplification - is a measure of the efficiency of the SSD controller. Write amplification defines the number of writes the controller makes to the NAND for every write from the host system.
  • Wear-leveling efficiency - reflects the maximum deviation of the most-worn block to the least worn block over time.
The paper discusses the theoretical expected lifetimes and amplification factors for several applications and concludes that measurement of wear-out in real applications is the best way to understand what is happening. It suggests that systems designers can use the company's SiliconDrive (which includes real-time on-chip endurance monitoring) as an endurance analysis design tool. By simply plugging in SiliconDrive(s) to a new application for a day, week or month - the percentage of wear-out can be measured - and corrective steps taken (in software design or overprovisioning) to correct reliability problems.
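The two metrics above feed directly into a rough lifetime estimate. The following is a generic back-of-envelope sketch of that calculation - not SiliconSystems' proprietary model, and all the numbers in the example are illustrative:

```python
def estimated_lifetime_years(capacity_gb, pe_cycles, host_gb_per_day,
                             write_amplification, wear_leveling_eff=1.0):
    """Rough SSD wear-out estimate: total NAND write capacity divided by
    the amplified host write rate. wear_leveling_eff < 1.0 models uneven
    wear (the most-worn blocks reach their limit early)."""
    total_nand_gb = capacity_gb * pe_cycles * wear_leveling_eff
    nand_gb_per_day = host_gb_per_day * write_amplification
    return total_nand_gb / nand_gb_per_day / 365

# illustrative only: a 64GB MLC drive rated 10K cycles, 40GB/day of host
# writes, write amplification of 2, 80% wear-leveling efficiency
print(round(estimated_lifetime_years(64, 10_000, 40, 2.0, 0.8), 1))  # → 17.5
```

The point of the paper is that the inputs to a model like this (especially write amplification) vary so much between applications that measuring real wear-out beats predicting it.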

What isn't stated in the article - but is a logical inference - is that even if your product design goal is to buy SSDs from other oems - the SiliconDrives can be used in your design process to capture information in a non invasive manner which is difficult or impossible to collect using other instrumentation. the article (pdf), ...SiliconSystems profile, storage reliability

iStor Unlocks High Availability Features in Installed iSCSI ASICs

IRVINE, Calif. - October 7, 2008 - iStor Networks, Inc. has begun shipping a new version of its software, v2.5, as a no-cost upgrade for all its iSCSI storage solutions.

This software will provide dual-controller iS512 systems with the ability to automatically detect malfunctions in the operational controller and to switch to the redundant controller without loss of data, function or performance.

"This new software capitalized on the patented capabilities of iStor's ASIC technology enabling HA capability with no impact upon system performance before, during or after a controller failure." said Jim Wayda, iStor's VP of Software Development. "iStor designed its controllers from the very beginning to deliver advanced functionality such as HA and we are very proud that we have been able to demonstrate the investment protection inherent in iStor's approach of implementation..." ...iStor profile, iSCSI, storage reliability

Can You Trust Your Flash SSD's Specs?

Editor:- July 9, 2008 - today we published a new article which asks - Can you trust your flash SSD specs?

The flash SSD market opens up tremendous opportunities for systems integrators to leverage solid state disk technology. But due to the diversity of products in the market and lack of industry standards - it's got tremendous risks as well.

The product which you carefully qualified may not be identical to the one that's going into your production line for a variety of reasons... the article

Preparing for the Next Phase in the SSD Market Revolution

Editor:- June 25, 2008 - today we called for new papers on the theme - "Understanding Data Failure Modes in Large Solid State Storage Arrays".

Multi-terabyte solid state storage arrays are seeping into the server environment in the same way that RAID systems did back in the early 1990s.

But just as those RAID pioneers learned that there was a lot more to making a reliable disk array than stuffing a bunch of PC hard disks into a box with a fan and a power supply - so too will multi-terabyte SSD users discover that problems which are undetectable or do no harm in small SSDs can lead to serious data corruption risks when those same SSDs are scaled up without the right architecture and sometimes with it in place too.

I know from the emails I get that many readers think that once they've looked at the single issue of flash endurance - they've covered the bases for enterprise SSDs.

That's why we're planning to publish a collection of definitive technology articles to help guide the industry through this risky transition process.

The new articles will provide users with the theoretical justifications they need when they are faced with the difficult economic choices that come from deploying different types of SSDs (with different cost models) in diverse applications within their organizations. the article

Disk Error Correction Company Gets $22 million Funding

Santa Clara, Calif. - April 9, 2008 - Link_A_Media Devices Corp secured $22 million in Series B financing.

The funding round, led by AIG SunAmerica Ventures, was secured from 4 additional financial and corporate investors - KeyNote Ventures, NEC Electronics, Micron and Seagate.

Link_A_Media Devices is developing a new class of chip controller resident data recovery solutions for HDDs and SSDs. These are designed to exceed the performance of conventional methods deployed in peripheral storage devices, as well as provide adaptive features that can be used during manufacturing to improve drive yields and product margins. ...Link_A_Media Devices profile

Editor's comments:-
MLC flash SSDs have high internal error rates, and data on failed devices is currently unrecoverable. It looks like Link_A_Media's technology could improve the odds of data recovery in failed devices which incorporate it (as well as reducing data errors while the SSD is still operational).

Another side effect of their technology may be better performance in flash SSDs.

Link_A_Media says their IOP Buster architecture enables scalability within the controller to address various segments of SSD applications seamlessly. It enables faster Read and Write transfers.

Spectra Libraries will Log Tape Health Metrics

SNW, ORLANDO, FL - April 8, 2008 - Spectra Logic announced details of its soon to be released new Media Lifecycle Management software for its tape library customers.

MLM will reduce backup failures by tracking more than 30 pieces of information about individual LTO tapes and logging it on the tape's built-in flash chip. Information such as: born-on date, number of reads and writes, error rate, media quality, date of last access, application usage, encryption information, cleaning log and remaining capacity are tracked. MLM and BlueScale are compatible with all major backup applications. ...Spectra Logic profile

Editor's comments:-
now that the tape library market is past its peak and into its declining years, it looks like customers will get all kinds of useful information and services which they probably would have liked to have had earlier. This sounds similar in concept to the SMART logs in hard disks and SiSMART in SiliconSystems' flash SSDs.

Pillar's Petabyte Arrays are 99.999% Available

San Jose, Calif. - April 7, 2008 - Pillar Data Systems today announced availability of the Pillar Axiom 500MC - a mission critical storage system.

The Pillar Axiom 500MC delivers up to 192GB of cache, with the ability to scale capacity to 1.6 petabytes. The system supports both fibre channel and SATA disk drives. Pillar guarantees 99.999% availability. ...Pillar profile

Does Unhappy Notebook Maker Have High Rate of SSD Flash Backs?

Editor:- March 19, 2008 - a report, discussed in an article on CNET, claiming that flash SSDs in notebooks are incurring double-digit customer reject rates has been dismissed by Dell as "untrue."

Study Enumerates Key Factors in Disk Array Failures

Editor:- March 6, 2008 - a recently published paper called - Are Disks the Dominant Contributor for Storage Failures? - reports on a 3 year study of nearly 2 million operating disks.

Among the many findings:- the annualized failure rate in near-line systems which mostly use SATA disks is approximately twice as high as in systems which mostly use fibre-channel disks. But other factors such as datapath resilience, presence or absence of RAID and reliability of the rack system components are just as significant contributors to storage reliability as the hard disks themselves. the article

Are MLC SSDs Ever Safe in Enterprise Apps?

Editor:- February 27, 2008 - today we published a new article called - Are MLC SSDs Ever Safe in Enterprise Apps?

This is a follow up article to the popular SSD Myths and Legends which, in early 2007, demolished the myth that flash memory wear-out (a comfort blanket beloved by many RAM SSD makers) precluded the use of flash in heavy duty datacenters.

This new article looks at the risks posed by MLC NAND flash SSDs which have recently hatched from their breeding ground as chip modules in cellphones and morphed into hard disk form factors. It starts down a familiar lane but an unexpected technology twist (which arrived in my email this morning) takes you to a startling new world of possibilities. the article

WEDC Targets Medical CompactFlash Market

Phoenix, AZ - December 19, 2007 - White Electronic Designs Corp is leveraging its defense industry experience and expertise to develop high-reliability modules for the growing portable medical device market.

According to the U.S. Census Bureau, there will be an expected 40 million persons in the U.S. over the age of 65 by 2010, driving the need for portable medical devices, especially for home use. The portable medical device market is driven by the same requirements and expectations as the defense segment; such as high quality and reliability, shorter development cycles, a well-defined and documented supply chain and extended product lifecycles. Among other products WEDC designs and manufactures one of the industry's first medical series CompactFlash cards. ...White Electronic Designs profile

Editor's comments:- WEDC has also recently published a paper Is All CompactFlash Really Created Equal? (pdf) which uses the medical instrumentation market as the backdrop for a discussion about flash SSDs similar to those concerns analyzed in SSD Myths and Legends - "write endurance" - which looked at the enterprise server market.

Patent May Suit High Reliability SSD OEMs

MINNETONKA, MN - November 23, 2007 - ECC Technologies, Inc. announces that its parallel Reed-Solomon error correction designs and US Patent are immediately available for licensing.

PRS encoder and decoder designs allow parallel I/O storage devices to be designed with automatic, built-in backup (fault-tolerance). PRS applied to flash SSDs (for example) enables SSDs to be designed that can tolerate NAND Flash chip failures. PRS can also be applied to Hard Disk Arrays. Potential licensees can read about the PRS technology applied to SSDs and to HDDs on these preceding links. ...ECC Technologies profile, storage reliability
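The core idea - striping data across flash chips with redundancy so that a whole-chip failure is survivable - can be illustrated with simple XOR parity. (This is a minimal sketch of the concept only; ECC Tek's patented PRS uses parallel Reed-Solomon codes, which can tolerate multiple simultaneous chip failures, where plain XOR tolerates one.)

```python
from functools import reduce

def parity(chunks):
    """Byte-wise XOR across equal-length chunks (one chunk per chip)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

# three data chips plus one parity chip (contents are arbitrary examples)
data_chips = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity_chip = parity(data_chips)

# chip 1 fails; rebuild its contents from the survivors plus parity
rebuilt = parity([data_chips[0], data_chips[2], parity_chip])
assert rebuilt == data_chips[1]
print("chip rebuilt:", rebuilt.hex())  # → chip rebuilt: 1020
```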

Editor's comments:-
in the early days of a fast growing technology market most vendors are too busy growing their revenue by selling products to customers. But when markets get big enough or growth rates slow down - another round kicks in - of harvesting money from those who succeeded in the market - but didn't protect themselves properly with patents.

When I was a young engineer several designs of mine did get patented. In one particular company I remember being asked to leaf through some 10 year old logbooks of my predecessors to find some prior art to help nullify a competitor's potential attack. I always preferred doing things my own way - so I grumbled at being asked to delve into these dusty old files. But I did find what my boss was looking for.

Panasas Solution Targets RAID Unreliability

FREMONT, CA - October 9, 2007 - Panasas, Inc. announced the Panasas Tiered Parity Architecture which the company claims is the most significant extension to disk array data reliability since Panasas CTO Garth Gibson's pioneering RAID research at UC-Berkeley in 1988.

With the release of the ActiveScale 3.2 operating environment, Panasas will offer an innovative end-to-end Tiered-Parity architecture that addresses the primary causes of storage reliability problems and provides the industry's first end-to-end data integrity checking capability.

Traditional RAID implementations protect against disk failures by calculating and storing parity data along with the original data.

In the past 10 years, individual disk drives have become approximately 10x more reliable and over 250x denser than those protected by the first generation RAID designs in the late 1980s. Unfortunately, the number of disk media failures expected during each read over the surface of a disk grows proportionately with the massive increase in density and has now become the most common failure mode for RAID. A RAID disk failure can cause loss of all the data in a volume which may be tens of terabytes or more. Recovery of the lost data from tape (assuming that is all backed up) can take days or even weeks.

Other storage system vendors recognize this same issue and apply RAID 6, often called double parity RAID, to address this problem. Double parity schemes only treat the symptom of the failure, not the cause, and they carry substantial cost and performance penalties, which will only get worse as disk drive densities continue to increase.
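The scale of the media-error exposure during a rebuild is easy to quantify with a back-of-envelope model (the bit error rate below is a typical drive spec-sheet figure assumed for illustration, not Panasas's data):

```python
def p_ure_during_rebuild(capacity_tb, ber=1e-14, surviving_drives=1):
    """Probability of hitting at least one unrecoverable read error while
    reading the surviving drives end-to-end during a RAID rebuild.
    ber=1e-14 errors/bit is a common SATA-class spec sheet figure."""
    bits = capacity_tb * 1e12 * 8 * surviving_drives
    return 1 - (1 - ber) ** bits

# reading one 1TB survivor vs. rebuilding a 7-survivor RAID 5 set of 1TB disks
print(round(p_ure_during_rebuild(1.0), 2))                      # → 0.08
print(round(p_ure_during_rebuild(1.0, surviving_drives=7), 2))  # → 0.43
```

At these densities a media error during rebuild is no longer a rare corner case, which is the argument both for RAID 6 and for Panasas's approach.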

Panasas Tiered Parity architecture directly addresses the root cause of the problem, not the symptom. Solving the storage reliability problem caused by these new 1TB and larger disks allows Panasas to build larger, more reliable storage systems that let users get more value from their data and are less expensive for IT to support.

"The challenges with storage system reliability today have little to do with overall disk reliability, which is what RAID was designed to address in 1988. The issues that we see today are directly related to disk density and require new approaches. Most secondary disk failures today are the result of media errors, which have become 250x more likely to occur during a RAID failed-disk rebuild over the last 10 years," said Garth Gibson, CTO of Panasas. "Tiered Parity allows us to tackle media errors with an architecture that can counter the effects of increasing disk density. It also solves data path reliability challenges beyond those addressed by traditional RAID and extends parity checking out to the client or server node. Tiered Parity provides the only end-to-end data integrity checking capability in the industry." ...Panasas profile

Editor's comments:-
the problem of data corruption in large data sets because of obsolete technology assumptions built into hard disks, interface and RAID products has been looming for several years. You can see articles and research about this on the storage reliability page.

Is the solution more reliable hard drives? better interfaces? or a smarter storage OS? Users can't wait another 5 years for ideal solutions because the symptoms are there today when you look. The Panasas solution sounds like a pragmatic tactical approach for some customers - but the industry is a long way from a better storage reliability mousetrap.

Why Sun will Shine with a New Lustre

SANTA CLARA, Calif - September 12, 2007 - Sun Microsystems, Inc. today said it will acquire the majority of Cluster File Systems, Inc.'s intellectual property and business assets, including the Lustre File System.

Sun intends to add support for Solaris OS on Lustre and plans to continue enhancing Lustre on Linux and Solaris OS across multi vendor hardware platforms. ...Sun Microsystems profile, Acquired storage companies

Editor's comments:-
I hadn't heard of this company before. A sure sign that they were heading straight for the gone away storage companies list without any deviations en route. Here's what I picked up from their web site, present and past.

The Lustre product description (pdf) says - "the Lustre architecture was first developed at Carnegie Mellon University as a research project in 1999." The company's website started in about 2001 and they released Lustre 1.0 in 2003. By 2004 they had a product ready for a bigger market.

Strangely enough Solaris support isn't listed as a strong feature in their recent roadmap. So why does Sun want this technology? - Well - even if you're not in the supercomputer business - some technologies which start there eventually trickle down to the rest of us. "Zero single points of failure" - mentioned on their home page - is a good enough reason. As I wrote in my 7 year storage market predictions (2005) storage reliability is going to become a major headache in enterprise storage in the next 5 years.

See also:- Robin Harris's blog which explains the business background to CFS - "why aren't they rich?"

Tapewise Enterprise Checks Tape Media Errors

Farnborough UK - September 18, 2007 - Data Product Services today announces the release of Tapewise Enterprise.

Tapewise is software that writes data to a tape and then reads it again, tracking any errors, soft recoverable ones or unrecoverable ones, that occur. It streams a whole tape through a drive in this way and, with its Tape Error Map technology, produces a 3D graph showing errors encountered along the length of a tape when data was being read and written.

The user can decide what an acceptable error rate is and that boundary will be shown on the graph with any error rates above the user-defined norm instantly visible. The software supports a large number of tape formats: 3480; 3490; DLT; SDLT; 3590; 9840; 9940; T10000; LTOs 1, 2 and 3 and 3592. Costs start at $16,000 approx. A free 14-day evaluation copy is available. ...Data Product Services profile, Tape drives, Storage Testers

Noise Damping Techniques for PATA SSDs

Editor:- August 10, 2007 - SiliconSystems today published a new white paper called - "Noise Damping Techniques for PATA SSDs in Military-Embedded Systems."

This article looks at electronic signal integrity issues in integrating high speed PATA SSDs. It helps electronic designers understand how factors such as ground bounce, loading, power supply noise and signal trace mismatches can lead to false data or even device damage. Examples given in the tutorial style commentary include scope shots and logic analyzer traces. the article, ...SiliconSystems profile, storage chips, storage analyzers

Editor's comments:-
the article gives a good grounding (couldn't resist that one) in the signal quality factors needed to get high reliability operation and is equally relevant to hard disks. To simplify the 20 page document:- if you connect reliable electronic modules using unreliable signal paths - that will compromise the integrity of the data. Logic states are virtual - but digital signals are real and can have completely different shapes to what you expect if you don't follow basic rules.

Squeak! - Green Storage - What's Green. What's Not

Editor:- June 24, 2007 - today we published a new article - Green Storage - Trends and Predictions.

There's a lot of nonsense in the media about so called "Green Storage". This article blows away the puffery and clears the air for a better view of forward looking green data storage technologies. Reliability gets an honorable mention. Find out what's really green - and what's not. the article

Hard Drive Unreliability Costs are Reason to Switch to SSDs

Aliso Viejo, Calif., May 30, 2007 - SiliconSystems, Inc. today announced the publication of a white paper called - "Solid-State Storage is a Cost-Effective Replacement for Hard Drives in Many Applications."

The paper cites data from Google and Carnegie Mellon University that indicates hard drive field failure rates are up to 15x greater than quoted in disk manufacturer data sheets. The white paper was developed by SiliconSystems to educate OEMs about the numerous technical and business decisions they must successfully navigate to select the best storage solution for their application. the article (pdf), ...SiliconSystems profile

Editor's note:- storage reliability is a type 4 application in our SSD Market Adoption Model.

Debunking Misconceptions in SSD Longevity

Editor:- May 11, 2007 - BiTMICRO Networks today published a new article called - "Debunking Misconceptions in SSD Longevity."

It cites lifetime predictions from my own popular article - SSD Myths and Legends - "write endurance" and fires a warning shot aimed at some competitors by saying "some flash SSD makers have even quoted higher write endurance ratings than those provided by manufacturers of their flash memory components."

That's certainly true - but I knew when writing my article that endurance varies from batch to batch of flash chips within the same semiconductor fab process. Some SSD oems sample test and reject chips which are at the lower end of the distribution curve. That means their worst case numbers are better than would be the case by simply accepting merchant quality flash chips. Although starting from a different base of assumptions - BiTMICRO's article "conclude(s) that fears about the endurance limitations of SSDs are rightfully fading away."

Seagate Drops Notebook Drives

SCOTTS VALLEY, Calif - March 12, 2007 - Seagate Technology today announced the worldwide availability of a 7,200 RPM hard drive with free-fall protection for beefed-up laptop durability.

Momentus 7200.2 delivers up to 160GB of capacity and has a SATA interface. The hard drive is also offered with an optional free-fall sensor to help prevent drive damage and data loss upon impact if a laptop PC is dropped. The sensor works by detecting any changes in acceleration equal to the force of gravity, then parking the head off the disc to prevent contact with the platter in a free fall of as little as 8 inches. ...Seagate profile

Editor's comments:-
Hitachi revealed details about its similar ESP drop sensor in 2005. The drop sensor approach is better than nothing, but doesn't get around the unavoidable fact that hard disks can break when dropped.

Another approach is that of Olixir Technologies who have marketed repackaged high performance hard drives which can be dropped repeatedly onto a concrete floor from 6 feet and still survive.

But solid state disks are inherently even tougher than that because there are no internal moving parts to crash together. That's why they have been used in space ships, helicopters and missiles. In 2006 In-Stat predicted that half of all mobile computers would use SSDs (instead of hard disks) by 2013. It's not just the ruggedness and better power consumption. A video by Samsung demonstrates the advantages more graphically.

Hard Disk MTBF Specs Incredible - Say User Reports

Editor:- February 28, 2007 - an article published today in Channel Insider - "Hard Disk MTBF: Flap or Farce?" - casts serious doubt on the inflated MTBF claims made by all hard disk manufacturers.

Reviewing a number of recently published reliability studies from end users - the author David Morgenstern says "...there's a gap between the reliability expectations of manufacturers and customers. The current MTBF model isn't accounting accurately for how drives are handled in the field and how they function inside systems." the article, storage reliability

Google Reports on HDD Reliability

Editor:- February 20, 2007 - researchers at Google have published a paper at the recent Usenix conference about hard disk reliability and failure prediction - based on their own experience as a large user of hard disk drives.

The fascinating paper describes how Google measured available metrics and status reports generated by the drives themselves and how this correlated with actual failure patterns. One of the key insights in the report is Google's view of how useful SMART parameters were for predicting failures.

"Our results are surprising, if not somewhat disappointing. Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count. In other words, models based only on those signals can never predict more than half of the failed drives... ...even when we add all remaining SMART parameters (except temperature) we still find that over 36% of all failed drives had zero counts on all variables." the article, Hard disk drives, storage reliability

PS - the measured data on the percentage of disks which fail each year over a 5 year cycle under various conditions is essential reading for disk to disk backup contingency planning.
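The recall ceiling implied by those quoted figures can be sanity-checked with a couple of lines of arithmetic. This is my own illustrative sketch, not code from the Google paper:

```python
# If a failure predictor needs at least one SMART signal to flag a drive,
# the fraction of failed drives showing zero counts caps its achievable recall.
def max_recall(fraction_failed_with_no_signal):
    """Upper bound on the fraction of failures such a predictor can catch."""
    return 1.0 - fraction_failed_with_no_signal

# 56% of failed drives had no counts in the four strong signals:
print(round(max_recall(0.56), 2))  # 0.44 - "can never predict more than half"
# 36% had zero counts even across all SMART parameters except temperature:
print(round(max_recall(0.36), 2))  # 0.64
```

In other words, even a perfect model restricted to those signals misses the majority of failures - which is exactly the paper's point.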

Agere Halves Power Consumption for Mobile HDD Interface

ALLENTOWN, Pa - February 6, 2007 - Agere Systems has begun shipping a new fully functional 90-nanometer TrueStore read channel.

The TrueStore RC1300 uses half the current required by the previous generation of read channel chips in this market segment and is 25% faster. It targets the 1.8-inch and smaller HDD form factors that provide critical data storage of 20 to 160 gigabytes in a wide variety of consumer devices. ...Agere Systems profile

STORAGEsearch Launches a New Strategic Directory - Storage Reliability

Editor:- June 20, 2006 - STORAGEsearch today launched a new directory dedicated to the subject of "Storage Reliability".

Reliability was named as one of the 3 most important future trends in storage in my state of the storage market article published last year. In that article I also predicted that uncorrectable failures in storage systems (due to embedded design assumptions made in earlier generations) could, if not dealt with by drive and interface designers, pose a more serious threat to enterprise computer systems than the Y2K bug did in the late 1990s.

In addition to covering news about what the industry is doing to improve reliability in future drives, media and interfaces, STORAGEsearch has invited CTOs and technical directors of leading companies to write special articles about this subject - which will appear in the months ahead.

When most people think about storage reliability - they think about MTBF and thermal factors.

If an individual drive isn't reliable enough - wrap it in a RAID. If heat reduces the life of the disks - then cool them with more fans. If a memory system or interface is critical to an application - cocoon it with error detection and correction codes. Those are approaches which have worked adequately for the past few decades - but they are not good enough any more.

The demands for storage reliability are growing. Non-stop applications need data that can be trusted to be available on demand. Compliance dictates that data should be readable not just years - but possibly decades - after it was created. Meanwhile storage components, interfaces and systems are increasing in speed and capacity - while many of them use error correction thinking that comes from earlier generations when data sets were smaller. As storage gets bigger - users face the risk of having uncorrectable errors in the heartland of their decision making data. That's why - all over the industry - manufacturers are starting to talk about new storage reliability initiatives.

There's also the risk that new storage technologies which get rushed to serve the needs of the consumer market - have not in fact been tested long enough to guarantee that they will not fail or start to corrupt data in the timeframe that enterprise customers care about.

Wrapping arrays of consumer disks - based on media technology proven for only 2 years - in a big "enterprise" box cannot guarantee that the data will still be readable in 5 years' time. This is not a worry for consumers. They'll throw a failed disk away or buy a new one. But if your enterprise owns thousands of these disks (hidden by virtualization) it could be a big headache when the crumbly nature of the storage defects starts to hit the news. This is another of the many concerns we'll be covering in these pages. Storage media have failed in the past and been withdrawn because they didn't meet their original extrapolated lifetimes. Lessons from past errors are not always learned - they can be forgotten and recur.

Storage reliability is changing. If you are interested - I hope you'll stay tuned to the new storage reliability channel here on the mouse site - as we report on these exciting developments in the months ahead.

Why Solaris will Get 128 Bit Addresses

Editor:- May 1, 2006 - an article published today discusses the Zettabyte File System - a new 128 bit addressing scheme for Solaris.

The article says that apart from the obvious advantage of being able to access more storage, Sun is apparently thinking about building error correction into the new addressing scheme.

In a market forecast published last year - storage reliability and failures were cited as one of the most important long term problems which oems and users will have to deal with.

The cause of the problem is that storage interfaces as well as modules and components (like disks, tapes, optical drives etc) use error correcting schemes which were designed for the much smaller and slower architectures of the past. As storage systems expand - new algorithms and correction schemes will be needed to guarantee that users don't get affected by data failures which are uncorrectable using today's products and protection schemes.

It's good to see that Sun is working proactively on one aspect of the problem. I've talked to many storage manufacturers about the upcoming reliability problem - which could be more serious than the Y2K threat if not dealt with in advance. Sun is highly sensitive to data reliability concerns. Problems with its own SPARC server cache memory design back in 2001 were cited at the time by many large users as reasons for considering a switch to Intel and PowerPC based systems.

See also:- SPARC Product Directory

Hard Disk Sector Size May Change

SUNNYVALE, Calif - March 23, 2006 - IDEMA today announced the results of an industry committee assembled to identify a new and longer sector standard for future magnetic hard disk drives.

The committee recommended replacing the 30-year-old standard of 512-byte sectors with sectors able to store 4,096 bytes. Dr. Ed Grochowski, executive director of IDEMA US, reported that adopting a 4K byte sector length facilitates further increases in data density for hard drives - increasing storage capacity for users while continuing to reduce cost per gigabyte.

"Increasing areal density of newer magnetic hard disk drives requires a more robust error correction code, and this can be more efficiently applied to 4,096 byte sector lengths," explained Dr. Martin Hassner from Hitachi GST and IDEMA Committee member. ...IDEMA profile
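The efficiency argument is easy to illustrate with rough numbers. The per-sector overhead byte counts below are my own assumptions for illustration, not IDEMA figures:

```python
# Larger sectors amortize per-sector overhead (gaps, sync marks, ECC) over
# more user data - and one long ECC codeword protects 4KB more efficiently
# than eight short codewords protect 512B each. Overhead sizes are assumed.
def format_efficiency(data_bytes, overhead_bytes):
    """Fraction of on-disk bytes that carry user data."""
    return data_bytes / (data_bytes + overhead_bytes)

legacy = format_efficiency(8 * 512, 8 * 65)  # eight 512B sectors, ~65B overhead each
advanced = format_efficiency(4096, 100)      # one 4KB sector, one ~100B ECC field
print(f"512B sectors: {legacy:.1%}  4K sector: {advanced:.1%}")
```

With these assumed numbers the 4K format spends roughly 10% less of the platter on overhead - which is the kind of gain that lets drive designers afford the "more robust error correction code" Dr. Hassner describes.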

Whitepaper Measures ROI of Disk Defragmentation

Burbank, CA - January 24, 2006 - Diskeeper recently sponsored IDC to write a whitepaper called - "Defragmentation's Hidden Value for the Enterprise."

This measured the ROI of defragmentation software in real customer sites. During the reliability test, the servers that were defragmenting files automatically had a higher uptime (5 to 10%) than the servers that didn't have defragmentation software automatically running. the article (pdf), ...Diskeeper profile

ProStor Systems Unveils New Backup Technology

BOULDER, CO - November 2, 2005 - ProStor Systems made its public debut today by introducing the firm's RDX removable disk backup technology.

The RDX removable cartridge uses the same 2.5" hard disk media platters found in notebook computers and provides an initial capacity of up to 400GB (compressed). That will increase in line with conventional hard disk technology. But the difference is that RDX uses a new patent-pending error correcting format, which makes the data 1,000 times more recoverable than on a standard hard drive. ProStor says this means that RDX-stored data will be readable even after the cartridge has been archived and non-operating for more than a decade. ...ProStor Systems profile, Removable Storage, Disk to disk backup, Storage People

Editor's comments:-
the reliability of embedded storage modules and components such as disk drives, tape drives and optical disks will become an important issue for users in the next 7 years.

These products rely on inbuilt error correction algorithms which were designed over a decade ago - when storage capacities were much smaller. All those "ten to the minus something" numbers which you see quoted for error rates sound good - except that when your enterprise is managing petabytes of data, at ever higher connection speeds, you will start seeing uncorrectable data failures every year - inside the storage, and beyond the scope of your RAID or other protection scheme to correct. ProStor is one of a new generation of storage manufacturers addressing this problem, and we'll soon publish a directory section dedicated to storage reliability issues such as this.
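To make those "ten to the minus something" numbers concrete, here's a back-of-envelope sketch. The UBER specs used are illustrative values typical of drive datasheets, not any particular vendor's figures:

```python
# Expected uncorrectable read errors = bits read x quoted unrecoverable
# bit error rate (UBER). At petabyte scale the expectation is no longer << 1.
def expected_errors(bytes_read, uber):
    """Expected number of uncorrectable errors for a given read volume."""
    return bytes_read * 8 * uber

PB = 1e15  # one petabyte in bytes
print(expected_errors(1 * PB, 1e-14))  # ~80 expected errors per PB read
print(expected_errors(1 * PB, 1e-15))  # ~8 even at an enterprise-class spec
```

A "1 in 10^14" error rate sounds negligible per bit, but an enterprise that reads a petabyte through such drives should expect dozens of uncorrectable errors - which is exactly why the old numbers stop being reassuring at scale.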


SSD news
data recovery
the SSD reliability papers
what's the state of DWPD?
high availability enterprise SSDs
what's some electrons more or less?
"Reliability is more than just MTBF...

and unlike Quality - it's not free.

The battle for storage reliability never stops. It has to be fought - in every place where physics intrudes on data integrity. It must be fought and won anew - in every technology generation and in every new product design."

Zsolt Kerekes, editor -


IMA (Industrial, Medical & Automotive)
XTREME series SSDs - from Flexxon

can you always assume that newer storage is more reliable?
I created this dedicated storage reliability page in 2006. I had flagged storage reliability as a long term strategic concern for the market in a trends article (published in 2005) - in which I said that the risks posed by uncorrectable data failures due to systemic design flaws in storage drives "could be more serious than the Y2K bug threat - if not dealt with in advance."

Most people didn't understand what I was talking about. They (wrongly) assumed that they could always depend on oems to design a workable level of reliability into their storage products. And if that wasn't good enough - then a wraparound layer of RAID supported by some type of data backup would work well enough for their needs.

In 2010 - as we got sucked into the SSD market bubble - we began to see more customer concerns about the poor reputation which some leading storage oems were acquiring by shipping undependable and incompletely verified SSD designs.

The intrinsic reliability of many types of storage products will get worse than it was before. So too will data integrity. The only way to get usable data systems from these raw materials is reliability oriented architecture and self-healing controller management.

As I said in 2005 - the assumption that storage reliability is a boring subject which enterprises don't need to worry about - will be shown to be wrong.

The only way to understand these trends and to avoid disastrous vendor choices is to read and understand more about this subject.
Data Integrity Study at CERN (pdf)
how fast can your SSD run backwards?
Latent Sector Errors in Hard Disk Drives
Increasing Flash Solid State Disk Reliability
SSD Myths and Legends - "write endurance"
Failure Trends in a Large Disk Drive Population (pdf)
reliability - editor mentions
Are Disks the Dominant Contributor for Storage Failures?
Reliability Mechanisms for Very Large Storage Systems (pdf)
Reliability Modeling for Long Term Digital Preservation (pdf)
Empirical Measurements of Disk Failure Rates and Error Rates
Understanding Soft and Firm Errors in Semiconductor Devices (pdf)
Data Loss and Hard Drive Failure: Understanding the Causes and Costs
A reader contacted me to say he was worried about the viability and reliability of large arrays of SSDs as used in the enterprise.

He said - "One thing that you don't touch on but SSD reliability engineers (a small discipline) do is the internal power conversion itself. The DCDC converters down-stream from the holdup caps or batteries also have a finite operational life time and certain specific failure mechanisms. If these fail, there is NO recovery since the power interruption is immediate."
FITs, reliability and abstraction levels in SSDs


Data recovery from DRAM?
I thought everyone knew that

Can You Trust Your Flash SSD's Specs?
In 2008 I began to notice that the published specifications of flash SSDs change a lot - from the time a product is first announced, to when it's being sampled, and later again when it's in volume production.

Sometimes the headline numbers get better, sometimes they get worse. There are many good reasons for this.

The product which you carefully qualified may not be identical to the one that's going into your production line for a variety of reasons...

And here's another thing to worry about...

The enterprise flash SSDs which you benchmarked yourself may surprise you by running much slower when deployed in your own applications - due to common "halo" errors which are implicit in the setups of many performance test suites originally designed for HDDs. the article