
storage reliability - news & white papers

see also:- the SSD reliability papers
HA enterprise SSD arrays

Editor:- January 26, 2012 - due to the growing number of oems in the high availability rackmount SSD market StorageSearch.com today published a new directory focusing on HA enterprise SSD arrays.

The new directory will make it easier for users to locate specialist HA SSD vendors, related news and articles.


Pushing data reliability up hard drive hill

Editor:- July 4, 2011 - Why didn't hard drives get more reliable? Enterprise users are still replacing hard drives on cycles that haven't changed much since RAID became common in the 1990s. So why didn't HDD makers do something to make their drives better?

Error correction code inventor Phil White, founder of ECC Technologies, has recently published a rant / blog in which he describes the 25 years of rejections he's had from leading HDD makers - and the reasons they said they didn't want to use his patented algorithm - which he says could increase data integrity and the life of hard drives (and maybe SSDs too.) It makes interesting reading for any other wannabe inventors out there too. ...read Phil White's article

But I think another reason for past rejections might simply have been market economics.

The capacity versus the cost of HDDs has improved so much throughout that period - and at the same time data capacity needs have grown - maybe the user value proposition didn't make sense.

If you (RAID user) find that all your 5 year old drives are still working (instead of being replaced) - how much is that really worth? By now those 5 year old drives might only represent 3% to 10% of the new storage capacity you need anyway. (The reliability value proposition is different outside the service engineer frequented zone - but I don't want to get side-tracked into SSD market models here.)

Looking ahead at the future of the HDD market my own view is that whatever the industry does with respect to reliability won't tip the balance against SSDs in the enterprise.

The best bet for the future of hard drive makers is in consumer products where fashion ranks higher up the reason to buy list than longevity. Most people I know replace their notebook pcs, tvs and phones not because the old ones have stopped working - but because the new ones have lifestyle features which make them more desirable.


optimizing SSD architecture to cope with flash plane errors

Editor:- May 26, 2011 - a new slant on SSD reliability architectures is revealed today by Texas Memory Systems who explained how their patented Variable Stripe RAID technology is used in their recently launched PCIe SSD card - the RamSan-70.

TMS does a 1 month burn-in of flash memory prior to shipment. (One of the reasons cited for its use of SLC rather than MLC BTW.) Through its QA processes the company has acquired real-world failure data for several generations of flash memory and used this to model and characterize the failure modes which occur in high IOPS SSDs.

Most enterprise SSDs use a simple type of classic RAID which groups flash media into "stripes" containing equal numbers of chips. RAID technology can reconstruct data from a failed Flash chip. Typically, when a chip or part of a chip fails, the RAID algorithm uses a spare chip as a virtual replacement for the broken chip. But once the SSD is out of spare chips, it needs to be replaced.

VSR technology allows the number of chips to vary among stripes, so bad chips can simply be bypassed using a smaller stripe size. Additionally, VSR provides greater stripe size granularity, so a stripe could exclude a small part of a chip rather than having to exclude an entire chip if only part of it failed (a "plane error"). With VSR technology, TMS says its SSD products will continue operating longer in the installed base.

Dan Scheel, President of Texas Memory Systems explained why their technology increases reliability.

"...Consider a hypothetical SSD made up of 25 individual flash chips. If a plane failure occurs that disables 1/8 of one chip, a traditional RAID system would remove a full 4% of the raw Flash capacity. TMS VSR technology bypasses the failure and only reduces the raw flash capacity by 0.5%, an 8x improvement. TMS tests show that plane failures are the 2nd most common kind of flash device failures, so it is very important to be able to handle them without wasting working flash."

Editor's comments:- by wasting less capacity than simpler RAID solutions - more usable capacity remains available for traditional bad block management. This extra capacity comes from the over-provisioning budget, a figure which varies with each SSD design (as discussed in my recent flash iceberg syndrome article) but is 30% for TMS.
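
The arithmetic in the 25-chip example quoted above can be checked with a short sketch. The function name `capacity_lost` is hypothetical, and the model is deliberately simple: it assumes all chips are the same size and that a traditional RAID scheme retires the whole chip, while a variable stripe scheme retires only the failed fraction.

```python
from fractions import Fraction


def capacity_lost(num_chips, failed_fraction_of_chip, whole_chip_retired):
    """Fraction of raw flash capacity lost after one partial-chip failure.

    Assumes equally sized chips (an illustrative simplification).
    """
    per_chip = Fraction(1, num_chips)
    if whole_chip_retired:
        # Traditional stripe: the entire chip is swapped out.
        return per_chip
    # Variable stripe: only the failed part of the chip is bypassed.
    return per_chip * failed_fraction_of_chip


chips = 25
plane = Fraction(1, 8)  # a plane failure disabling 1/8 of one chip

traditional = capacity_lost(chips, plane, whole_chip_retired=True)
vsr = capacity_lost(chips, plane, whole_chip_retired=False)

print(f"traditional RAID loses {float(traditional):.1%}")  # 4.0%
print(f"variable stripe loses  {float(vsr):.1%}")          # 0.5%
print(f"improvement factor: {traditional / vsr}")          # 8
```

The factor of 8 is simply the reciprocal of the failed fraction, so under this model the benefit grows as failures become smaller relative to a whole chip.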


what happens in SSDs when power goes down? - and why you should care

Editor:- February 24, 2011 - StorageSearch.com today published a new article - SSD power is going down! - which surveys power down management design factors in SSDs.

Why should you care what happens in an SSD when the power goes down?

This design feature - which barely rates a mention in most SSD datasheets and press releases - is critical in determining SSD data integrity and operational reliability. This article will help you understand why some SSDs which work perfectly well in one type of application might fail in others... even when the changes in the operational environment appear to be negligible. If you thought endurance was the end of the SSD reliability story - think again. ...read the article


Business opportunities from Intel's imperfect bridge chips

Editor:- February 9, 2011 - Intel Knowingly Sells Faulty Chipsets. Are they Crazy? is a new article on PCWorld.com which discusses how Intel is dealing with the issue of a bridge chip with known defects in some SATA ports.

I rarely read that publication because my interests are enterprise storage and SSDs - but the author Keir Thomas had linked to StorageSearch.com from another recent article he wrote - Seagate: SSDs are Doomed (at Least for Now) - which showed up in my web stats.

When I started my storage reliability directory in 2006 - I knew that large storage vendors would ship flaky SSDs and hard drives - but I assumed that would be due to the unwitting and creeping use of inappropriate design and testing methodologies - rather than deliberate business decisions.

Another characteristic of this Intel chip is that if oems populate all the RAM slots which it "supports" - the speed drops down to unattractive levels.

But that's not bad news for everyone. Adrian Proctor, VP of Marketing at Viking told me last month it means there's a growing population of DIMM slots on motherboards which can't be used for RAM - but could be used instead to save space and power by installing their SATADIMM SSDs to replace HDDs as boot drives. Other companies make 1 inch and smaller SSDs too.


comparing SSD and HDD failure rates in retail

Editor:- December 10, 2010 - the failure rates for SSDs and hard drives in the retail channel are compared in a recent article which is part of a regular feature on the French website HARDWARE.FR. Because many consumer SSD designs have been flaky - the apparent similarities suggested in the French report should not be taken to be typical of SSDs as a whole.

On the contrary - a much bigger difference in field reliability is suggested by the business models of industrial SSD makers and enterprise server SSD makers for whom better reliability is part of the value proposition - and by anecdotal reports which I've had from many data recovery companies.


10,000x more reliable than RAID?

Editor:- August 26, 2010 - Amplidata claims that its BitSpread technology is 10,000x more reliable than current RAID based technologies and requires 3x less storage.

Is another new way of fixing reliability problems in hard disk arrays worth the effort just as we approach the end of the hard disk market's life? - I doubt it. See why in - this way to the petabyte SSD.


how to make "SSD reliability" believable - marketing case study

Editor:- July 29, 2010 - StorageSearch.com today published a new article - the cultivation and nurturing of "reliability" in a 2.5" SSD brand.

Reliability is an important factor in many applications which use SSDs. But can you trust an SSD brand just because it claims to be reliable?

As we've seen in recent years - in the rush for the SSD market bubble - many design teams which previously had little or no experience of SSDs were tasked with designing such products. The result has been successive waves of flaky SSDs, and SSDs whose specifications couldn't be relied on to remain stable and which, in many products, quickly degraded in customer sites.

As part of an education series for SSD product marketers - this new case study describes how one company - which didn't have the conventional background to start off with - managed to equate their brand of SSD with reliability in the minds of designers in the embedded systems market. ...read the article


Anobit aims at SandForce SSD SoCs slots

Editor:- June 15, 2010 - Anobit announced it is sampling SSDs based on its patented Memory Signal Processing technology, which provides a 20x improvement in operational life for MLC SSDs in high IOPS server environments.

Based on proprietary algorithms that compensate for the physical limitations of NAND flash, Anobit's MSP technology extends standard MLC endurance from approximately 3K read/write cycles to over 50K cycles - to make MLC technology suitable for high-duty cycle applications. This guarantees drive write endurance of 10 full disk writes per day, for 5 years, or 7,300TBs for a 400GB drive, with fully random data (worst-case conditions).
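
The endurance figure above is straightforward to reproduce. The sketch below is illustrative only: the function names are hypothetical, and it assumes decimal units (1TB = 1,000GB) and 365-day years, which happen to match the quoted 7,300TB.

```python
def total_writes_tb(capacity_gb, drive_writes_per_day, years, days_per_year=365):
    """Total host data written over the warranty period, in decimal TB."""
    total_gb = capacity_gb * drive_writes_per_day * days_per_year * years
    return total_gb / 1000


def full_drive_writes(drive_writes_per_day, years, days_per_year=365):
    """Number of complete drive fills over the warranty period."""
    return drive_writes_per_day * days_per_year * years


# Figures quoted in the announcement: 400GB drive, 10 fills per day, 5 years.
print(total_writes_tb(400, 10, 5))   # 7300.0
print(full_drive_writes(10, 5))      # 18250

# Assumption for illustration: if the quoted 50K cycles applied to every block
# and wear were perfectly even, this is the write amplification the drive
# could tolerate while still meeting the guarantee.
headroom = 50_000 / full_drive_writes(10, 5)
print(f"write amplification headroom: about {headroom:.1f}x")
```

The headroom line is an inference from the two quoted numbers rather than anything Anobit stated; real drives do not wear perfectly evenly.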

First-generation Anobit Genesis SSDs deliver 20,000 IOPS random write and 30,000 IOPS random read, with 180MB/s sustained write and 220MB/s sustained read.

Anobit says that some of the world's largest NAND manufacturers, consumer electronics vendors and storage solution providers currently utilize Anobit's MSP technology in their products.

"For too long, the high prices of SLC SSDs and concerns about MLC SSD endurance have slowed the adoption of flash memory storage in the enterprise. Anobit Genesis SSDs effectively neutralize both of these concerns," said Prof. Ehud Weinstein, Anobit CEO. "By delivering true enterprise-class SSD reliability at affordable MLC SSD prices, Anobit Genesis SSDs unlock the full promise of solid-state enterprise storage."

Editor's comments:- superficially the endurance delivered by Anobit's SSD controller is better than that obtainable from SandForce - whereas the performance lead is the other way around. For most oems what will be more important is that they do not need to be locked into a single technology supplier to get adequate metrics for their MLC SSD product lines.


flash SSD integrity architectures for space-craft

Editor:- April 13, 2010 - for those interested in flash SSD data integrity issues - Phil White, President of ECC Technologies has released a white paper - NAND Flash Memories for Spacecraft (doc).

Phil has been working with ECC for almost 37 years and his company is developing future ECC designs to allow systems architects to develop NAND flash memories that are highly reliable and fault-tolerant even if the NAND flash chips themselves are not so reliable.

NASA is using ECC Tek's designs in multiple missions. 2 of the designs are in space at the present time and are working perfectly. Phil White recently wrote a document for NASA and JPL which outlines how to design NAND Flash memories for spacecraft. The 22 page "preview" document excludes confidential data but gives a taste of the technology available for licensing. ...read the article


XLC promises "enterprise" hybrid x4 SSDs

Editor:- April 1, 2010 - XLC Disk announced details of a paper it will discuss later this month at the NV Memories Workshop (UC San Diego) called - "Paramagnetic Effects on Trapped Charge Diffusion with Applications for x4 Data Integrity."

The company says its findings could have applications in the enterprise storage market by solving the data integrity problems in x4 MLC SSDs within a new class of hybrid storage drives. ...read more


New Integrity Tool for Old Tape Archives

Editor:- January 18, 2010 - Crossroads Systems today announced details of ArchiveVerify - a new monitoring option for its ReadVerify Appliance that safeguards the future readability of data backed up on tape.

"In our experience, the Achilles' heel of a data recovery strategy is often the uncertainty of the data's readability, and this single point of failure can render the entire restore process useless," adds Bernd Krieger, Managing Director, at Crossroads Europe.

Editor's comments:- Crossroads was originally a specialist in the SAN router business. In recent years it has done a lot of work in the area of storage reliability. I've read lots of their whitepapers which describe their research and products addressing data integrity. Although there has been a historic trend for users to migrate away from tape to disk backup - many super users of huge tape libraries (with the biggest archives) will be the last to migrate away - due to logistics and cost. It's those kinds of users who can benefit most from automated tools or services which increase the data integrity they achieve and cut down media waste and unrecoverable events.


New article - Data Integrity Challenges in flash SSD Design

Editor:- October 12, 2009 - StorageSearch.com today published a new article called - Data Integrity Challenges in flash SSD Design - written by Kent Smith, Senior Director, Product Marketing, SandForce.

Since bursting onto the SSD scene in April 2009, SandForce has achieved remarkably high reader popularity. How did a company whose business is designing SSD controllers achieve this? - especially when the direct market for its products today numbers less than 1,000 oems.

The answer is - that if you want to know what the future of 2.5" enterprise SATA SSDs might look like - you have to look at the leading technology cores that will affect this market. Even if you're not planning to use SandForce based products yourself - you can't afford to ignore them - because they are setting the agenda in this market.

Reliability is the next new thing for SSD designers and users to start worrying about. A common theme you will hear from all fast SSD companies is that the faster you make an SSD go - the more effort you have to put into understanding and engineering data integrity to eliminate the risk of "silent errors." ...read the article


Real World Reliability in High Performance Storage

Editor:- August 20, 2009 - Density Dynamics published a whitepaper called - Real World Reliability in High Performance Storage (pdf).

It compares real world failure rates for HDDs and flash SSDs with predicted MTBF and endurance data and suggests that the big discrepancies reported by users are due to the nature of their workloads. In this respect it suggests RAM SSDs are better in heavy IOPS apps - even taking into account the MTBFs of batteries and UPS like components.

It also cites my own article RAM Cache Ratios in flash SSDs.


Why Consumers Can Expect More Flaky Flash SSDs!

Editor:- August 10, 2009 - a new article published today on StorageSearch.com explains why the consumer flash SSD quality problem is not going to get better any time soon.

You know what I mean. Product recalls, firmware upgrades, performance downgrades and bad behavior which users did not anticipate from reading glowing magazine product reviews. And that's if they can get hold of the new products in the first place.

We predicted this unreliability scenario many years ago. And you have to get used to it. The new article explains why it's happening and gives some suggested workarounds for navigating in a world of imperfect flash SSD product marketing. ...read the article


Ramtron's F-RAM Casualty of Auto Market Crash

Editor:- May 7, 2009 - Ramtron said its revenue declined 26% in the 1st quarter of 2009 compared to the year ago period.

A sharp decline in orders from the automotive market was cited as a principal cause.

Ramtron also announced an update on a legal suit related to in-field failures of one of its F-RAM memory products in an unspecified application. (In July 2008 Ramtron confirmed that specific batches of product had failed due to manufacturing process defects in one of its partners' fabs.)

Ramtron also announced today that, over the next 2 years, it will transition the manufacturing of products that are currently being built at Fujitsu's chip foundry located in Iwate, Japan to its foundry at Texas Instruments in Dallas, Texas and to its newest foundry at IBM Corp in Essex Junction, Vermont.


Why You Need Better ECC Inside the SSD

Editor:- April 16, 2009 - this week SandForce published an article on the subject of effective error correction in flash SSDs.

I like it because it resonates well with the thinking that led me to publish this reliability page 3 years ago.

At that time - I was concerned with the theoretical inadequacy of error correction used inside hard drives. (Something which has since been confirmed in practice and reported in some of the papers cited at the top of this page.)

SandForce's short article shows you the consequences - in terms of uncorrectable errors - if you use "industry standard" strength ECC. And that's part of the sales pitch for their 10-to-the-minus-something-better errors protection in their new SSD controller.


How Good SSD Controllers Manage Flash Data Integrity

Editor:- April 3, 2009 - SNIA has published a new white paper - "NAND Flash Solid State Storage for the Enterprise - an in-depth Look at Reliability." (pdf)

It's co-authored by:- Jonathan Thatcher (Fusion-io), Tom Coughlin (Coughlin Associates), Jim Handy (Objective Analysis) and Neal Ekker (Texas Memory Systems).

The article contains the best integrated explanation I've seen of the design trade-offs for error correction schemes and how they affect bit error rates compared to the raw uncorrected results. It goes on to explain the importance of the SSD controller and memory architecture (dispersing data among many chips) and how these can improve data integrity by managing read disturb errors. It also discusses wear-leveling and write amplification which have been well covered elsewhere. ...read the article

See also:- SSD Reliability - Understanding Data Failure Modes in Large Solid State Storage Arrays


SSD Bookmarks from Texas Memory Systems

Editor:- March 16, 2009 - Texas Memory Systems' President, Woody Hutsell - shares his SSD Bookmarks with readers of StorageSearch.com.

Those who know the SSD industry well, mostly think of TMS as a company which makes very fast SSDs for accelerating SAN resident applications. But in the many discussions I've had with Woody Hutsell during the past decade - "reliability" has also been a frequent topic in our conversations.

That's because when you manufacture products which pack more memory chips than anyone else has ever put into a single box - all those "10 to the minus something" numbers which relate physics to semiconductor memory effects - add up to design problems which are far from theoretical. TMS has been engineering solid state storage systems for 30 years. So I was not surprised to see an in depth paper about reliability being one of the articles in this list of bookmarks.


New Tool Acts as Bouncer for Up Market Tape Joints

Boulder, Colo. - February 3, 2009 - Spectra Logic has extended its Media Lifecycle Management technology outside the library with a new reader - now shipping.

The MLM Reader (approx $2,500) is a portable device that allows customers to check tape health on any computer through USB, without loading the tape into a library, and is designed to proactively identify faulty tape media before it is required for a data restore. It tracks over 30 non-volatile statistics about data tapes, such as export details; remaining capacity; encryption information; number of reads and writes; date of last access; born-on date; and cleaning log. ...Spectra Logic profile


SiliconSystems Proposes New Methodology for Realistically Predicting Flash SSD Reliability

Editor:- December 15, 2008 - Gary Drossel, VP Product Planning at SiliconSystems has written a new article - "NAND Evolution and its Effects on SSD Useable Life."

This is probably one of the 3 most significant articles on the subject of flash SSD reliability which have been published in recent years. Starting with a tour of the state of the art in the flash SSD market and technology the paper introduces several new concepts to help systems designers understand why current wear usage models don't give a complete picture.
  • Write amplification - is a measure of the efficiency of the SSD controller. Write amplification defines the number of writes the controller makes to the NAND for every write from the host system.
  • Wear-leveling efficiency - reflects the maximum deviation of the most-worn block to the least worn block over time.
The paper discusses the theoretical expected lifetimes and amplification factors for several applications and concludes that measurement of wear-out in real applications is the best way to understand what is happening. It suggests that systems designers can use the company's SiliconDrive (which includes real-time on-chip endurance monitoring) as an endurance analysis design tool. By simply plugging in SiliconDrive(s) to a new application for a day, week or month - the percentage of wear-out can be measured - and corrective steps taken (in software design or overprovisioning) to fix reliability problems.
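
The two concepts listed above can be illustrated with a rough sketch. Nothing here is taken from the SiliconSystems paper: the function names, the sample numbers and the lifetime model (which assumes perfectly even wear) are all assumptions made for illustration.

```python
def write_amplification(nand_bytes_written, host_bytes_written):
    """Writes the controller makes to the NAND per write from the host."""
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return nand_bytes_written / host_bytes_written


def wear_spread(erase_counts):
    """Deviation between the most-worn and least-worn block (in erase cycles).

    A simple stand-in for wear-leveling efficiency; smaller is better.
    """
    return max(erase_counts) - min(erase_counts)


def estimated_life_years(capacity_gb, rated_cycles, host_gb_per_day, wa):
    """Idealized lifetime: total NAND write budget / daily NAND writes."""
    nand_budget_gb = capacity_gb * rated_cycles
    nand_gb_per_day = host_gb_per_day * wa
    return nand_budget_gb / (nand_gb_per_day * 365)


# Hypothetical measurements from a trial run of an application.
wa = write_amplification(nand_bytes_written=300, host_bytes_written=100)
spread = wear_spread([980, 1010, 1500, 995])
years = estimated_life_years(capacity_gb=32, rated_cycles=100_000,
                             host_gb_per_day=50, wa=wa)

print(f"write amplification: {wa:.1f}")
print(f"wear spread: {spread} cycles")
print(f"idealized life: {years:.1f} years")
```

Because the lifetime estimate divides by write amplification, halving that factor (through software changes or more over-provisioning) doubles the idealized life, which is why measuring it in the real application matters.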

What isn't stated in the article - but is a logical inference - is that even if your product design goal is to buy SSDs from other oems - the SiliconDrives can be used in your design process to capture information in a non invasive manner which is difficult or impossible to collect using other instrumentation. ...read the article (pdf), ...SiliconSystems profile, storage reliability


iStor Unlocks High Availability Features in Installed iSCSI ASICs

IRVINE, Calif. - October 7, 2008 - iStor Networks, Inc. has begun shipping a new version of its software, v2.5, as a no-cost upgrade for all its iSCSI storage solutions.

This software will provide dual-controller iS512 systems with the ability to automatically detect malfunctions in the operational controller and to switch to the redundant controller without loss of data, function or performance.

"This new software capitalized on the patented capabilities of iStor's ASIC technology enabling HA capability with no impact upon system performance before, during or after a controller failure," said Jim Wayda, iStor's VP of Software Development. "iStor designed its controllers from the very beginning to deliver advanced functionality such as HA and we are very proud that we have been able to demonstrate the investment protection inherent in iStor's approach of implementation..." ...iStor profile, iSCSI, storage reliability


Can You Trust Your Flash SSD's Specs?

Editor:- July 9, 2008 - STORAGEsearch.com today published a new article which asks - Can you trust your flash SSD specs?

The flash SSD market opens up tremendous opportunities for systems integrators to leverage solid state disk technology. But due to the diversity of products in the market and lack of industry standards - it's got tremendous risks as well.

The product which you carefully qualified may not be identical to the one that's going into your production line for a variety of reasons... ...read the article


Preparing for the Next Phase in the SSD Market Revolution

Editor:- June 25, 2008 - STORAGEsearch.com today called for new papers on the theme - "Understanding Data Failure Modes in Large Solid State Storage Arrays".

Multi-terabyte solid state storage arrays are seeping into the server environment in the same way that RAID systems did back in the early 1990s.

But just as those RAID pioneers learned that there was a lot more to making a reliable disk array than stuffing a bunch of PC hard disks into a box with a fan and a power supply - so too will multi-terabyte SSD users discover that problems which are undetectable or do no harm in small SSDs can lead to serious data corruption risks when those same SSDs are scaled up without the right architecture - and sometimes even with it in place.

I know from the emails I get that many readers think that once they've looked at the single issue of flash endurance - they've covered the bases for enterprise SSDs.

That's why storagesearch.com is planning to publish a collection of definitive technology articles to help guide the industry through this risky transition process.

The new articles will provide users with the theoretical justifications they need when they are faced with the difficult economic choices that come from deploying different types of SSDs (with different cost models) in diverse applications within their organizations. ...read the article


Disk Error Correction Company Gets $22 million Funding

Santa Clara, Calif. - April 9, 2008 - Link_A_Media Devices Corp secured $22 million in Series B financing.

The funding round, led by AIG SunAmerica Ventures, was secured from 4 additional financial and corporate investors - KeyNote Ventures, NEC Electronics, Micron and Seagate.

Link_A_Media Devices is developing a new class of chip controller resident data recovery solutions for HDDs and SSDs. These are designed to exceed the performance of conventional methods deployed in peripheral storage devices, as well as provide adaptive features that can be used during manufacturing to improve drive yields and product margins. ...Link_A_Media Devices profile

Editor's comments:-
MLC flash SSDs have high internal error rates, and data in failed drives is currently unrecoverable. It looks like Link_A_Media's technology could improve the odds of data recovery in failed devices which incorporate its technology (as well as reducing data errors while the SSD is still operational.)

Another side effect of their technology may be better performance in flash SSDs.

Link_A_Media says their IOP Buster architecture enables scalability within the controller to address various segments of SSD applications seamlessly. It enables faster Read and Write transfers.


Spectra Libraries will Log Tape Health Metrics

SNW, ORLANDO, FL - April 8, 2008 - Spectra Logic announced details of its soon to be released new Media Lifecycle Management software for its tape library customers.

MLM will reduce backup failures by tracking more than 30 pieces of information about individual LTO tapes and logging this on the tape's built-in flash chip. Information such as: born-on date, number of reads and writes, error rate, media quality, date of last access, application usage, encryption information, cleaning log and remaining capacity are tracked. MLM and BlueScale are compatible with all major backup applications. ...Spectra Logic profile

Editor's comments:-
already past the decline and now in the fall years of the tape library market, it looks like customers will get all kinds of useful information and services which they probably would have liked to have before. This sounds similar in concept to the SMART logs in hard disks and SiSMART in SiliconSystems' flash SSDs.


Pillar's Petabyte Arrays are 99.999% Available

San Jose, Calif. - April 7, 2008 - Pillar Data Systems today announced availability of the Pillar Axiom 500MC - a mission critical storage system.

The Pillar Axiom 500MC delivers up to 192GB of cache, with the ability to scale capacity to 1.6 petabytes. The system supports both fibre channel and SATA disk drives. Pillar guarantees 99.999% availability. ...Pillar profile
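
To put the 99.999% availability guarantee in perspective, the sketch below converts an availability percentage into permitted downtime per year. The function name is hypothetical and a 365-day year is assumed.

```python
def downtime_minutes_per_year(availability_percent, days_per_year=365):
    """Maximum downtime per year implied by an availability percentage."""
    minutes_per_year = days_per_year * 24 * 60
    return (1 - availability_percent / 100) * minutes_per_year


for level in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(level)
    print(f"{level}% available -> {minutes:.2f} minutes of downtime per year")
```

At "five nines" the budget works out to a little over 5 minutes a year, which has to cover every cause of unavailability, not just disk failures.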


Does Unhappy Notebook Maker Have High Rate of SSD Flash Backs?

Editor:- March 19, 2008 - Dell has dismissed as "untrue" a report, discussed in an article on CNET, which said that flash SSDs in notebooks are incurring double digit customer reject rates.


Study Enumerates Key Factors in Disk Array Failures

Editor:- March 6, 2008 - a recently published paper called - Are Disks the Dominant Contributor for Storage Failures? - reports on a 3 year study of nearly 2 million operating disks.

Among the many findings:- the annualized failure rate in near-line systems which mostly use SATA disks is approximately twice as high as in systems which mostly use fibre-channel disks. But other factors such as datapath resilience, presence or absence of RAID and reliability of the rack system components are just as significant contributors to storage reliability as the hard disks themselves. ...read the article


Are MLC SSDs Ever Safe in Enterprise Apps?

Editor:- February 27, 2008 - STORAGEsearch.com published a new article today called - Are MLC SSDs Ever Safe in Enterprise Apps?

This is a follow up article to the popular SSD Myths and Legends which, in early 2007, demolished the myth that flash memory wear-out (a comfort blanket beloved by many RAM SSD makers) precluded the use of flash in heavy duty datacenters.

This new article looks at the risks posed by MLC Nand Flash SSDs which have recently hatched from their breeding ground as chip modules in cellphones and morphed into hard disk form factors. It starts down a familiar lane but an unexpected technology twist (which arrived in my email this morning) takes you to a startling new world of possibilities. ...read the article


WEDC Targets Medical CompactFlash Market

Phoenix, AZ - December 19, 2007 - White Electronic Designs Corp is leveraging its defense industry experience and expertise to develop high-reliability modules for the growing portable medical device market.

According to the U.S. Census Bureau, there will be an expected 40 million persons in the U.S. over the age of 65 by 2010, driving the need for portable medical devices, especially for home use. The portable medical device market is driven by the same requirements and expectations as the defense segment; such as high quality and reliability, shorter development cycles, a well-defined and documented supply chain and extended product lifecycles. Among other products WEDC designs and manufactures one of the industry's first medical series CompactFlash cards. ...White Electronic Designs profile

Editor's comments:- WEDC has also recently published a paper - Is All CompactFlash Really Created Equal? (pdf) - which uses the medical instrumentation market as the backdrop for a discussion of flash SSD concerns similar to those analyzed in SSD Myths and Legends - "write endurance" - which looked at the enterprise server market.


Patent May Suit High Reliability SSD OEMs

MINNETONKA, MN - November 23, 2007 - ECC Technologies, Inc. announces that its parallel Reed-Solomon error correction designs and US Patent are immediately available for licensing.

PRS encoder and decoder designs allow parallel I/O storage devices to be designed with automatic, built-in backup (fault-tolerance). PRS applied to flash SSDs (for example) enables SSDs to be designed that can tolerate NAND Flash chip failures. PRS can also be applied to Hard Disk Arrays. Potential licensees can read about the PRS technology applied to SSDs and to HDDs on these preceding links. ...ECC Technologies profile, storage reliability

Editor's comments:-
in the early days of a fast growing technology market most vendors are too busy growing their revenue by selling products to customers. But when markets get big enough or growth rates slow down - another round kicks in - of harvesting money from those who succeeded in the market - but didn't protect themselves properly with patents.

When I was a young engineer several designs of mine did get patented. In one particular company I remember being asked to leaf through some 10 year old logbooks of my predecessors to find some prior art to help nullify a competitor's potential attack. I always preferred doing things my own way - so I grumbled at being asked to delve into these dusty old files. But I did find what my boss was looking for.


Panasas Solution Targets RAID Unreliability

FREMONT, CA - October 9, 2007 - Panasas, Inc. announced the Panasas Tiered Parity Architecture which the company claims is the most significant extension to disk array data reliability since Panasas CTO Garth Gibson's pioneering RAID research at UC-Berkeley in 1988.

With the release of the ActiveScale 3.2 operating environment, Panasas will offer an innovative end-to-end Tiered-Parity architecture that addresses the primary causes of storage reliability problems and provides the industry's first end-to-end data integrity checking capability.

Traditional RAID implementations protect against disk failures by calculating and storing parity data along with the original data.
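
As a minimal illustration of the parity idea (a generic single-parity XOR sketch, not Panasas's Tiered Parity design, and with made-up block contents), the parity block is the byte-wise XOR of the data blocks, so any one missing block can be recomputed from the survivors. The function names here are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Parity block stored alongside the data blocks of one stripe."""
    return xor_blocks(data_blocks)


def rebuild_missing(surviving_blocks, parity):
    """Recompute the single missing block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Illustrative stripe across three data disks (contents are invented).
stripe = [b"disk0 data", b"disk1 data", b"disk2 data"]
parity = make_parity(stripe)

# Pretend disk 1 failed: rebuild it from disks 0 and 2 plus the parity block.
recovered = rebuild_missing([stripe[0], stripe[2]], parity)
print(recovered == stripe[1])
```

Note that the rebuild has to read every surviving block in the stripe, which is why a media error on a second disk during a rebuild is so damaging.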

In the past 10 years, individual disk drives have become approximately 10x more reliable and over 250x denser than those protected by the first generation RAID designs in the late 1980s. Unfortunately, the number of disk media failures expected during each read over the surface of a disk grows proportionately with the massive increase in density and has now become the most common failure mode for RAID. A RAID disk failure can cause loss of all the data in a volume which may be tens of terabytes or more. Recovery of the lost data from tape (assuming that is all backed up) can take days or even weeks.

Other storage system vendors recognize this same issue and apply RAID 6, often called double parity RAID, to address this problem. Double parity schemes only treat the symptom of the failure, not the cause, and they carry substantial cost and performance penalties, which will only get worse as disk drive densities continue to increase.

Panasas Tiered Parity architecture directly addresses the root cause of the problem, not the symptom. Solving the storage reliability problem caused by these new 1TB and larger disks allows Panasas to build larger and more reliable storage systems that let users get more value from their data and are less expensive for IT to support.

"The challenges with storage system reliability today have little to do with overall disk reliability, which is what RAID was designed to address in 1988. The issues that we see today are directly related to disk density and require new approaches. Most secondary disk failures today are the result of media errors, which have become 250x more likely to occur during a RAID failed-disk rebuild over the last 10 years," said Garth Gibson, CTO of Panasas. "Tiered Parity allows us to tackle media errors with an architecture that can counter the effects of increasing disk density. It also solves data path reliability challenges beyond those addressed by traditional RAID and extends parity checking out to the client or server node. Tiered Parity provides the only end-to-end data integrity checking capability in the industry." ...Panasas profile

Editor's comments:-
the problem of data corruption in large data sets because of obsolete technology assumptions built into hard disks, interface and RAID products has been looming for several years. You can see articles and research about this on the storage reliability page.

Is the solution more reliable hard drives? better interfaces? or a smarter storage OS? Users can't wait another 5 years for ideal solutions because the symptoms are there today when you look. The Panasas solution sounds like a pragmatic tactical approach for some customers - but the industry is a long way from a better storage reliability mousetrap.


Why Sun will Shine with a New Lustre

SANTA CLARA, Calif - September 12, 2007 - Sun Microsystems, Inc. today said it will acquire the majority of Cluster File Systems, Inc.'s intellectual property and business assets, including the Lustre File System.

Sun intends to add support for Solaris OS on Lustre and plans to continue enhancing Lustre on Linux and Solaris OS across multi vendor hardware platforms. ...Sun Microsystems profile, Acquired storage companies

Editor's comments:-
I hadn't heard of this company before. A sure sign that they were heading straight for the gone away storage companies list without any deviations en route. Here's what I picked up from their web site present and past.

The Lustre product description (pdf) says - "the Lustre architecture was first developed at Carnegie Mellon University as a research project in 1999." The company's website started in about 2001 and they released Lustre 1.0 in 2003. By 2004 they had a product ready for a bigger market.

Strangely enough Solaris support isn't listed as a strong feature in their recent roadmap. So why does Sun want this technology? - Well - even if you're not in the supercomputer business - some technologies which start there eventually trickle down to the rest of us. "Zero single points of failure" - mentioned on their home page - is a good enough reason. As I wrote in my 7 year storage market predictions (2005) storage reliability is going to become a major headache in enterprise storage in the next 5 years.

See also:- Robin Harris's blog which explains the business background to CFS - "why aren't they rich?"


Tapewise Enterprise Checks Tape Media Errors

Farnborough UK - September 18, 2007 - Data Product Services today announces the release of Tapewise Enterprise.

Tapewise is software that writes data to a tape and then reads it again, tracking any errors, soft recoverable ones or unrecoverable ones, that occur. It streams a whole tape through a drive in this way and, with its Tape Error Map technology, produces a 3D graph showing errors encountered along the length of a tape when data was being read and written.

The user can decide what an acceptable error rate is and that boundary will be shown on the graph with any error rates above the user-defined norm instantly visible. The software supports a large number of tape formats: 3480; 3490; DLT; SDLT; 3590; 9840; 9940; T10000; LTOs 1, 2 and 3 and 3592. Costs start at $16,000 approx. A free 14-day evaluation copy is available. ...Data Product Services profile, Tape drives, Storage Testers


Noise Damping Techniques for PATA SSDs

Editor:- August 10, 2007 - SiliconSystems today published a new white paper called - "Noise Damping Techniques for PATA SSDs in Military-Embedded Systems."

This article looks at electronic signal integrity issues in integrating high speed PATA SSDs. It helps electronic designers understand how factors such as ground bounce, loading, power supply noise and signal trace mismatches can lead to false data or even device damage. Examples given in the tutorial style commentary include scope shots and logic analyzer traces. ...read the article, ...SiliconSystems profile, storage chips, storage analyzers

Editor's comments:-
the article gives a good grounding (couldn't resist that one) in the signal quality factors needed to get high reliability operation and is equally relevant to hard disks. To simplify the 20 page document:- if you connect reliable electronic modules using unreliable signal paths - that will compromise the integrity of the data. Logic states are virtual - but digital signals are real and can have completely different shapes to what you expect if you don't follow basic rules.


Squeak! - Green Storage - What's Green. What's Not

Editor:- June 24, 2007 - STORAGEsearch.com today published a new article - Green Storage - Trends and Predictions.

There's a lot of nonsense in the media about so called "Green Storage". This article blows away the puffery and clears the air for a better view of forward looking green data storage technologies. Reliability gets an honorable mention. Find out what's really green - and what's not. ...read the article


Hard Drive Unreliability Costs are Reason to Switch to SSDs

Aliso Viejo, Calif., May 30, 2007 - SiliconSystems, Inc. today announced the publication of a white paper called - "Solid-State Storage is a Cost-Effective Replacement for Hard Drives in Many Applications."

The paper cites data from Google and Carnegie Mellon University that indicates hard drive field failure rates are up to 15x greater than quoted in disk manufacturer data sheets. The white paper was developed by SiliconSystems to educate OEMs about the numerous technical and business decisions they must successfully navigate to select the best storage solution for their application. ...read the article (pdf), ...SiliconSystems profile

Editor's note:- storage reliability is a type 4 application in our SSD Market Adoption Model.



Debunking Misconceptions in SSD Longevity

Editor:- May 11, 2007 - BiTMICRO Networks today published a new article called - "Debunking Misconceptions in SSD Longevity."

It cites lifetime predictions from my own popular article - SSD Myths and Legends - "write endurance" and fires a warning shot aimed at some competitors by saying "some flash SSD makers have even quoted higher write endurance ratings than those provided by manufacturers of their flash memory components."

That's certainly true - but I knew when writing my article that endurance varies from batch to batch of flash chips within the same semiconductor fab process. Some SSD oems sample test and reject chips which are at the lower end of the distribution curve. That means their worst case numbers are better than would be the case by simply accepting merchant quality flash chips. Although starting from a different base of assumptions - BiTMICRO's article "conclude(s) that fears about the endurance limitations of SSDs are rightfully fading away."


Seagate Drops Notebook Drives

SCOTTS VALLEY, Calif - March 12, 2007 - Seagate Technology today announced the worldwide availability of a 7,200 RPM hard drive with free-fall protection for beefed-up laptop durability.

Momentus 7200.2 delivers up to 160GB of capacity and has a SATA interface. The hard drive is also offered with an optional free-fall sensor to help prevent drive damage and data loss upon impact if a laptop PC is dropped. The sensor works by detecting any changes in acceleration equal to the force of gravity, then parking the head off the disc to prevent contact with the platter in a free fall of as little as 8 inches. ...Seagate profile
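
To give a feel for the time budget such a sensor has, here is a rough back-of-envelope sketch. It assumes ideal free fall from rest with no air resistance, and it says nothing about Seagate's actual detection or head parking latency, which the announcement does not state.

```python
import math

G = 9.80665              # standard gravity, m/s^2
METRES_PER_INCH = 0.0254


def fall_time_seconds(height_m):
    """Time to fall height_m from rest under gravity alone: t = sqrt(2h / g)."""
    return math.sqrt(2 * height_m / G)


# The 8 inch drop height comes from the announcement above.
t = fall_time_seconds(8 * METRES_PER_INCH)
print(f"time available before impact: about {t * 1000:.0f} ms")
```

Under those assumptions the drive has roughly a fifth of a second to detect the fall and park the heads.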

Editor's comments:-
Hitachi revealed details about its similar ESP drop sensor in 2005. The drop sensor approach is better than nothing, but doesn't get around the unavoidable fact that hard disks can break when dropped.

Another approach is that of Olixir Technologies who have marketed repackaged high performance hard drives which can be dropped repeatedly onto a concrete floor from 6 feet and still survive.

But solid state disks are inherently even tougher than that because there are no internal moving parts to crash together. That's why they have been used in space ships, helicopters and missiles. In 2006 In-Stat predicted that half of all mobile computers would use SSDs (instead of hard disks) by 2013. It's not just the ruggedness and better power consumption. A video by Samsung demonstrates the advantages more graphically.


Hard Disk MTBF Specs Incredible - Say User Reports

Editor:- February 28, 2007 - an article published today in Channel Insider - "Hard Disk MTBF: Flap or Farce?" - casts serious doubt on the inflated MTBF claims made by all hard disk manufacturers.

Reviewing a number of recently published reliability studies from end users - the author David Morgenstern says "...there's a gap between the reliability expectations of manufacturers and customers. The current MTBF model isn't accounting accurately for how drives are handled in the field and how they function inside systems." ...read the article, storage reliability
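
For readers who want to compare a data sheet MTBF with the failure percentages seen in the field, here is a small sketch of the usual conversion. It assumes a constant failure rate (the exponential model) and drives powered on all year. The 1,000,000 hour figure and the 3% field rate are illustrative values only, not taken from the article.

```python
import math

HOURS_PER_YEAR = 8760  # assumes 24x7 operation


def afr_from_mtbf(mtbf_hours):
    """Annualized failure rate implied by an MTBF, assuming a constant failure rate."""
    return -math.expm1(-HOURS_PER_YEAR / mtbf_hours)


def mtbf_from_afr(afr):
    """Inverse conversion: the MTBF implied by an observed annual failure rate."""
    return -HOURS_PER_YEAR / math.log1p(-afr)


datasheet_mtbf = 1_000_000           # hypothetical data sheet value, in hours
afr = afr_from_mtbf(datasheet_mtbf)
print(f"implied annual failure rate: {afr:.2%}")

observed_afr = 0.03                  # hypothetical field observation
print(f"MTBF implied by field data: {mtbf_from_afr(observed_afr):,.0f} hours")
```

The point of the sketch is only that a large sounding MTBF corresponds to an annual failure rate below 1%, so field rates of a few percent imply a much lower effective MTBF.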


Google Reports on HDD Reliability

Editor:- February 20, 2007 - Researchers at Google published a paper at the recent Usenix conference about hard disk reliability and failure prediction - based on their own experiences as a large user of hard disk drives.

The fascinating paper describes how Google measured available metrics and status reports generated by the drives themselves and how this correlated with actual failure patterns. One of the key insights in the report is Google's view of how useful SMART parameters were for predicting failures.

"Our results are surprising, if not somewhat disappointing. Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count. In other words, models based only on those signals can never predict more than half of the failed drives... ...even when we add all remaining SMART parameters (except temperature) we still find that over 36% of all failed drives had zero counts on all variables." ...read the article, Hard disk drives, storage reliability

PS - the measured data on the percentage of disks which fail each year over a 5 year cycle under various conditions is essential reading for disk to disk backup contingency planning.



Agere Halves Power Consumption for Mobile HDD Interface

ALLENTOWN, Pa - February 6, 2007 - Agere Systems has begun shipping a new fully functional 90-nanometer TrueStore read channel.

The TrueStore RC1300 uses half the current required by the previous generation of read channel chip technology in this market segment and is 25% faster. It targets the 1.8-inch and smaller HDD form factor that provides critical data storage of 20 to 160 gigabytes in a wide variety of consumer devices. ...Agere Systems profile


STORAGEsearch.com Launches a New Strategic Directory - Storage Reliability

Editor:- June 20, 2006 - STORAGEsearch.com today launched a new directory dedicated to the subject of "Storage Reliability".

Reliability was named as one of the 3 most important future trends in storage in my state of the storage market article published last year. In that article I also predicted that uncorrectable failures in storage systems (due to embedded design assumptions made in earlier generations) could, if not dealt with by drive and interface designers, pose a more serious threat to enterprise computer systems than the Y2K bug in the late 1990s.

In addition to covering news about what the industry is doing to improve reliability in future drives, media and interfaces, STORAGEsearch has invited CTOs and technical directors of leading companies to write special articles about this subject - which will appear in the months ahead.

When most people think about storage reliability - they think about MTBF and thermal factors.

If an individual drive isn't reliable enough - wrap it in a RAID. If heat reduces the life of the disks - then cool them with more fans. If a memory system or interface is critical to an application - cocoon it with error detection and correction codes. Those are approaches which have worked adequately for the past few decades - but they are not good enough any more.

The demands for storage reliability are growing. Non stop applications need data that can be trusted to be available on demand. Compliance dictates that data should be readable not just years - but possibly decades after it was created. Meanwhile storage components, interfaces and systems are increasing in speed and capacity - while many of them are using error correction thinking that comes from earlier generations when data sets were smaller. As storage gets bigger - users face the risk of having uncorrectable errors in the heartland of their decision making data. That's why - all over the industry - manufacturers are starting to talk about new storage reliability initiatives.

There's also the risk that new storage technologies which get rushed to serve the needs of the consumer market - have not in fact been tested long enough to guarantee that they will not fail or start to corrupt data in the timeframe that enterprise customers care about.

Wrapping arrays of consumer disks based on new 2 year proven media technology in a big "enterprise" box - cannot guarantee that the data will still be readable in 5 years time. This is not a worry for consumers. They'll throw a failed disk away or buy a new one. But if your enterprise owns thousands of these disks (hidden by virtualization) it could be a big headache when the crumbly nature of the storage defects starts to hit the news. This is another of the many concerns we'll be covering in these pages. Storage media have failed in the past and been withdrawn because they didn't meet their original extrapolated lifetimes. Lessons from past errors are not always learned - they can be forgotten, and the errors can recur.

Storage reliability is changing. If you are interested - I hope you'll stay tuned to the new storage reliability channel here on the mouse site - as we report on these exciting developments in the months ahead.


Why Solaris will Get 128 Bit Addresses

Editor:- May 1, 2006 - an article today in InformationWeek.com discusses the Zettabyte File System - a new 128 bit addressing scheme for Solaris.

The article says that apart from the obvious advantage of being able to access more storage, Sun is apparently thinking about building in error correction into the new address scheme.

In a market forecast published last year in STORAGEsearch.com - Storage Reliability and failures were cited as one of the most important long term problems which oems and users will have to deal with.

The cause of the problem is that storage interfaces as well as modules and components (like disks, tapes, optical drives etc) use error correcting schemes which were designed for the much smaller and slower architectures of the past. As storage systems expand - new algorithms and correction schemes will be needed to guarantee that users don't get affected by data failures which are uncorrectable using today's products and protection schemes.

It's good to see that Sun is working proactively on one aspect of the problem. I've talked to many storage manufacturers about the upcoming reliability problem - which could be more serious than the Y2K threat - if not dealt with in advance. Sun is highly sensitive to data reliability concerns. Problems with its own SPARC server cache memory design back in 2001 - were cited at the time by many large users as reasons for considering a switch to Intel and PowerPC based systems.

See also:- SPARC Product Directory


Hard Disk Sector Size May Change

SUNNYVALE, Calif - March 23, 2006 - IDEMA today announced the results of an industry committee assembled to identify a new and longer sector standard for future magnetic hard disk drives.

This Committee recommended replacing the 30 year-standard of 512 bytes with sectors having the ability to store 4,096 bytes. Dr. Ed Grochowski, executive director of IDEMA US, reported that adopting a 4K byte sector length facilitates further increases in data density for hard drives which will increase storage capacity for users while continuing to reduce cost per gigabyte.

"Increasing areal density of newer magnetic hard disk drives requires a more robust error correction code, and this can be more efficiently applied to 4,096 byte sector lengths," explained Dr. Martin Hassner from Hitachi GST and IDEMA Committee member. ...IDEMA profile


Whitepaper Measures ROI of Disk Defragmentation

Burbank, CA - January 24, 2006 - Diskeeper recently sponsored IDC to write a whitepaper called - "Defragmentation's Hidden Value for the Enterprise."

This measured the ROI of defragmentation software in real customer sites. During the reliability test, the servers that were defragmenting files automatically had a higher uptime (5 to 10%) than the servers that didn't have defragmentation software automatically running. ...read the article (pdf), ...Diskeeper profile


ProStor Systems Unveils New Backup Technology

BOULDER, CO - November 2, 2005 - ProStor Systems made its public debut today by introducing the firm's RDX removable disk backup technology.

The RDX removable cartridge uses the same 2.5" hard disk media platters found in notebook computers and provides initial capacity up to 400GB (compressed). That will increase in line with conventional hard disk technology. But the difference is that RDX uses a new patent-pending error correcting format, which makes the data 1,000 times more recoverable than in a standard hard drive. ProStor says this means that RDX-stored data will be readable even after the cartridge has been archived and non-operating for more than a decade. ...ProStor Systems profile, Removable Storage, Disk to disk backup, Storage People

Editor's comments:-
the reliability of embedded storage modules and components such as disk drives, tape drives and optical disks will become an important issue for users in the next 7 years.

These products rely on inbuilt error correction algorithms which were designed over a decade ago - when storage capacities were much smaller. All those "ten to the minus something" numbers which you see quoted for error rates sound good - except that when your enterprise is managing Petabytes of data, at ever higher connection speeds, then you will start seeing uncorrectable data failures occurring every year - inside the storage, and beyond the scope of your RAID or other protection scheme to correct. ProStor is one of a new generation of storage manufacturers addressing this problem, and we'll soon publish a directory section dedicated to storage reliability issues such as this.
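
To see why a "ten to the minus something" error rate stops sounding good at Petabyte scale, here is a rough sketch of the arithmetic. The rate of 1 unrecoverable error per 10^14 bits read is an assumed, illustrative figure of the kind quoted on drive data sheets, and the model treats errors as independent, which real media may not obey.

```python
import math


def expected_unrecoverable_errors(bytes_read, errors_per_bit):
    """Expected number of unrecoverable errors when reading bytes_read bytes."""
    return bytes_read * 8 * errors_per_bit


def probability_of_at_least_one(expected_errors):
    """Chance of one or more errors, using a Poisson approximation."""
    return -math.expm1(-expected_errors)


TERABYTE = 10**12
PETABYTE = 10**15
ASSUMED_RATE = 1e-14   # illustrative: 1 unrecoverable error per 10^14 bits read

for label, size in (("1 TB", TERABYTE), ("1 PB", PETABYTE)):
    expected = expected_unrecoverable_errors(size, ASSUMED_RATE)
    chance = probability_of_at_least_one(expected)
    print(f"{label}: expected errors {expected:.2f}, chance of at least one {chance:.1%}")
```

With these assumed numbers a single Terabyte read is unlikely to hit an error, while reading a Petabyte is expected to hit dozens.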

.

"Reliability is more than just MTBF... and unlike Quality - it's not free. The battle for storage reliability never stops. It has to be fought - in every place where physics intrudes on data integrity. It must be fought and won anew - in every technology generation and in every new product design." - Zsolt Kerekes, editor

.

related guides
.
can you always assume that newer storage is more reliable?
by Zsolt Kerekes, editor

I created this dedicated storage reliability page here on StorageSearch.com in 2006.

It includes strategic articles, news and developments about this subject and is updated monthly.

I had flagged Storage Reliability as a long term strategic concern for the market in a trends article (published in 2005) - in which I said that the risks posed by uncorrectable data failures due to systemic design flaws in storage drives "could be more serious than the Y2K bug threat - if not dealt with in advance."

Most people didn't understand what I was talking about. They (wrongly) assumed that they could always depend on oems to design a workable level of reliability into their storage products. And if that wasn't good enough - then a wraparound layer of RAID supported by some type of data backup would work well enough for their needs.

In 2010 - as we got sucked into the SSD market bubble we began to see more customer concerns about the poor reputation which some leading storage oems are acquiring - due to shipping undependable and incompletely verified SSD designs.

Reliability of many types of storage products will get worse than it was before. So too will data integrity. As I warned 5 years ago - the assumption that storage reliability is a boring subject which enterprises don't need to worry about - will be shown to be wrong.

The only way to understand these trends and to avoid disastrous vendor choices is to read and understand more about this subject.
.
Data Integrity Study at CERN (pdf)
how fast can your SSD run backwards?
Latent Sector Errors in Hard Disk Drives
Increasing Flash Solid State Disk Reliability
SSD Myths and Legends - "write endurance"
Failure Trends in a Large Disk Drive Population (pdf)
reliability - editor mentions on STORAGEsearch.com
Are Disks the Dominant Contributor for Storage Failures?
Reliability Mechanisms for Very Large Storage Systems (pdf)
Reliability Modeling for Long Term Digital Preservation (pdf)
Empirical Measurements of Disk Failure Rates and Error Rates
Understanding Soft and Firm Errors in Semiconductor Devices (pdf)
Data Loss and Hard Drive Failure: Understanding the Causes and Costs
.
How fault tolerant PCIe SSD designs are supported in chips
PCIe in enterprise SSD designs - this video by PLX includes an introductory tutorial into PCIe and its performance and architectural capabilities for SSDs including automatic failover and multi-host capabilities. PLX's switch chips also support failover if the fault occurs in the PCIe switch fabric chips themselves. ... click to watch the PCIe in SSD video

extract - "...And in case one of the hosts fails and you want to connect the SSDs - or the devices connected to that host - to another host - that can be done automatically as well - and the surviving host can attach the devices that were attached to the failing host to itself and control it so that the system doesn't go down and the data stored in these devices doesn't get isolated from the main system."
.
Data Integrity Challenges in flash SSD Design
This article is written by Kent Smith Senior Director, Product Marketing, SandForce.

Since bursting onto the SSD scene in April 2009, SandForce has achieved remarkably high reader popularity. How did a company whose business is designing SSD controllers achieve this? - especially when the direct market for its products today numbers less than 1,000 oems.

The answer is - that if you want to know what the future of 2.5" enterprise SATA SSDs might look like - you have to look at the leading technology cores that will affect this market. Even if you're not planning to use SandForce based products yourself - you can't afford to ignore them - because they are setting the agenda.

Reliability is the next new thing for SSD designers and users to start worrying about.
A common theme you will hear from all fast SSD companies is that the faster you make an SSD go - the more effort you have to put into understanding and engineering data integrity to eliminate the risk of "silent errors." ...read the article
.
A reader contacted me to say he was worried about the viability and reliability of large arrays of SSDs as used in the enterprise.

He said - "One thing that you don't touch on but SSD reliability engineers (a small discipline) do is the internal power conversion itself. The DCDC converters down-stream from the holdup caps or batteries also have a finite operational life time and certain specific failure mechanisms. If these fail, there is NO recovery since the power interruption is immediate."
FITs, reliability and abstraction levels in SSDs
.
Can You Trust Your Flash SSD's Specs?
Editor:- I've noticed that the published specs of flash SSDs change a lot - from the time a product is first announced, then when it's being sampled, and later again when it's in volume production.

Sometimes the headline numbers get better, sometimes they get worse. There are many good reasons for this.

The product which you carefully qualified may not be identical to the one that's going into your production line for a variety of reasons...

And here's another thing to worry about...

The enterprise flash SSDs which you benchmarked yourself - may surprise you by running much slower when deployed in your own applications due to common "halo" errors which are implicit in the set ups of many performance test suites which were originally designed for HDDs. ...read the article



.

WARNING! - CONSUMER SSD

contents liable to change without notice
Editor:- June 13, 2014 - it seems that the risk of preplanned component substitutions by the original branded SSD maker (rather than merely the supply chain risk of counterfeits by persons unknown) is another uncertainty which readers in the consumer SSD market may now have to contend with. ...read more
.