
wrapping up SSD endurance

selective memories from 40 years of thinking about endurance

by Zsolt Kerekes, editor - StorageSearch.com - July 20, 2018

The intertwined and evolving relationships, actual and mythical, between the write endurance of raw flash memory chips and the reliability of the SSD drive or array in which they are used as the primary storage components have been one of the most popular topics read by readers of StorageSearch.com for over 12 years. However my own editorial coverage of that subject started several years before that - at a time when SSD makers were still nervous about talking openly about the very idea that their SSDs had any wear-out issues at all - issues which could lead to sudden death of the entire SSD.

I must admit that the enduring interest in endurance and the high popularity of these articles was at many times irritating for me - particularly when I had just written about other aspects of SSD design architecture (which I thought were just as important) - but the constant tides of memory cell shrinks and SSD performance progress kept pulling me back to write again and again about endurance. That includes many articles I have now forgotten but which can be found in the news archive.

Each time that leading SSD thinkers had reached some kind of consensus about the relationships between the different types of memories and how best to manage and deploy them in SSDs, a new innovation in flash controller design would come along to facilitate a stretch to applications elasticity which busted previous limits.

Early on in this long running saga I told my readers that there were few hard rules except these.

Raw memory endurance is not the same as SSD endurance.

The SSD can live much longer or much shorter than the average life expectancy of a typical memory cell - when viewed from the R/W perspective of host write requests.

These genuine differences come from differences in understanding and differences in design of controller architecture (which includes software).

The quality of designs and their footprint (chipcount, power usage and IP complexity) vary by orders of magnitude - even in SSDs which superficially are aimed at similar markets and which are being sold at the same time.

The risk of early burn out is real if you use an SSD in a way which the designers didn't intend.

On the other hand the cost of over specifying an SSD means that you may end up paying many times more than you needed to.

That's why there is no such thing as an ideal endurance figure in a flash memory, or an ideal DWPD for an SSD.

The applications context and business case are important boundary factors which define how endurance factors are managed in the optimally affordable SSD.
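
As a rough illustration of the rules above, the gap between raw cell endurance and drive lifetime is often approximated as lifetime = (capacity x rated P/E cycles) / (write amplification x host writes per day). The sketch below applies that approximation. The function name, its parameters and the sample figures are hypothetical and chosen only for illustration. It assumes perfect wear leveling and a constant workload, which real SSDs do not offer.

```python
def estimated_life_years(capacity_gb, pe_cycles, write_amplification,
                         host_writes_gb_per_day):
    """Rough drive lifetime in years, seen from the host's write requests.

    Assumes perfect wear leveling and a constant workload (simplifications).
    """
    if write_amplification <= 0 or host_writes_gb_per_day <= 0:
        raise ValueError("write amplification and daily writes must be positive")
    # Total data the flash cells can absorb before reaching their rated cycles.
    total_flash_writes_gb = capacity_gb * pe_cycles
    # Each host write costs more than one flash write inside the drive.
    flash_writes_gb_per_day = host_writes_gb_per_day * write_amplification
    return total_flash_writes_gb / flash_writes_gb_per_day / 365.0


# Same hypothetical flash (1,000 GB, 3,000 P/E cycles) and the same host
# workload, but two hypothetical controller designs.
efficient = estimated_life_years(1000, 3000, write_amplification=1.5,
                                 host_writes_gb_per_day=500)
wasteful = estimated_life_years(1000, 3000, write_amplification=15.0,
                                host_writes_gb_per_day=500)
print(f"efficient controller: {efficient:.1f} years")
print(f"wasteful controller:  {wasteful:.1f} years")
```

With identical memory chips the two hypothetical drives differ in life expectancy by the ratio of their write amplification, which is one way to see why raw memory endurance is not the same as SSD endurance.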

As I hope to sell StorageSearch.com and will no longer be writing much about the SSD market in 2019 I thought I'd write one last article which looks back at some of my memories about endurance.

nvm endurance in 1978 to 1980

My first encounter with the idea of write endurance in semiconductor memories came in 1978 as a theoretical warning in a datasheet for a new memory product called EAROM. In those days I used to read datasheets for chips and processors in the same way that editors nowadays read blogs and news stories. Having digested the datasheet (but not having any immediate need for that memory myself in my own designs) I wasn't greatly surprised when a company I later worked for - in 1980 - recalled their memory modules which had used those memories, because of failures in the field. The failures were due to premature remanence or wear-out - I didn't ask my colleagues which. Temperature may also have been a factor - because the AMD bit-slice processors to which the non volatile memory had been attached in the 1979 PLC design ran hot. (The solution to that design problem was battery backed CMOS RAM - an option which had been discounted earlier because of its dependence on the reliability of the attached battery.)

1984

The next time I met the subject of write endurance - in 1984 - it was another incidental thing - and not something I used in a production design. I noticed that when saving data to Intel's 2816 (an EEPROM) some of the locations could be written to with far fewer write pulses than others. This meant they had better cells and could be written to more quickly. But Intel also cautioned anyone who might play around with these chips that writing too aggressively could damage the chips. In later non volatile memory chips the write pulse mechanism was embedded in the chip. This made writes more foolproof. And I don't think that most electronic systems engineers gave any thought to the variability of what may be hidden behind the write mechanism for another 20 years.

2004 - flash takes aim at the server acceleration space of RAM SSDs

For me and my working life the subject of write endurance in flash memory became a big deal from 2004 onwards when flash SSDs began to infiltrate the server acceleration market. At first warnings from experts in the SSD industry that users would experience short working lives with flash SSDs due to burn out were proven to be correct. But this didn't deter users who mostly liked the performance gains they were getting and in some cases simply adjusted their buying behavior to refresh the early flash drives very frequently. Also, early burnout wasn't inevitable in arrays which used appropriate SSD controller architectures and related techniques.

2005 - a classic article on wear leveling

Another angle on SSD endurance was (and still is) longevity in industrial SSDs. In 2005 SiliconSystems published a classic paper on wear leveling here on StorageSearch.com and invested a lot of resources in ensuing years to ensure that industrial systems designers were familiar with the reliability factors associated with different elements in the design of SSDs.

Also in 2005 I began a news thread on storage reliability which datamined related stories from SSD news. In 2008 - as SSDs became a greater part of all the content here I collected up SSD reliability papers into one place. Even in those days endurance was just one part of the reliability mix as you can see.

By 2010 the SSD market had become much better acquainted with the idea that SSD controllers were an important and separate part of every SSD design and specialist companies in that area were surprised to learn how much hunger there was for trustworthy articles which explained what they did and why. My article Imprinting the brain of the SSD noted how big a change that was compared to before.

After that time most stories about SSD endurance became part of mainstream SSD news - but you can sample how some of the metrics and ideas appeared in archived versions of the SSD controller page and infrequent updates to my 2007 article - SSD endurance myths and legends.

An important sanity check is that most of the key people in the SSD industry (including designers of SSDs and founders of SSD companies and their biggest customers) were reading these pages during this period. And my self appointed aim was to help guide the industry forwards in directions which aligned with my own predictions.

2010 - and the start of the SSD market Bubble

2010 was Year 1 of the SSD Market Bubble. (For the significance of other years - see SSD market history.)

From this point the hitherto unknown and secretive SSD controller industry invested huge intellectual resources and amazing talent to enable each successive generation of (less reliable) flash memory to be used in reliable SSDs and systems. And as the SSD market continued to grow in revenue and strategic importance - the big manufacturers of memory - which earlier had little reason to understand the SSD potential of the chips they had been making - began to digest lessons from the SSD market and understand the applications for their raw memories better.

a 2018 perspective

Because users were deploying SSDs in different ways to earlier types of storage and memory the industry took another 5 to 10 years to characterize what a "good" level of endurance would be for particular applications.

And every few years when new types of 2D flash memory came into production with greater capacity but lower endurance - the wear mitigation arguments and analysis began again from new (and more challenging) starting points.

In the past 10 years the 5 factors which have done the most to set the stage for the market acceptance of flash memory endurance in usable SSD roles have been:-

adoption of DWPD - drive writes per day - as a standard way to signal which applications a new SSD has been optimized for.

Endurance became a knowable factor and users didn't need to be scared about its existence - as long as they chose the right SSD for the application.

The SSD market grew alongside other markets which it helped to create. So for example the idea of a low DWPD SSD - in cloud infrastructure - as a valued and desirable product would never have been in anyone's SSD business plan in 2004 - when the primary user value proposition for enterprise use was server acceleration.

Adaptive flash care management & DSP integrity IP in SSDs - was a movement in SSD controller design - to invest extremely sophisticated intelligence and noise filtering techniques inside each SSD which - among other things - enabled the use of lightweight (and less damaging) write pulses - compared to traditional hard codable ECC.

The adoption of big SSD controller architecture - using software (for example Software-Defined Flash (Baidu - March 2014), Host Managed SSD Technology (OCZ - October 2015) and other techniques and names) to leverage array level intelligence and also adaptive intelligence flow symmetry (see article for citations) to manage the movement of data and reliability in SSD arrays - has become the normal way of doing things.

Each AFA company and cloud integrator uses its own brew of standard and proprietary IP tricks and this is an area of design which is still evolving with in-situ processing.

Machine Learning as the discovery tool for the best ways to explore the optimum settings for R/W (timings and pulse shapes) when characterizing new generations of 3D flash.

This is a technique (first widely disclosed in 2013) which promises to maximise flash endurance when used in conjunction with lightweight SSD controllers. (As opposed to the kind of heavyweight energy and CPU footprint required by adaptive DSP to achieve similar ends.)

It was pioneered by NVMdurance.
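
To make the DWPD factor in the list above concrete, here is a minimal sketch of the arithmetic usually behind such a rating: DWPD = rated total writes / (capacity x warranty days). The total terabytes written (TBW) rating, the warranty period, the function names and the sample figures are assumptions introduced here for illustration and do not come from any particular vendor's datasheet.

```python
def dwpd_from_tbw(tbw_tb, capacity_tb, warranty_years):
    """Drive writes per day implied by a total-terabytes-written rating."""
    if capacity_tb <= 0 or warranty_years <= 0:
        raise ValueError("capacity and warranty period must be positive")
    return tbw_tb / (capacity_tb * warranty_years * 365)


def required_dwpd(host_writes_tb_per_day, capacity_tb):
    """Drive writes per day that a given workload actually needs."""
    if capacity_tb <= 0:
        raise ValueError("capacity must be positive")
    return host_writes_tb_per_day / capacity_tb


def is_suitable(rated_dwpd, needed_dwpd):
    """True when the drive's rating covers the workload."""
    return rated_dwpd >= needed_dwpd


# Hypothetical 1.92 TB drive rated for 3,504 TB written over 5 years.
rated = dwpd_from_tbw(tbw_tb=3504, capacity_tb=1.92, warranty_years=5)
# Hypothetical workload writing 0.5 TB per day to that drive.
needed = required_dwpd(host_writes_tb_per_day=0.5, capacity_tb=1.92)
print(f"rated {rated:.2f} DWPD, workload needs {needed:.2f} DWPD, "
      f"suitable: {is_suitable(rated, needed)}")
```

A rating far above the workload's need suggests the drive may be over specified for that application, while a rating below it points to the risk of early wear-out.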

the endurance rot stopped with 3D flash

The slide to worsening endurance ratings in raw flash memory seemingly paused, and ratings improved, during the transition from 2D to 3D. This was due to the use of more expensive materials, more charge being trapped in each cell, and higher capacity coming from more planes of flash cells rather than a single plane of smaller cells.

beyond flash?

All nvms have endurance issues - although in some they are more serious than in others when compared to flash. For example first generation 3DXpoint PCIe SSDs from Intel had DWPD ratings similar to or worse than best in class flash SSDs. Whereas other memory types such as MRAM and FRAM have endurance which is orders of magnitude higher than flash - although their data capacity per chip is currently orders of magnitude smaller than flash.

It seems likely that DWPD will remain a useful way to select SSDs for storage. However the best way to characterize the reliability (and performance) of memories in new tiered memory systems (DIMM wars and cloud adapted memory) is a problem which is as far away from any commonly agreed useful solutions as today's neatly ordered classifications and segmentations of SSDs were 10 years ago.

conclusion

The subject of memory endurance and how that relates to the reliability of SSDs and tiered memory is one which has provided much food for thought for millions of my readers in past years.

But there have been lighter moments too.

I combined a serious historic narrative with some attempt at humor in the "naughty flash" description in sugaring flash for the enterprise.

Whereas my article razzle dazzling flash SSD cell care and retirement plans was intended to show just how ridiculous some of the comparative endurance management claims in the SSD market had already become in 2012.


SSD endurance myths and legends

We all know (or think we know) that drinking a bottle of vodka every day might reduce your longevity prospects.

And maybe smoking 200 cigarettes a day should have the same negative effect too.

But then we heard the story about that Russian peasant who's been living in the mountains on a diet of vodka, cigarettes and freeze dried goat - who sneaked down to the village to ask if it was safe to move back.

Is Lenin dead yet? - he asked.

He had been asking that very same question every Spring for over 90 years....

razzle dazzling flash SSD cell care and retirement

If you could go back in time and take with you a factory full of modern memory chips and SSDs (along with backwards compatible adapters) what real impact would that have?

are we ready for infinitely faster RAM?

Choosing a slow interface for a high capacity SSD is the route whereby one innovative enterprise SSD maker was able to offer "no limits DWPD".

what's the state of DWPD?

The semiconductor memory business has toggled between under supply and over supply since the 1970s.

an SSD view of past, present and future boom bust cycles in the memory market

Enterprise DRAM has the same latency now as in 2000 (or worse). The CPU-DRAM-HDD oligopoly optimized DRAM for a different set of assumptions than we have today in the post modern SSD era.

latency loving reasons for fading out DRAM

To be? or Not to be?
hold up capacitors in 2.5" MIL SSDs

0 to 3 seconds - aspects of extreme SSD design

Why can't SSD's true believers agree upon a single coherent vision for the future of solid state storage? (They never did.)

the SSD Heresies.

The memory chip count ceiling around which the SSD controller IP is optimized - predetermines the efficiency of achieving system-wide goals like cost, performance and reliability.

size matters in SSD controller architecture

the value of comparing one thing you might not understand so well to the size of a more familiar nother

re RATIOs in SSD architecture

forcing endurance stress helps reliability assessment of 3D nand

For 40 years we got so used to the idea that early wear out in flash is a bad thing and excessive writes were something to be avoided or mitigated - especially before the SSD had even shipped to the customer.

But a research paper in August 2018 showed how deliberately wearing out a small number of blocks in 3D nand flash (using 10K P/E cycles) can be used as a tool to help calibrate the memory and measure parameters which can be used to increase the reliability of the remaining flash blocks using new architectural techniques.

For more about the paper and architectural aspects see 3d nand and new dimensions in SSD controller architecture