
3D nand and new dimensions in SSD controller architecture

research exploits layer based differences

by Zsolt Kerekes, editor - StorageSearch.com - August 28, 2018
In the early years of nand flash memory adoption in the enterprise (for simplicity let's call this period the MLC (pre-TLC) / pre-adaptive R/W / pre-DWPD era) there wasn't the same kind of established delineation of application roles for new SSDs as there is now - because SSDs were still carving out new reasons to be used in design wins (almost one startup at a time). It also happened quite often that when a new product was announced there would be significant gaps in the datasheets compared to what you needed to know to judge how the product might behave (without having to invest large amounts of resources into benchmarking and evaluations).

To help my readers in this formative period I suggested several shortcuts which could help potential integrators group new SSDs into sets determined by the key design and architectural decisions behind them.

These enabled anyone who thought a lot about SSD controllers to decide for themselves - yes this new one is in this set and so some of its characteristics are preordained - it's better at this, worse at that - irrespective of whether there were any datasheets or benchmarks or whether we believed that such benchmarking had been correctly set up (which for a long time it wasn't). I know from the conversations I had with many systems designers that they found some of my "filtering" terms to be useful shortcuts - and most of the companies which were creating these new products found it useful to answer my questions about the internals of their designs and thinking.

But all such rules of thumb have a limited shelf life. And as I used to remind readers in my year end articles - it's just as important to discard old ideas which at one time were useful as it is to adopt new ones.

One of the simplest SSD design filters which I wrote about was something I called the difference between big and small SSD controller architecture (2011).

At the heart of this was the question - how many memory chips has the controller been optimized for? Because if it can work with a single digit set of chips then the controller can't employ as many clever strategies (to help reliability, performance and quality of performance) as another design which has been designed with a floor level of tens or hundreds of chips. It was a simple idea and it was a useful way to look at controller designs over a 10 year period.

But a paper I saw this month made me reconsider whether that division still works. And even to ask the question - are there any small architecture SSDs left at all?

The paper in question was - Improving 3D NAND Flash Memory Lifetime by Tolerating Early Retention Loss and Process Variation (pdf) by Yixin Luo and Saugata Ghose (Carnegie Mellon University), Yu Cai (SK Hynix), Erich F. Haratsch (Seagate Technology) and Onur Mutlu (ETH Zürich) - which was presented at the SIGMETRICS conference in June 2018.

This paper - among other things - suggests several new (not previously publicly written about) design approaches for tall (30 layers upwards) 3D nand flash. These take characterization assessments made on a small sample of cells in a memory chip and leverage them with architectural support in an SSD controller to increase SSD reliability or performance - so as to make enterprise use of such memories more attractive.

One of the ideas discussed in the above paper is that the quality of cells varies from layer to layer. This in itself is not new. What is new however is that the authors show how the spread of reliability can be measured, modeled and harvested.

The authors say - "We are the first to provide detailed experimental characterization results of layer-to-layer process variation in real flash devices in open literature. Our results show that the raw bit error rate in the middle layer can be 6x the raw bit error rate in the top layer."

Among the many chip dependent design approaches in the paper, here are two which I've singled out.
  • LaVAR - Layer Variation Aware Reading
  • LI-RAID - Layer-Interleaved RAID


Layer Variation Aware Reading (LaVAR) - "reduces process variation by fine-tuning the read reference voltage independently for each layer."

This idea - which properly belongs in the realm of adaptive R/W technology (rather than big controller architecture) - suggests a simple model which can predict a best guess read threshold voltage for each layer (at a given P/E level), based on top/bottom samples extracted after endurance conditioning a small number of blocks in the memory.
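A minimal sketch of the shape such a scheme could take (my own assumptions - not the authors' model or code - and the offset search, DAC step range and helper names like sweep_read_voltage are invented for illustration):

```python
# Hypothetical LaVAR-style sketch: learn a per-layer read reference voltage
# offset from a few endurance-conditioned sample blocks, then apply it on the
# normal read path. sweep_read_voltage(block, layer, offset) is an invented
# stand-in returning the raw bit error count observed at that Vref shift.

def learn_layer_offsets(sample_blocks, layers, sweep_read_voltage,
                        candidate_offsets=range(-40, 41, 5)):
    """Pick, per layer, the Vref offset (in arbitrary DAC steps) that
    minimises raw bit errors on the sampled, pre-cycled blocks."""
    offsets = {}
    for layer in layers:
        best = min(candidate_offsets,
                   key=lambda off: sum(sweep_read_voltage(b, layer, off)
                                       for b in sample_blocks))
        offsets[layer] = best
    return offsets          # small table: one entry per layer, kept in SRAM

def read_page(chip, block, layer, page, offsets, raw_read):
    """Normal read path: apply the learned per-layer offset before reading."""
    return raw_read(chip, block, layer, page, vref_shift=offsets[layer])
```

The attraction is that the run time cost is just a table lookup per read - the expensive part (the sweep) happens once on a handful of sacrificial blocks.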

On its own - this concept would be enough to make the paper a must-read for controller designers.

My gut feel is this points the way to a middle course of run time controller design between 2 well known philosophies:-
  • the adaptive DSP ECC approach - which combines chip-learned characterization with heavyweight run time processing power in the target controller and
  • the machine learning / lifetime-based characterization models proposed by NVMdurance in 2013 - which enable lightweight run time processing - based on a model which extrapolates the best figures for a population of all memory chips - but is learned from a factory based characterization (rather than learned from the local chips attached to the controller).


Layer-Interleaved Redundant Array of Independent Disks (LI-RAID) - "improves reliability by changing how pages are grouped under the RAID error recovery technique. LI-RAID uses information about layer-to-layer process variation to reduce the likelihood that the RAID recovery of a group could fail significantly earlier during the flash lifetime than the recovery of other groups."

This - to me - starts to look like another "big controller" architecture idea - but the authors say it can be used in an SSD with just a couple of chips. They also extend the concept to pairing the best predicted blocks in one memory chip with the worst predicted blocks in another memory chip in the same SSD.
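As a rough illustration of the grouping idea (my own simplification, not the authors' exact page layout) - a controller could stripe a RAID group across chips so that a chip's statistically worst layers end up grouped with other chips' better layers:

```python
# Hypothetical LI-RAID-style sketch: form RAID groups so that no group is
# built entirely from the unreliable middle layers of every chip. Layers are
# ranked per chip from characterization data (best first); the ranking itself
# is assumed to come from something like rber_by_layer() above.

def li_raid_groups(num_chips, layer_rank_per_chip):
    """layer_rank_per_chip[chip] is a list of layers sorted best -> worst.
    Returns a list of groups; each group holds one (chip, layer) per chip,
    with the per-chip ranking rotated so reliability levels are mixed."""
    num_layers = len(layer_rank_per_chip[0])
    groups = []
    for g in range(num_layers):
        group = []
        for chip in range(num_chips):
            # stagger each chip's ranking so a group never takes every chip's worst layer
            rank = (g + chip * num_layers // num_chips) % num_layers
            group.append((chip, layer_rank_per_chip[chip][rank]))
        groups.append(group)
    return groups

# With 4 chips and 48 ranked layers, group 0 would combine chip 0's best layer
# with chip 1's 12th-best, chip 2's 24th-best and chip 3's 36th-best - instead
# of putting four equally weak middle layers into the same recovery group.
```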

You can read about earlier uses of RAID thinking in SSD controller designs (including variable size planes) in my RAID systems page.

But it's clear that exploiting the differences between layers in a 30 to 100 layer or so 3D memory chip starts to look a lot like big controller architecture.

Previously it was the number of different identifiable conceptual toys in the box which set the limits to system level design tricks. Now it's layers in the same chip too.

new thinking in SSD controller techniques reveals "layer aware" properties exploitable in 3D nand flash
Editor:- August 28, 2018 - A new twist using RAID ideas in SSD controllers has surfaced recently in a research paper - Improving 3D NAND Flash Memory Lifetime by Tolerating Early Retention Loss and Process Variation (pdf) by Yixin Luo and Saugata Ghose (Carnegie Mellon University), Yu Cai (SK Hynix), Erich F. Haratsch (Seagate Technology) and Onur Mutlu (ETH Zürich) - which was presented at the SIGMETRICS conference in June 2018.

The authors say that in tall 3D nand (30 layers and upwards) the raw error rate in blocks in the middle layers is significantly worse (6x) than in the top layer. Therefore to enable more reliable and faster SSDs using 3D nand for enterprise applications they propose a new type of RAID which pairs the best predicted half of a RAID word with the worst predicted half from another chip in the same SSD.

This new RAID concept starts to be feasible in a very small population of chips - unlike traditional 2D nand schemes which need more chips to be installed in the SSD.

The new RAID is called Layer-Interleaved RAID (LI-RAID) - which the authors say "improves reliability by changing how pages are grouped under the RAID error recovery technique. LI-RAID uses information about layer-to-layer process variation to reduce the likelihood that the RAID recovery of a group could fail significantly earlier during the flash lifetime than the recovery of other groups." ... read the article (pdf)

Editor's comments:- the new RAID is just one of many gems in this research paper. Others include the discovery that retention in 3D nand involves a significant short term charge loss (in the first few minutes after writes), and that an endurance based characterization of a small part of each chip can be used to predict an optimized layer dependent threshold read voltage for all the layers in the chip. I've discussed the significance of adding the concept of "layers" to "number of raw chips" to the thinking in SSD controller design in my recent home page blog.