DRAM's growing indeterminate latencies underlie the market opportunities for tiering flash as RAM and big data memory fabrics



latency loving reasons for fading out DRAM in the virtual memory slider mix

some glimpses into the roots of DRAM's indeterminate latencies which underlie market opportunities for tiering flash as RAM and memory fabrics

by Zsolt Kerekes, editor - StorageSearch.com - March 1, 2016

Retiring and retiering enterprise DRAM was one of the big SSD ideas which took hold in the market in 2015 with 9 companies announcing significant product plans for this market. But looking back on my own past editorial coverage of SSD DIMM wars, rethinking RAM etc I realized I hadn't reported much about the details of DRAM's growing latency problems.

We know for certain that there are problems - because otherwise the flash inspired solutions which replace portions of DRAM with slower tiered flash (and which we know do work) wouldn't work so well.

But although we can guess at what some of these DRAM behavior problems may be - I thought it would be useful to round up a small set of linked articles which provide a clearer overview into what makes DRAM latency so variable. Here are some links which I found useful.

Big DRAM chips haven't got faster

Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture (pdf) is a classic paper by researchers at Carnegie Mellon University.

In the same way that hard drives chose capacity improvements rather than latency after the 15K RPM point (which was one of the enablers of the SSD market) there has also been a similar pattern in the 2D DRAM market.

The authors say - "In stark contrast to the continued scaling of cost-per-bit, the latency of DRAM has remained almost constant. During the same 11-year interval in which DRAM's cost-per-bit decreased by a factor of 16, DRAM latency decreased by only about 30%... From the perspective of the processor, an access to DRAM takes hundreds of cycles - time during which the processor may be stalled, waiting for DRAM... However, the high latency of commodity DRAM chips is in fact a deliberate trade-off made by DRAM manufacturers."
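
To put those two quoted figures on a common footing, here's a rough back-of-envelope sketch (my arithmetic, not the paper's) which converts them into compound annual rates. The only inputs are the numbers in the quote above; the function and variable names are just illustrative.

```python
def annual_rate(total_factor: float, years: float) -> float:
    """Compound annual improvement factor which yields total_factor over years."""
    return total_factor ** (1.0 / years)

YEARS = 11
cost_factor = 16.0            # cost-per-bit fell by a factor of 16
latency_factor = 1.0 / 0.70   # latency fell by about 30%, i.e. to 0.70x

cost_per_year = annual_rate(cost_factor, YEARS)
latency_per_year = annual_rate(latency_factor, YEARS)

print(f"cost-per-bit: ~{(cost_per_year - 1) * 100:.0f}% better per year")
print(f"latency:      ~{(latency_per_year - 1) * 100:.1f}% better per year")
```

On those figures cost-per-bit improved by roughly 29% a year while latency improved by roughly 3% a year.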

Their paper goes on to discuss the cost benefits of introducing caches and tiering inside DRAM to improve latency (for some types of applications).

Lessons to take away from this? - DRAM latency (as seen from outside the chip) is already a market optimized spread of value judgments - determined by the capacity of the memory chip being sold and its related applications - rather than simply an absolute value which emerges as a complete whole out of the design of a raw memory cell.

latency picks and tricks in DRAM controller design related to their system impacts

As mentioned above - when seen from the processor point of view - minimum DRAM latency has stayed about the same (in nanoseconds) and has gotten worse in CPU clock cycles through more than a decade of successive memory interface generations.
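
A minimal sketch of why a constant latency in nanoseconds looks worse from the processor's side. The 50 ns latency and the CPU clock rates below are assumed values picked purely for illustration - they aren't taken from any of the cited papers.

```python
def stall_cycles(latency_ns: float, cpu_clock_ghz: float) -> float:
    """CPU cycles which elapse during one memory access of latency_ns."""
    return latency_ns * cpu_clock_ghz   # ns * (cycles per ns)

DRAM_LATENCY_NS = 50.0   # assumed, and held constant across CPU generations

for clock_ghz in (1.0, 2.0, 3.0, 4.0):   # hypothetical CPU clocks
    cycles = stall_cycles(DRAM_LATENCY_NS, clock_ghz)
    print(f"{clock_ghz:.1f} GHz CPU -> {cycles:.0f} cycles per DRAM access")
```

The memory hasn't got any slower in absolute terms - but each access costs the faster processor more cycles of potential work.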

The performance aspect which has indeed improved is the amount of data you get at that time due to internal parallel requests to similar internal arrays of memory cells - hence more bandwidth.

The cost trade-offs in different design approaches which can improve individual dimensions of latency were explored in the paper Why DRAM Latency is Getting Worse (2011) and a video (50 minutes) by Mark Greenberg, a veteran of memory controller marketing who at that time was at Cadence and is now at Synopsys.

These discussions surveyed design approaches ranging from changing the memory, to changing the memory controller behavior, to changing the ways that CPUs access memory.

balancing tradeoffs of latency in DRAM design video

This survey of permutations of design techniques and balancing the output consequences (cost, power, latency and bandwidth levels) is very useful if you're trying to understand why the memory industry has settled on the kinds of products and interfaces which we see in the market today.

Among the interesting concepts presented is a way to think about the different types of masters whose requirements must be satisfied by the memory.

latency sensitive masters - need low latency all the time (CPUs and network packets)

bandwidth sensitive masters - need to maximize throughput (DMAs, GPUs)

maximum latency impacted masters - can tolerate up to but never beyond fixed limits (display drivers)

These classifications help the memory traffic controller decide the best ordering of transactions in a mixed workload.
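
Here's a toy sketch of how those 3 classes of master might feed into an ordering decision. It's my own simplification and not the arbitration scheme described in the video: the Request fields, the guard margin and the master names are all hypothetical, and a real controller would also have to weigh bank state, row hits and refresh.

```python
from dataclasses import dataclass

@dataclass
class Request:
    master: str        # hypothetical master name, e.g. "cpu"
    kind: str          # "latency", "bandwidth" or "deadline"
    arrival: int       # cycle at which the request was queued
    deadline: int = 0  # latest acceptable service cycle ("deadline" kind only)

def pick_next(queue: list[Request], now: int, guard: int = 20) -> Request:
    """Choose the next request to service from a non-empty queue."""
    # 1. A maximum-latency master which is close to its limit goes first.
    urgent = [r for r in queue if r.kind == "deadline" and r.deadline - now <= guard]
    if urgent:
        return min(urgent, key=lambda r: r.deadline)
    # 2. Otherwise latency sensitive masters jump the queue.
    latency = [r for r in queue if r.kind == "latency"]
    if latency:
        return min(latency, key=lambda r: r.arrival)
    # 3. Everything else is served oldest first.
    return min(queue, key=lambda r: r.arrival)

queue = [
    Request("dma", "bandwidth", arrival=0),
    Request("display", "deadline", arrival=5, deadline=200),
    Request("cpu", "latency", arrival=10),
]
print(pick_next(queue, now=12).master)    # cpu
print(pick_next(queue, now=185).master)   # display
```

Even in this toy version the bandwidth sensitive DMA request - which arrived first - can be passed over repeatedly, which hints at where the long worst case latencies come from.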

The complexity needed to satisfy optimal average outcomes results in a range of worst case latencies which can be in the region of 600 to thousands of DDR4 memory cycles.
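
To get a feel for what that means in time, here's a rough conversion sketch. It assumes that a "memory cycle" is one DDR clock cycle and it uses DDR4-2400 purely as an example speed grade. The 3,000 cycle case is an arbitrary stand-in for "thousands".

```python
def cycles_to_ns(cycles: int, data_rate_mt_s: int) -> float:
    """Convert DDR clock cycles to nanoseconds.

    DDR transfers data twice per clock, so the clock rate in MHz
    is half of the MT/s rating.
    """
    clock_mhz = data_rate_mt_s / 2
    return cycles * 1000.0 / clock_mhz

for cycles in (600, 3000):   # 3000 is an arbitrary stand-in for "thousands"
    print(f"{cycles} cycles at DDR4-2400 = {cycles_to_ns(cycles, 2400):.0f} ns")
```

Under those assumptions 600 cycles is 500 ns - and the worst cases run into microseconds.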

This video analyzes many complex what-if design approaches, and most of you won't need to know that level of detail. Instead, one way to think about DRAM controller latency management is that a few simple ideas can generate a lot of complex behavior.

Just as with flash SSDs we've got used to the idea that the key determinant of SSD personality is the controller rather than the raw flash - we've reached a similar point in DRAM systems.

multicore workloads complicate DRAM predictability

A simple top level introduction to the nature of the problem is this - Debunking DRAM Determinism, Navigating NUMA Non-Uniformity from Diablo Technologies. This points out that in modern multicore processors the R/W activities of one core can adversely affect the timing of data needed by another core due to the design of DRAM systems.

If you want more details... the awesome complexity of what happens inside the memory system and just how difficult it can be to predict DRAM latency - even when you know the hardware environment - can be sensed by this deep dive research paper Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems (pdf) by Heechul Yun, Assistant Professor at the University of Kansas.

Among other things Yun discusses the scale of errors which can arise from simplifying assumptions about the number of interference events which can be generated by the activity of each parallel core.

let's skip cache misses

Another factor we have to take into consideration within the latency profile of DRAM is the proportion of time spent waiting to fetch new data from upstream devices.

This is often analyzed in the context of server cache hits and misses. Real numbers are very specific to the workload and architecture, and to the relative sizing and speed of the DRAM itself compared to the next level up in memory, SSD or other storage.

It's the kind of thing you'd look at with trace data from real applications if you're designing a new server SSD.

The history of investigating computing delays due to the unaffordability of having all the data all the time in the same physical memory goes right back to the roots of semiconductor memory. And the story is masterfully told in Before Memory was Virtual (pdf) a classic paper by Peter J. Denning who pioneered the principles of managing memory controller locality for virtual memory in the 1960s.

Denning's history of the development of virtual memory describes the needs and solutions of memory systems over a 40 year period into the web server age and he concluded that "Virtual memory is one of the great engineering triumphs of the computing age."

Denning's paper tells you everything you need to know about the principles of memory system behavior within a hierarchy of latencies.

I was hoping to add into this blog links to modern papers which include the same kind of raw memory interference data which Denning and his peers accessed in their labs - but written around modern DRAM and flash mixed memory environments. The closest matches are vendor sponsored benchmarks of new accelerator SSDs. But these tend to be post-design justifications rather than useful pools of raw design knowledge.

You might say that this is the zone where the whole literature of SSD acceleration typically begins. But such SSD case studies are very product dependent and outside the scope of this current blog.

what have we learned so far?

DRAM latency in modern servers isn't as fixed or as good as most people commonly imagine.

Despite that - most applications can tolerate a small proportion of worst case latencies which are very much longer, provided the average latency is low enough.
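
A minimal sketch of why a thin tail of very slow accesses can be tolerable. The typical latency, the worst case latency and the fractions below are all assumed numbers chosen for illustration - they aren't measurements.

```python
def mean_latency_ns(typical_ns: float, worst_ns: float, worst_fraction: float) -> float:
    """Mean access latency when worst_fraction of accesses hit the worst case."""
    return (1.0 - worst_fraction) * typical_ns + worst_fraction * worst_ns

TYPICAL_NS = 100.0    # assumed typical access time
WORST_NS = 10_000.0   # assumed worst case: 100x slower than typical

for fraction in (0.0001, 0.001, 0.01):
    mean = mean_latency_ns(TYPICAL_NS, WORST_NS, fraction)
    print(f"{fraction:.2%} slow accesses -> mean latency {mean:.0f} ns")
```

With these made-up numbers a 100x slower worst case barely moves the average when it affects 1 access in 10,000 - but it roughly doubles the average when it affects 1 in 100.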

If you think about it in a historic context - those application latency pain points - when the next level up data access went from server DRAM to hard drive - are what helped to open the door to enterprise SSDs.

However, the nature of data patterns was never as simple as that implies - because enterprise hard drives and RAID arrays were always front ended by RAM caches of their own which hid a lot of detail. The adoption of SSDs changed the relative values of the worst case latencies - but they will always persist.

All the DRAM centric improvements in applications performance which have been delivered in the enterprise in the decade leading up to the start of the SSD DIMM wars era were almost entirely due to having larger capacities of installed DRAM (and adaptations in memory aware software) coupled with leveraging upstream latency improvements from the insertion of SSDs into the storage mix - rather than due to any improvements in raw DRAM latency.

rules of thumb re future DRAM ratios?

Something which I found useful in earlier phases of SSD architecture adoption was thinking about ratios of things.

Examples were:- the ratios of flash to HDD, ratios of DRAM in flash SSD caches, ratios of SSD in servers compared to SAN, ratio of HDD servers which could be replaced by SSD enhanced servers etc.

When you come across application success stories which include ratios of such things - it gives you a feel for what may be possible in future applications of your own - without the need to understand all the internal architecture details straight away. (You can add to your own understanding after you've decided that your estimates of what is possible make it worthwhile investing precious time and brain power.)

At this stage of the SSD DIMM wars market - where almost none of the preannounced products have shipped yet and before many others have even been designed yet - there's a dearth of data and a lot of guessing and extrapolation to be done.

The way I think about sizing the retiering DRAM market is as follows.

Upper bounds and lower bounds of ratios can be thought of as fitting into these limiting bands - which are the adjacent architecture bands in terms of the SSD adoption user value propositions.

A lower bound limit?

The nearest similar thing we have here is the SSD CPU equivalence idea. When I wrote about it in 2003 - as a business justification of a future enterprise SSD accelerator market - I suggested a 2x to 4x replacement ratio (1 new SSD server replaces 2 to 4 HDD servers) based on what I'd seen in the labs over a decade earlier. Since then we've had the benefit of more than 10 years of real market history - which suggests that a replacement ratio of 3x is conservative and works well for many applications while higher ratios (over 20x) can also be valid in some intensive applications (like gaming).

An upper bound limit?

The next level up from DRAM (in classical computer architecture) is storage. We've got 2 pools of raw data to draw from here.

the experience of the hybrid drives market, and

the experience of the hybrid array market

Personally I prefer to ignore the former - because the hybrid drives market (as I predicted at the outset) hasn't been tremendously successful. So - partly for that reason - I think it's safe to ignore the 50x types of ratios we saw in SSHD drives.

And within the enterprise hybrid array market there are perils too - due to the need to weigh the fractional benefits of capacity multiplication (enabled by virtualization, dedupe etc) against the weighted average speed up. Also you have to ask - which is closer to the DRAM memory hybrid model which we're trying to construct? A hybrid SAN? Or a micro-tiered hyperconverged server?

2 sanity checks call out to me here.

The first one says - remember that slider switch which Tegile used to talk about in 2014!

The second says - remember that guide to data compression in cache and DRAM which was a news story in April 2015!

The Tegile slider (and the company's inevitable later launch of a non hybrid - that's to say a pure AFA product line) suggests that when it comes to an instantly available choice of speed or cost - and where they have the supporting applications decision data - a significant proportion of users slide the dial towards speed.

Whereas the cache compression techniques story suggests that the interface between pure DRAM and pure flash - as storage class memories - will probably include real-time compression options. And these in themselves could add at least another 2x change over and above any ratio (of flash to DRAM) derived from other analysis.

Taking those factors above into account suggests that 3x (the number we get from SSD-CPU equivalence) which becomes 6x (if you factor in compression) is too pessimistic to view as a replacement target of flash replacing DRAM - whereas 50x (the kind of ratio we see in value based hybrid storage arrays HDD to SSD capacity) is probably way too high.

You need a 10x ratio to get people even thinking about a different tier and while that is hard to accomplish with current software - it's the magic number which Diablo was talking about in 2015 in its Memory1 launch.
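
As a rough illustration of what those candidate ratios would mean in capacity terms, here's a small sketch. The 128 GB of installed DRAM is a hypothetical figure, and I'm reading each ratio simply as flash-as-RAM capacity per unit of DRAM capacity.

```python
DRAM_GB = 128   # hypothetical installed DRAM per server

ratios = {
    "SSD-CPU equivalence": 3,
    "with 2x compression": 3 * 2,
    "Diablo's 2015 figure": 10,
    "hybrid array style": 50,
}

def flash_as_ram_gb(dram_gb: float, ratio: float) -> float:
    """Flash-as-RAM capacity paired with dram_gb of DRAM at a given ratio."""
    return dram_gb * ratio

for label, ratio in ratios.items():
    capacity = flash_as_ram_gb(DRAM_GB, ratio)
    print(f"{label:22s} {ratio:3d}x -> {capacity:6.0f} GB flash-as-RAM")
```

The 10x figure sits between the 6x which looks too pessimistic and the 50x which looks too high.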

Eventually the market may end with a memory capacity ratio of flash-as-RAM to DRAM which is higher or lower than 10x depending on how the software market adapts to these challenges, the quality of the end use experience with the revirtualized memory mix and also how these and related memory technologies scale with respect to their relative costs and capacities in future generations.


DRAM traffic management

memory latency lessons from SSD market history

The subtle indirectness of the relationship between raw memory latency and application performance in enterprise environments was shown by the market lesson of how flash SSDs displaced RAM SSDs in over 99% of applications in a handful of years, starting from when the idea was shown to work in user installations in 2004.

Looking back at that period it's now clear that raw memory datasheet comparisons (between nand flash and DRAM and SRAM too) were almost an insignificant factor.

The success factors were:-

the lower cost per bit (which meant more data could be sucked into the acceleration zone)

the applications usefulness of read caching - compared to write caching (which minimized the importance of flash's poor intrinsic R/W cell asymmetry)

and the lower power consumption per bit of flash - which meant that huge numbers of flash chips could be deployed in big controller pools to emulate high random IOPS.

See also:- sugaring flash for the enterprise

DRAM technology won't advance soon.

"We are hitting something of a lithography wall in DRAM where shrinks are getting tougher and gains are not as attractive..."

Micron in SSD news August 2013

"...Application-unaware design of memory controllers, and in particular memory scheduling algorithms, leads to uncontrolled interference of applications in the memory system"

Are you ready to rethink RAM?

Even if you have a 3D packaging technique which can pack more chips into a box - DRAM loses out to nvm technologies which don't require refresh - when the scale of the installed capacity (and watts) in the box is high.

If you can't fit enough RAM into the same single box then the memory system accrues a box-hopping fabric-latency penalty which outweighs the benefits of the faster raw memory chip access times inside the original box.

cool runnings... Rambus to coach faster DRAM - April 2017

"At the technology level, the systems we are building through continued evolution are not advancing fast enough to keep up with new workloads and use cases. The reality is that the machines we have today were architected 5 years ago, and ML/DL/AI uses in business are just coming to light, so the industry missed a need."

From the blog - Envisioning Memory Centric Architecture by Robert Hormuth, VP/Fellow and Server CTO - Dell EMC (January 26, 2017)

A new controller architecture from Marvell - Final-Level Cache - brings the concept of tiered RAM and DIMM wars down from the cloud to products like phones which operate at battery scale. The company says that using flash as RAM can increase performance even while halving electrical power.

SSD news - May 2016

Plexistor says that its Software-Defined Memory platform will support a wide range of memory and storage technologies such as DRAM and emerging nvm devices such as NVDIMM-N and 3D XPoint as well as traditional flash storage devices such as NVMe and NVMe over Fabric, enabling a scalable infrastructure to deliver persistent high capacity storage at near-memory speed.

SSD news - December 2015

Having SSDs located in a DIMM socket in one server no longer precludes that very same data being accessed by another server (with similar latency) as if it were just a locally installed PCIe SSD.

ideas linking Diablo and A3CUBE (September 2014)

"There is only one reason to buy SSD - performance!"

...thus spake an ad on this site for the RamSan-210 (a fast rackmount SSD) in 2002.

Since then we've found more reasons (6 in total).

Only 6? - Doesn't sound much - but they gave birth to all the business plans for all the SSDs in every market.

6 doesn't sound too difficult to understand does it?

But analysis of SSD market behavior becomes convoluted in the enterprise when you stir into the viable product permutations soup pot not only the raw technology ingredients (memory, software architecture etc ) but also throw in the seasoning desires of latent customer preferences.


how much flash is needed to replace all enterprise HDDs?

meet Ken and the enterprise SSD software event horizon


In the past we've always expected the data capacity of memory systems (mainly DRAM) to be much smaller than the capacity of all the other attached storage in the same data processing environment.

cloud adapted memory systems


Can you trust market reports and the handed down wisdom from analysts, bloggers and so-called "industry experts" any more than you can trust SSD benchmarks to tell you which product is best?

heck no! - whatever gave you that silly idea?

A considerable proportion of the customer needs which affect flash array buying behavior are still formally unrecognized.

Decloaking hidden segments in the enterprise


After 2003 the only technology which could displace an SSD from its market role was another SSD (or SSD software).

SSD market history

Revisiting Virtual Memory - read the book

Editor:- April 25, 2016 - One of the documents I've spent a great deal of time reading recently is Revisiting Virtual Memory (pdf) - a PhD thesis written by Arkaprava Basu, a researcher at AMD.

My search for such a document began when I was looking for examples of raw DRAM cache performance data to cite in my blog - latency loving reasons for fading out DRAM in the virtual memory slider mix (which you see on the left of this page).

About a month after publishing it I came across Arkaprava's "book" which not only satisfied my original information gaps but also serves other educational needs too.

You can treat the first 1/3 or so of his opus as a modern refresher for DRAM architecture which also introduces the reader to several philosophies related to DRAM system design (optimization for power consumption rather than latency, for example). The work also includes detailed analysis of the relevance and efficiency of traditional cache techniques within the context of large in-memory based applications.

Among many other observations Arkaprava Basu says...

"Many big-memory workloads

a) Are long running,
b) Are sized to match memory capacity,
c) Have one (or a few) primary process(es)."

...read the book (pdf)

What we've been seeing in the enterprise PCIe SSD market in recent years is that most vendors are focusing on making these products more affordable rather than trying to push the limits of performance.

DIMM wars - the Diablo Memory1 incident

Sometimes to understand a new high level concept - you may need to first absorb and get familiar with a bunch of lower level SSD ideas - which are part of that framework.

Efficiency as internecine SSD competitive advantage

Why can't SSD's true believers agree on a single shared vision for the future of solid state storage?

the SSD Heresies

Whoops! there goes the hockey stick SSD sales ramp.

impacts of the enterprise SSD software event horizon

As the possibilities for deploying SSDs with application specific power, performance, and storage/server/app optimized roles become better understood - modern systems design groups are not only looking at designing their own SSDs (which has been happening for years already) but also looking at deeply customizing the SSD in their life - with customization which aims far beyond adding a few firmware tweaks to a COTS SSD controller and choosing the flash to populate someone else's reference design.

aspects of SSD design - the processors used inside and alongside

We can't afford NOT to be in the SSD market...

Hostage to the fortunes of SSD

"You'd think... someone should know all the answers by now. "

what do enterprise SSD users want?


"The winners in SSD software could be as important for data infrastructure as Microsoft was for PCs, or Oracle was for databases, or Google was for search."

all enterprise data will touch an SSD


The enterprise SSD story...

why's the plot so complicated?

could it have been simpler?

the golden age of enterprise SSDs