How many disks does it take to store a disk-full of data?
by Zsolt Kerekes, editor - November 10, 2010
How many disks does it take to store a disk-full of data? ...in a way which ensures you can always get to it quickly.
Is this a trick
question?
Yes. No. Yes.
Ok. I confess. Maybe.
And
this comes as no surprise to you - because you can see gallons of text
dripping down into the bottom of your browser. Let's agree that it's not going
to be the short and obvious answer.
But I promise that if you stay
with me - as we wade through this article - you'll see I'm laying the
foundations for some serious rethinking of the economics and received wisdom about how data storage is done.
At this stage - it doesn't
matter whether we're talking about
hard disks or the
solid state kind. But I
promise to return to that difference later - and to show why it could make
a difference to the calculation.
So - what do I mean by a disk-full of
data?
I'm not being tricky here. I mean simply a quantity of data
that is unique and incompressible - and fills up a disk.
When I started
thinking about this article a few years ago - I was going to phrase the question
like this...
"how many terabyte disks does it take to store a
terabyte of data?"
But now we've got 3TB disks - and one day
we'll have 10TB disks - and eventually
much bigger ones too.
I want to avoid you (or me) having to reach for a calculator when following
this article. Any maths involved will be the simple kind that can be done by
counting fingers (and toes).
Let's start...
It would be perfectly reasonable to say - I've got a disk full of data. "A disk" is equal to one disk. So the answer to the question is one disk.
End of calculation. End of article.
Click
and go onto the next web page in your busy browsing day.
Did I mention this before? This data is very, very valuable.
You're running a
VC backed start-up
company and it includes all the customer inquiries from your first month
emerging from stealth mode. Or it's the compressed digital output of that new
movie you've been editing - which everyone expects will be on the short list for
the next Oscar. Or it's this morning's orders for your online retailing
business. Just imagine whatever data it is that you wouldn't like to lose.
That's what's on it. (If on the other hand you would be very happy to lose a
disk full of data - for reasons you don't want anyone else to know - and you
would prefer the data to vanish beyond
forensic recall -
there's another bunch of articles which will help you
here.)
OK
- so maybe at the local level - another disk would be a good idea. Let's call
it the backup disk. To make sure we don't forget to do it - we're going to run
the backup disk and the
original disk (that's 2 disks so far) as a
RAID 1 system. That's 2
mirrored disks. They're local. How local? Same box? Same office? - That's good
enough for now.
If disk 1 fails - then disk 2 keeps me running - and
vice versa.
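If you prefer to see that written down - here's a minimal sketch of the mirroring idea in Python. The Disk and Raid1 classes are made up purely for illustration - a real RAID controller does a great deal more than this:

```python
# Minimal RAID 1 (mirroring) sketch - the Disk and Raid1 classes are
# hypothetical stand-ins; a real controller also handles rebuilds,
# consistency checks, hot spares and much more.

class Disk:
    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.failed = False

    def write(self, addr, data):
        if not self.failed:
            self.blocks[addr] = data

    def read(self, addr):
        if self.failed:
            raise IOError(f"{self.name} has failed")
        return self.blocks[addr]


class Raid1:
    """Every write goes to both disks; a read only needs one survivor."""

    def __init__(self, disk_a, disk_b):
        self.disks = [disk_a, disk_b]

    def write(self, addr, data):
        for disk in self.disks:
            disk.write(addr, data)

    def read(self, addr):
        for disk in self.disks:
            if not disk.failed:
                return disk.read(addr)
        raise IOError("both mirrors have failed - the data is gone")


mirror = Raid1(Disk("disk 1"), Disk("disk 2"))
mirror.write(0, "this morning's orders")
mirror.disks[0].failed = True      # disk 1 dies...
print(mirror.read(0))              # ...disk 2 keeps you running
```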
But then again - if anything bad happened to either one of
those 2 disks - like a simple hardware failure - you'd be back to where you
started - and vulnerable. So let's have a hot stand-by - so that if one of those
RAIDed disks fails - then the system automatically starts a RAID rebuild and
creates another local copy on the standby disk before the lone surviving disk
with the data fails. But that rebuild can take hours - and during those hours
that sole survivor disk can fail too - before it has finished cloning
itself. And even if it doesn't - there's another hazard to prepare for...
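Before we get to that hazard - it's worth putting a rough number on the rebuild-window risk just described. The figures in this sketch (per-drive MTBF and rebuild time) are assumptions chosen purely for illustration, and the independence assumption flatters reality - drives from the same batch, under rebuild stress, fail in a much more correlated way:

```python
import math

# Assumed figures - illustration only, not measurements.
MTBF_HOURS = 300_000      # assumed mean time between failures per drive
REBUILD_HOURS = 8         # assumed time to clone onto the hot standby

# Chance the lone surviving disk dies before its clone is finished,
# using a simple exponential failure model.
p_loss = 1 - math.exp(-REBUILD_HOURS / MTBF_HOURS)
print(f"data-loss risk per rebuild: {p_loss:.4%}")   # roughly 0.003%
```

Small per rebuild - but it scales with the number of arrays you run, and it says nothing about the failure modes which take out every local disk at once.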
Because
all the local disks could fail together - for any of the following reasons -
and some more too.
- your building has burnt down, or been flooded or suffered some local
disaster
- your office has been broken into - and the disks stolen
- a virus or software error or systems administration mistake has wiped all
the local disks
- a lightning strike - or power surge zapped the power supplies in such a
way that all the equipment in your RAID system got fried
Not to worry. You're already ahead of me there. That's why you have an off-site backup. And we're still counting disks - remember?
For the same
reasons discussed above - for every disk-full of unique data - the other site
is RAIDing your data - and making sure there's a hot standby.
So
we're up to 6 disks.
But what if your office suffers a disaster - and the online backup service goes bust or stops the service when you need it? It happens a lot.
Backups fail just when you need them. They may have actually failed before - but
you didn't know because you didn't need them. But that's another story.
If
you read the longer articles on my
storage reliability
page - you'll see that to ensure a realistic probability of getting your data
back - you really need to spread your data risk across more than 3 disparate sites.
But let's skimp on the cost and call it 3 sites.
So - 3 sites - each
with 3 disks - that's 9 disks.
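To see why 3 disparate sites is the sensible skimping point - here's a minimal sketch. The 5% annual per-site loss figure is an assumption for illustration only, and the calculation assumes the sites fail independently - which is exactly what the geographic diversity is meant to buy you:

```python
# Assumed, for illustration only: each site independently fails to
# return your data in a given year with probability 5%.
p_site_loss = 0.05
disks_per_site = 3        # mirror pair plus hot standby, as counted above

for sites in (1, 2, 3):
    p_all_lost = p_site_loss ** sites
    print(f"{sites} site(s), {sites * disks_per_site} disks: "
          f"chance all copies are lost = {p_all_lost:.4%}")
# -> roughly 5%, then 0.25%, then 0.0125% as each extra site is added
```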
But what if I still need the
same data - in 5 years' time?
It may even be worth more then - because
you might have figured out how to make more money out of it by then. Or it may
be that you need the data for legal or other reasons. You're a bigger company
now - and you can scale the value of that data. But only if you've still got it.
The only problem is that the typical life of a hard drive is 3 years - so you
have to buy another 9 disks at some time. They will have more capacity than you
need - due to technical progress. But putting 2 copies of your data on the
same disk doesn't help you get instant access in the zap or flood situation.
(Although it might help with
data recovery.)
As
you can see - the cost of replacing all those redundant disks starts to mount
up.
So in a 5 year timescale you (or your backup surrogates) have
been obliged to buy or rent about 18 disks.
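That tally can be checked on fingers and toes - or with a few lines of arithmetic. The price per disk below is an arbitrary placeholder, not a market figure:

```python
SITES = 3                   # primary site plus 2 disparate backup sites
DISKS_PER_SITE = 3          # mirror pair plus hot standby
DRIVE_LIFE_YEARS = 3        # typical hard drive service life
HORIZON_YEARS = 5
PRICE_PER_DISK = 100        # arbitrary placeholder price

disks_at_any_time = SITES * DISKS_PER_SITE              # 9
generations = -(-HORIZON_YEARS // DRIVE_LIFE_YEARS)     # ceil(5 / 3) = 2
disks_bought = disks_at_any_time * generations          # 18

print(f"{disks_bought} disks bought or rented "
      f"(~{disks_bought * PRICE_PER_DISK} spent) to keep one disk-full "
      f"of data safe and reachable for {HORIZON_YEARS} years")
```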
Although there is some benefit from scaling up - if you own 1,000 disks worth of data you don't incur exactly the same level of overhead - the necessity to diversify across geography and across common-mode vulnerabilities is a high overhead in all cases. And if you aren't doing this - you are only fooling yourself that you are covered.
And before someone emails me and suggests that cloud
storage is the answer - consider this.
Just because you can't see
all those disks out there failing - and just because they are someone else's day
to day problem - doesn't make the maths go away. And when a big natural
disaster or medium business disaster - or a little bad software upgrade -
prevents you from talking to all those cloud disks - you will not be comforted
by the knowledge that somebody else was responsible for counting the disks -
looking at the flashing LEDs - and caring for them.
Did I remember to say that your business model means that you have to ensure that any of the data is accessible in a little more than 50 milliseconds? So you can't use tape backup - because it can take 30 seconds for
tape libraries to find the
right tape and access random data. And retrieving data from tape is far from
being a certain process with a happy ending.
Enough of the pessimism.
Here's some good news.
In real life, not all data is unique.
And not all data is incompressible.
And not all data is equally valuable - but for the purposes of this article we are going to suspend disbelief and maintain that it is. (Remember the Oscar? You're going to remaster that movie to make a 3D version...)
The mitigating factors I mentioned above mean that organizations which own multiple disks worth of data have a fighting chance of making their data survivable - using an order of magnitude fewer disks than suggested by my single-disk case.
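To make that concrete - here's a minimal content-addressed deduplication sketch. The chunk size and the sample 'backups' are made up, and real dedupe engines add variable-size chunking, compression and heavy metadata protection on top of this basic idea:

```python
import hashlib

CHUNK = 4096
store = {}              # chunk hash -> chunk bytes (the only copies kept)

def dedup_write(data):
    """Store only the chunks we haven't seen before; return a recipe."""
    recipe = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        key = hashlib.sha256(chunk).hexdigest()
        store.setdefault(key, chunk)
        recipe.append(key)
    return recipe

def dedup_read(recipe):
    return b"".join(store[key] for key in recipe)

# Ten made-up 'backups' which are 90% identical to each other.
backups = [b"A" * 9 * CHUNK + bytes([n]) * CHUNK for n in range(10)]
recipes = [dedup_write(b) for b in backups]

logical = sum(len(b) for b in backups)
physical = sum(len(c) for c in store.values())
print(f"logical {logical} bytes, physical {physical} bytes, "
      f"ratio {logical / physical:.1f}x")
assert dedup_read(recipes[0]) == backups[0]
```

Ten nearly identical copies end up costing little more than one - which is where that order of magnitude comes from.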
But
if you add in the random access time requirement - then most hard disk based
compression and dedupe systems still fail to meet the operational requirements.
Solid
state storage systems can, however, deliver real-time compression and dedupe and
still offer random access times which are as good as - or even better than - those of uncompressed and undeduped hard drive arrays.
The reason they can do this is that a fast SSD's raw random IOPS can be hundreds of times higher than a hard disk's (at the single disk level).
So even if the overhead of dedupe and compression creates 50x more disk churn - the net result for a data-packed bulk storage SSD system can still be better than that of an unpacked HDD system.
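Rough numbers make the point. The IOPS figures below are ballpark assumptions - not benchmarks of any particular product:

```python
# Ballpark assumptions - not benchmarks of any particular product.
HDD_RANDOM_IOPS = 200        # a fast enterprise hard drive
SSD_RANDOM_IOPS = 40_000     # a fast enterprise flash SSD
CHURN_FACTOR = 50            # assumed extra I/O from dedupe + compression

effective_packed_ssd_iops = SSD_RANDOM_IOPS / CHURN_FACTOR   # 800
print(f"packed SSD: {effective_packed_ssd_iops:.0f} IOPS vs "
      f"unpacked HDD: {HDD_RANDOM_IOPS} IOPS")
# Even after paying the 50x overhead - the packed SSD still comes out
# ahead of the raw, unpacked hard drive.
```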
When you also take into account
that reliable SSDs
(as opposed to badly
designed / flaky SSDs) may offer operating lives which are on average 3x
as long as the best enterprise hard drives - then you see just one of the
many reasons why the economics of
bulk storage flash
SSDs will start to look better than those of HDD arrays - in the datacenter - long before any convergence in the cost per raw terabyte of storage.