Recovering Your Sun-Hosted Business from Administrator-Induced Data Loss
In March 2003, ActionFront resolved a major technical crisis involving Sun Microsystems, EMC, Veritas and Oracle systems, and saved the client's business!
A data-centre user error caused a major data loss emergency on a mission-critical system. Had the company in question been unable to regain access to its server and data, the future of the business would have been imperiled.
Here is a detailed description of the setting,
the problem and the resolution.
The Setting
- The main server was a Sun E5500 server with 10 CPUs running Sun Solaris.
- The application itself was based on Oracle Enterprise Edition version 8.1.5.
- The data storage side ran under a Veritas file system with multiple
partitions.
- 56 of these partitions were used to serve data to a particular application.
- Approximately 260 files resided across these 56 partitions, and while
ranging in size from ½GB to 8GB, most were approximately 2GB.
- 46 of the partitions resided on an EMC Symmetrix storage system; the other 10 partitions resided on Sun storage systems.
- The EMC Symmetrix was configured as a series of discrete mirrored arrays: each array consisted of an 18GB drive mirrored to a second 18GB drive, and each was partitioned into two 9GB segments, yielding the 46 partitions on that system.
- The usable capacity exceeded 600GB in total.
- A new EMC CLARiiON storage system had been purchased, and the plan was to consolidate all the storage from the 56 partitions onto the CLARiiON. A migration process was planned.
The Problem
On March 8, 2003, a quiet Saturday, the systems administrator wrote a script to perform the migration and decided to test it with the actual copy commands commented out. He made a typo in the copy command, in effect instructing the main data storage to copy onto itself, and then compounded the mistake by commenting out the wrong line.
He initiated the test run, which then attempted to copy each file over itself. Under the Solaris/UNIX file system this overwrote the file inodes, erasing all file allocation information and truncating each file to zero length. Overwriting directory information, unlike actually copying data, is a very quick process, and the damage was done almost instantly.
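The mechanism is easy to demonstrate. The sketch below is a hypothetical illustration in Python, not the administrator's actual script (which the article does not reproduce): it shows why a copy whose source and destination resolve to the same file destroys the data, because opening the destination for writing truncates it before a single byte has been read from the source.

    # Hypothetical illustration only -- not the administrator's script.
    # Opening a file for writing truncates it, so a copy whose source and
    # destination are the same path wipes the data before anything is read.
    import os

    path = "demo.dat"                    # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"x" * 1024)             # 1 KB of sample data

    # A naive copy loop where src and dst are the same file:
    with open(path, "rb") as src, open(path, "wb") as dst:
        # opening dst with "wb" has already truncated the file,
        # so src now reads back nothing at all
        dst.write(src.read())

    print(os.path.getsize(path))         # prints 0 -- the data is gone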
Who You Gonna Call? - Ghostbusters? - No... the server
wasn't haunted.
The administrator knew immediately that he had a huge problem and started calling for help late on the Saturday. Each of his vendors (Veritas, Sun, EMC and Oracle) was sympathetic but could not offer any help at this stage.
He then started to look for a 24/7 data recovery service that was prepared to come on-site. One of the data recovery market leaders boasts about its remote recovery expertise; however, when contacted, it insisted that the entire storage configuration be shipped to its facility, which the customer dismissed as impractical.
He soon found the ActionFront website, where the 24/7 Critical Response service is promoted alongside the standard "Priority Service", which is offered during extended hours, six days per week, and meets the timeline and budget expectations of most of our customers. The Critical Response Service is for the select few clients that need an extraordinary level of around-the-clock service and have sufficient budget resources available to cover the costs.
The 24/7 contact information for the Critical Response Service is displayed on the ActionFront website, and the distressed admin reached the on-call ActionFront consultant at 2 a.m. on Sunday.
A long discussion of his circumstances and the problem ensued. He made it clear that the system was needed to carry on day-to-day transactions, which precluded shipping it out or even shutting it down completely. On the advice of the ActionFront consultant, he did the very best thing he could under the circumstances: he unmounted all the damaged partitions to prevent any further overwriting of data. He missed one of the 56 partitions because a process was still running on it during the unmount procedure.
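For readers unfamiliar with the step, the idea is simply to unmount every damaged file system so that nothing else can write to it; a file system that still has a process using it will refuse to unmount, which is exactly how one of the 56 partitions was missed. A rough sketch of the idea follows; the mount points are invented for illustration and the article does not say how the partitions were actually taken offline.

    # Hypothetical sketch -- mount points are invented; the article does
    # not describe how the partitions were actually taken offline.
    import subprocess

    damaged_mount_points = ["/u01", "/u02", "/u03"]   # illustrative only

    for mp in damaged_mount_points:
        # umount fails (typically "device busy") if a process is still
        # using the file system, so failures must be checked and retried
        result = subprocess.run(["umount", mp], capture_output=True, text=True)
        if result.returncode != 0:
            print(f"could not unmount {mp}: {result.stderr.strip()}")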
ActionFront responded with a lot more than just good advice: by noon Sunday a plane ticket had been purchased, the traveling recovery kit had been procured from the office, and a senior technician was on his way to a distant city. He arrived at the customer site at 9:30 pm and went to work.
Prior to attempting any recovery activities, his first action was to make a complete copy of all the media involved. The network remained live as the customer carried on "business as usual" as much as possible. He copied complete images of the affected drives, segment by segment, to large-capacity drives brought along as part of the recovery kit. The copy process was very slow because it was done over the live network.
Unhappy with the slow copy process, the ActionFront specialist
analyzed the configuration and parameters he had to work under and soon devised
a way to stream multiple devices simultaneously. This increased the transfer
rate without affecting anything else, and shortened the overall recovery time.
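The article does not describe the actual tooling used on-site, but the general idea of streaming several devices at once can be sketched as follows; the device paths, image file names and the use of dd are illustrative assumptions only, not details from the engagement.

    # Hypothetical sketch of imaging several source devices in parallel.
    # Device paths, image names and the use of dd are assumptions; they
    # are not details from the actual engagement.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    # (source device, destination image) pairs -- illustrative only
    jobs = [
        ("/dev/rdsk/c1t0d0s2", "/recovery/c1t0d0s2.img"),
        ("/dev/rdsk/c1t1d0s2", "/recovery/c1t1d0s2.img"),
        ("/dev/rdsk/c1t2d0s2", "/recovery/c1t2d0s2.img"),
    ]

    def image_device(src, dst):
        # dd streams the raw device into an image file in large blocks
        return subprocess.run(
            ["dd", f"if={src}", f"of={dst}", "bs=1048576"], check=True
        )

    # Run several copies at once so multiple devices stream simultaneously,
    # instead of imaging them one after another over the live network.
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = [pool.submit(image_device, src, dst) for src, dst in jobs]
        for f in futures:
            f.result()   # surface any dd failure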
Working on the copied versions of the files, the ActionFront technician repaired and rebuilt the damaged file systems. Fortunately, the allocation map was intact, and this sped up the recovery process at this stage. (The allocation map could have been recreated had it been wiped out, but that would have extended the recovery time.)
One partition had not been unmounted at the beginning of the crisis, resulting in overwrite damage to 4 of the 6 files on that partition. One of these files had its Oracle file header damaged, and Oracle provided support for the recreation of the damaged header.
Some of the other logical devices that were damaged were also fixed at this stage and then reintegrated into the database.
This story has a happy ending...
All was fixed by 6 pm on Friday, March 14, approximately six days after the crisis began. While the company had some ability to function during the crisis, the loss of much of its historical data threatened the profitability, and perhaps the viability, of the company itself.
So why do data storage professionals need 3rd party Data
Recovery Services?
Fault-tolerant data storage systems are generally reliable and well managed. When device failure, user error or other problems do cause these systems to fail, it is a rare event, often the first time the operator has faced these circumstances. Recovering from such a failure can be beyond the training and experience of most of the technical community, including vendor technical support, let alone the unlucky systems administrator. Under the stress of a data emergency, even the best technicians can make mistakes when in unfamiliar territory, whereas our data recovery specialists deal with these situations every day and are well qualified to address the problems.
Free
advice about dealing with a data loss emergency is available.
ActionFront has published a number of articles with general advice which can help you, including:- the Data Emergency Guide - IT Professional Edition and the original Data Emergency Guide (both of which can be viewed by clicking on these links without log-ins on our sister site STORAGEsearch.com - ed.)
Why Alliances
with ActionFront Benefit IT and Data Storage Vendors and VARs
Data loss victims are usually distressed and can sometimes be angry, even if the anger is misdirected. A referral to ActionFront, a trusted 3rd-party authority, can:
- Bring a positive resolution to the support call.
- Avoid the exposure that comes with trying to provide do-it-yourself data recovery advice.
- Help separate a data recovery issue from the expectation that it should be
covered by warranty.
- Calm down the customer and help them understand that their problem is not the vendor's fault.
- Resolve the customer's problem by recovering their data and making their systems operational again.
...ActionFront
Data Recovery profile
Editor's afterword:- there are no silver bullets which will protect your data from all data demons. This article clearly shows that relying on a single type of data protection technology makes you vulnerable to common-mode failure. In this case, the common mode was a migration script, but it could just as easily be a virus. You should always use more than one type of backup technology for peace of mind. But it's good to know that there are companies you can call when all else fails.