Recovering Your Sun-Hosted Business from Administrator-Induced Data Loss
In March 2003, ActionFront resolved a major technical crisis involving Sun Microsystems, EMC, Veritas and Oracle systems, and saved the client's business!
A data-centre user error caused a major data loss emergency on a mission-critical system. Had the company in question been unable to regain access to its server and data, the future of the business would have been imperiled.
Here is a detailed description of the setting,
the problem and the resolution.
The Setting
- The main server was a Sun E5500 server with 10 CPUs running Sun Solaris.
- The application itself was based on Oracle Enterprise Edition version 8.1.5.
- The data storage side ran under a Veritas file system with multiple
partitions.
- 56 of these partitions were used to serve data to a particular application.
- Approximately 260 files resided across these 56 partitions, and while
ranging in size from ½GB to 8GB, most were approximately 2GB.
- 46 of the partitions resided on an EMC Symmetrix storage system; the other 10 partitions resided on Sun storage systems.
- The EMC Symmetrix was configured as a series of discrete mirrored arrays: each array consisted of an 18GB drive mirrored to a second 18GB drive, and each was partitioned into two 9GB segments, yielding the 46 partitions on that system.
- The usable capacity exceeded 600GB in total.
- A new EMC CLARiiON storage system had been purchased, and the plan was to consolidate all the storage from the 56 partitions onto the CLARiiON. A migration process was planned.
The Problem
On March 8, 2003, a quiet Saturday, the systems administrator wrote a script to perform the migration and decided to test it with the actual copy commands commented out. He made a typo in the copy command, in effect instructing the main data storage to copy onto itself, and then compounded the mistake by commenting out the wrong line.
He initiated the test run, which then attempted to copy each file over itself. Under the Solaris/UNIX file system this overwrote the file inodes, erasing all file allocation information and truncating each file to zero length. Overwriting directory information, unlike actually copying data, is a very quick process, and the damage was done almost instantly.
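The mechanism is easy to demonstrate. The sketch below is a hypothetical illustration in Python, not the administrator's actual script (which the article does not reproduce): it shows why a copy whose source and destination resolve to the same file destroys the data, because opening the destination for writing truncates it before a single byte has been read from the source.

    # Hypothetical illustration only -- not the administrator's script.
    # Opening a file for writing truncates it, so a copy whose source and
    # destination are the same path wipes the data before anything is read.
    import os

    path = "demo.dat"                    # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"x" * 1024)             # 1 KB of sample data

    # A naive copy loop where src and dst are the same file:
    with open(path, "rb") as src, open(path, "wb") as dst:
        # opening dst with "wb" has already truncated the file,
        # so src now reads back nothing at all
        dst.write(src.read())

    print(os.path.getsize(path))         # prints 0 -- the data is gone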
Who You Gonna Call? - Ghostbusters? - No... the server
wasn't haunted.
The administrator knew immediately that he had a huge problem and started calling for help late on the Saturday. Each of his vendors (Veritas, Sun, EMC and Oracle) was sympathetic but could not offer any help at this stage.
He then started to look for a 24/7 data recovery service that was prepared to come on-site. One of the data recovery market leaders boasts about its remote recovery expertise; however, when contacted, it insisted that the entire storage configuration be shipped to its facility, which the customer dismissed as impractical.
He soon found the ActionFront website, where the 24/7 Critical Response service is promoted alongside the standard "Priority Service", which is offered during extended hours, six days per week, and meets the timeline and budget expectations of most of our customers. The Critical Response Service is for the select few clients that need an extraordinary level of around-the-clock service and have sufficient budget resources available to cover the costs.
The 24/7 contact information for the Critical Response Service is displayed on the ActionFront website, and the distressed admin reached the on-call ActionFront consultant at 2 a.m. on Sunday.
A long discussion of his circumstances and the problem ensued. He made it clear that the system was needed to carry on day-to-day transactions, which precluded shipping it out or even shutting it down completely. On the advice of the ActionFront consultant, he did the very best thing he could under the circumstances: he unmounted all the damaged partitions to prevent any further overwriting of data. He missed one of the 56 partitions because a process was still running on it during the unmount procedure.
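For readers unfamiliar with the step, the idea is simply to unmount every damaged file system so that nothing else can write to it; a file system that still has a process using it will refuse to unmount, which is exactly how one of the 56 partitions was missed. A rough sketch of the idea follows; the mount points are invented for illustration and the article does not say how the partitions were actually taken offline.

    # Hypothetical sketch -- mount points are invented; the article does
    # not describe how the partitions were actually taken offline.
    import subprocess

    damaged_mount_points = ["/u01", "/u02", "/u03"]   # illustrative only

    for mp in damaged_mount_points:
        # umount fails (typically "device busy") if a process is still
        # using the file system, so failures must be checked and retried
        result = subprocess.run(["umount", mp], capture_output=True, text=True)
        if result.returncode != 0:
            print(f"could not unmount {mp}: {result.stderr.strip()}")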
ActionFront responded with a lot more than just good advice: by noon Sunday a plane ticket had been purchased, the traveling recovery kit had been procured from the office, and a senior technician was on his way to a distant city. He arrived at the customer site at 9:30 pm and went to work.
Prior to attempting any recovery activities, his first action was to make a complete copy of all the media involved. The network remained live as the customer carried on "business as usual" as much as possible. He copied complete images of the affected drives, segment by segment, to large-capacity drives brought along as part of the recovery kit. The copy process was very slow because it was done over the live network.
Unhappy with the slow copy process, the ActionFront specialist
analyzed the configuration and parameters he had to work under and soon devised
a way to stream multiple devices simultaneously. This increased the transfer
rate without affecting anything else, and shortened the overall recovery time.
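The article does not describe the actual tooling used on-site, but the general idea of streaming several devices at once can be sketched as follows; the device paths, image file names and the use of dd are illustrative assumptions only, not details from the engagement.

    # Hypothetical sketch of imaging several source devices in parallel.
    # Device paths, image names and the use of dd are assumptions; they
    # are not details from the actual engagement.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    # (source device, destination image) pairs -- illustrative only
    jobs = [
        ("/dev/rdsk/c1t0d0s2", "/recovery/c1t0d0s2.img"),
        ("/dev/rdsk/c1t1d0s2", "/recovery/c1t1d0s2.img"),
        ("/dev/rdsk/c1t2d0s2", "/recovery/c1t2d0s2.img"),
    ]

    def image_device(src, dst):
        # dd streams the raw device into an image file in large blocks
        return subprocess.run(
            ["dd", f"if={src}", f"of={dst}", "bs=1048576"], check=True
        )

    # Run several copies at once so multiple devices stream simultaneously,
    # instead of imaging them one after another over the live network.
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = [pool.submit(image_device, src, dst) for src, dst in jobs]
        for f in futures:
            f.result()   # surface any dd failure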
Working on the copied versions of the files, the ActionFront technician repaired and rebuilt the damaged file systems. Fortunately, the allocation map was intact, and this sped up the recovery process at this stage. (The allocation map could have been recreated had it been wiped out, but that would have extended the recovery time.)
One partition had not been unmounted at the beginning of the crisis, resulting in overwrite damage to 4 of the 6 files on that partition. One of these files had its Oracle file header damaged, and Oracle provided support for the recreation of the damaged header.
Some of the other logical devices that were damaged were also fixed at this stage and then reintegrated into the database.
This story has a happy ending...
All was fixed by 6 pm on Friday, March 14, approximately six days after the crisis began. While the company had some ability to function during the crisis, the loss of much of its historical data threatened the profitability, and perhaps the viability, of the company itself.
So why do data storage professionals need 3rd party Data
Recovery Services?
Fault-tolerant data storage systems are generally reliable and well managed. When device failure, user error or other problems do cause these systems to fail, it is a rare event, often the first time the operator has faced these circumstances. Recovering from such a failure can be beyond the training and experience of most of the technical community, including vendor technical support, let alone the unlucky systems administrator. Under the stress of a data emergency, even the best technicians can make mistakes when in unfamiliar territory, whereas our data recovery specialists deal with these situations every day and are well qualified to address the problems.
Free
advice about dealing with a data loss emergency is available.
ActionFront has published a number of articles with general advice which can help you, including:- the Data Emergency Guide - IT Professional Edition and the original Data Emergency Guide (both of which can be viewed by clicking on these links without log-ins on our sister site STORAGEsearch.com - ed.)
Why Alliances
with ActionFront Benefit IT and Data Storage Vendors and VARs
Data loss victims are usually distressed and can sometimes be angry, even if the anger is misdirected. A referral to ActionFront, a trusted 3rd-party authority, can:
- Bring a positive resolution to the support call.
- Avoid the exposure that comes with trying to provide do-it-yourself data recovery advice.
- Help separate a data recovery issue from the expectation that it should be
covered by warranty.
- Calm down the customer and help them understand that their problem is not the vendor's fault.
- Resolve the customer's problem by recovering their data and making their systems operational again.
...ActionFront
Data Recovery profile
Editor's afterword:- there are no silver bullets which will protect your data from all data demons. This article clearly shows that relying on a single type of data protection technology makes you vulnerable to common-mode failure. In this case, the common mode was a migration script, but it could just as easily be a virus. You should always use more than one type of backup technology for peace of mind. But it's good to know that there are companies you can call when all else fails.