The creation of a truly operational disaster site is one of the most difficult tasks that any IT person can ever confront. A working disaster site blends together all of the technologies of the computer field, everything from disks and CPUs to routers and communications lines. Operations tasks such as backups, archiving and computer monitoring take on critical importance, and being on the bleeding edge of technology (especially communications and networking) becomes the norm.
Now that you’ve got your boss and the rest of management to back your disaster recovery project, you can get started. Where do you start? With lots and lots of planning, analysis and design.
Outsource the project or do it yourself?
You may want to outsource your entire disaster recovery project to a firm which specializes in such things. The advantage is that people who are presumably experts at this build and implement your plans. The disadvantage is that you lose a certain amount of control (unless you are very careful and very good at project management). In addition, outsourcing tends to be very expensive.
At my company (a multi-billion dollar retail chain) we decided to create our own disaster recovery site. Primarily we did this for financial reasons – outsourcing was incredibly expensive. We also had a difficult time finding any company (this was about ten years ago) which had any expertise in the field at all.
If you do decide to do the project in-house, do not hesitate to hire experts to help you along the way. Do not make the mistake of trying to do it all yourself – there are people who do this for a living and they can help you do it right.
If you decide to completely outsource your project, be absolutely sure your contracts are rock solid. Manage the project closely and watch every dollar – disaster recovery, like most IT projects, tends to go over budget very quickly.
Set Your Goals
Decide up front what you want to accomplish. This is essential to measuring the success of your project. It is also important to be sure that everyone in your company (and affiliated companies) is crystal clear on those goals. For example, if your goal is to have a partially operational disaster site within 24 hours of a disaster being declared, then make sure that is stated. If not, you may find your managers expecting the disaster site to be ready immediately.
Determine your disaster site readiness
Some sites are defined as “cold”, meaning you more or less perform regular backups and send them off site. Your backup computers are ready to go, but would require restores of data, perhaps configurations and so on. Cold sites are much simpler to set up because high-capacity communications lines are frequently not required (backup tapes can be manually transported). They are also a complete pain to get up and running in the event of a disaster.
You may want to create a “warm” site, meaning the computers are up and running but not immediately available. In this case, you might copy the data across communications lines but not restore it right away. Warm sites are more expensive, but are easier to bring on line. The equipment is up and running, data is moving over to them regularly, but in the event of a disaster operators must manually restore data and bring the site on line.
You may, as we did, want a “hot” site. This is really a duplicate computer room (and perhaps offices as well) which is ready to go at a moment’s notice. Data is not only copied over regularly; it is applied to the databases on those systems immediately.
If you really want to go to the extreme, you might even define your site as “white hot”. This means the disaster site is not only being updated constantly but is monitoring your live site for failures regularly. In the event of a failure, the disaster site immediately takes over. You might need this level of functionality for a nuclear power plant or an airplane guidance system, for example.
I also know some managers who define “ice cold” sites. They simply had a computer room ready and kept their backup tapes ready. In the event of a disaster, they contracted with outside firms to be ready to set up the systems and restore the data. This is only useful if your company can survive extended (one or more weeks) downtime.
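To make the comparison between readiness levels concrete, here is a minimal sketch that lists the five levels described above and picks the least-ready (and so presumably cheapest) one that still meets a recovery-time goal. The recovery-hour figures are my own placeholders for illustration, apart from the week or more of downtime mentioned for ice cold sites; the names SiteTier, TIERS and least_ready_tier_meeting are hypothetical as well. Substitute numbers from your own analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteTier:
    name: str
    data_arrival: str
    bring_up: str
    assumed_recovery_hours: float  # placeholder value, NOT a figure from the article


# Ordered from least ready to most ready.
TIERS = [
    SiteTier("ice cold", "backup tapes kept on hand",
             "outside firm sets up systems and restores data", 168.0),
    SiteTier("cold", "regular backups sent off site",
             "restore data and configurations", 72.0),
    SiteTier("warm", "data copied over communications lines, not restored",
             "operators restore data and bring the site on line", 24.0),
    SiteTier("hot", "data applied to databases immediately",
             "duplicate computer room ready at a moment's notice", 1.0),
    SiteTier("white hot", "data applied immediately, live site monitored",
             "disaster site takes over automatically", 0.0),
]


def least_ready_tier_meeting(target_hours: float) -> SiteTier:
    """Return the first (least ready) tier whose assumed recovery time fits the goal."""
    for tier in TIERS:
        if tier.assumed_recovery_hours <= target_hours:
            return tier
    raise ValueError("target_hours must be zero or greater")


if __name__ == "__main__":
    goal = 24  # e.g. partially operational within 24 hours of a declared disaster
    print(f"Goal of {goal} hours -> {least_ready_tier_meeting(goal).name} site")
```

With these placeholder figures, a 24-hour goal points at a warm site. The mapping is only as good as the hours you plug in, so treat it as a way to frame the discussion with management rather than as a rule.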
Figure out how much data is updated
This is one of the most critical questions: how much data are your applications generating (creating or updating)? If your databases are just a few gigabytes, you should have no trouble getting the data to a disaster site. If, however, you are talking about terabytes of data per day, then you might find yourself spending a huge amount of money on communications lines.
In our project, we initially didn’t do a great job at this step. We did some initial estimates and based upon those purchased a T1 line. When we began to actually transmit information, however, we found the T1 was woefully inadequate. We had to replace the T1 with a T3, which involved a delay of over 18 months (the time for the phone company to install it) and several hundred thousand dollars of cost overruns.
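A back-of-the-envelope calculation would have caught our mistake early. The sketch below estimates how long a given daily volume takes to cross a T1 or a T3, using the standard nominal rates for those lines (1.544 and 44.736 megabits per second). The 20 GB daily volume and the 80% efficiency factor are assumptions for illustration, not measurements from our project, and transfer_hours is a hypothetical helper name.

```python
# Nominal line rates in megabits per second.
LINE_RATES_MBPS = {"T1": 1.544, "T3": 44.736}


def transfer_hours(gigabytes: float, line_mbps: float, efficiency: float = 0.8) -> float:
    """Hours needed to move `gigabytes` of data over a line.

    `efficiency` is an assumed fraction of the nominal rate that remains
    usable after protocol overhead; 0.8 is a placeholder, not a measurement.
    """
    megabits = gigabytes * 8 * 1000  # decimal units: 1 GB = 8,000 megabits
    usable_mbps = line_mbps * efficiency
    return megabits / usable_mbps / 3600


if __name__ == "__main__":
    daily_gb = 20  # hypothetical daily change volume
    for name, rate in LINE_RATES_MBPS.items():
        hours = transfer_hours(daily_gb, rate)
        print(f"{name}: {hours:.1f} hours to move {daily_gb} GB")
```

Under these assumptions a T1 needs well over a day to move a single day's changes, which means it can never catch up, while a T3 moves the same volume in a little over an hour. Run the numbers with your own measured volumes before you order a line.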
Get a precise picture of the timing of data updates
This is another critical question which must be answered precisely near the start of the project. You need to graph exactly how much data is being updated on your systems at each time of day (and through a normal week). Don’t forget that holidays generally create more data modifications.
Why is this important? Well, your application may modify twenty gigabytes per day. This is not a huge amount of data to transmit to a disaster site, unless, of course, it is all updated in a one-hour period. Depending on how quickly your disaster site must be available, you may have to transmit that data very quickly, which means you may need a larger communications system.
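The twenty-gigabyte example is easy to work through. The sketch below converts a volume and a time window into the sustained line rate needed, and finds the busiest window in an hourly profile. The profile (all 20 GB landing in a single hour) is made up to match the example; required_mbps and busiest_window_gb are hypothetical helper names, and the busiest-window approach is a rough sizing heuristic rather than a full queueing analysis.

```python
from typing import Sequence


def required_mbps(gigabytes: float, window_hours: float) -> float:
    """Sustained rate in megabits/second needed to move `gigabytes` in `window_hours`."""
    megabits = gigabytes * 8 * 1000  # decimal units: 1 GB = 8,000 megabits
    return megabits / (window_hours * 3600)


def busiest_window_gb(hourly_gb: Sequence[float], window_hours: int) -> float:
    """Largest volume generated in any run of `window_hours` consecutive hours."""
    if not 1 <= window_hours <= len(hourly_gb):
        raise ValueError("window_hours must be between 1 and the profile length")
    last_start = len(hourly_gb) - window_hours
    return max(sum(hourly_gb[i:i + window_hours]) for i in range(last_start + 1))


if __name__ == "__main__":
    # Hypothetical day: 20 GB of updates, all during one overnight batch hour.
    profile = [0.0] * 24
    profile[2] = 20.0

    for window in (1, 24):
        volume = busiest_window_gb(profile, window)
        rate = required_mbps(volume, window)
        print(f"keep up within {window:2d} h: {rate:.2f} Mbit/s")
```

Spread over 24 hours, 20 GB needs less than 2 Mbit/s. If the disaster site must keep up within the hour, the same 20 GB needs roughly 44 Mbit/s, which is about the nominal rate of an entire T3. The daily total is the same in both cases, which is why the timing graph matters as much as the volume.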
Understand the Criticality of Each Application System
In our business, the “just-in-time” ordering process requires 100% uptime. These applications are critical and the data must move over to the disaster site immediately. On the other hand, accounting data is only updated once per week (for example) and thus the data can make its way over to the disaster site much less urgently.
Be careful with this analysis, however. You may find that you have totally up-to-date information in your ordering system on the disaster site, but since the accounting data is stale, a disaster would require an immense amount of manual correction and data entry.
This question also determines which servers and workstations exist and the state they are in at the disaster site. A just-in-time ordering system may require the servers be ready to go at all times, while the accounting servers may be turned off and data restored from backup tapes. It just depends upon the application.
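One way to keep this analysis honest is to write the inventory down and check it mechanically. The sketch below records, for each application, how stale its copy at the disaster site is allowed to be and which other applications it relies on, then flags the kind of mismatch described above. The application names, staleness figures and the dependency between ordering and accounting are hypothetical examples, as are the Application class and stale_dependencies function.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class Application:
    name: str
    max_staleness_hours: float  # how old the disaster-site copy is allowed to be
    depends_on: List[str] = field(default_factory=list)


def stale_dependencies(apps: Sequence[Application]) -> List[Tuple[str, str]]:
    """Return (application, dependency) pairs where the dependency's copy
    may be staler than the application that relies on it."""
    by_name = {app.name: app for app in apps}
    problems = []
    for app in apps:
        for dep_name in app.depends_on:
            dep = by_name.get(dep_name)
            if dep is not None and dep.max_staleness_hours > app.max_staleness_hours:
                problems.append((app.name, dep_name))
    return problems


if __name__ == "__main__":
    # Hypothetical inventory: ordering replicated immediately, accounting weekly.
    inventory = [
        Application("ordering", max_staleness_hours=0, depends_on=["accounting"]),
        Application("accounting", max_staleness_hours=168),
    ]
    for app_name, dep_name in stale_dependencies(inventory):
        print(f"{app_name} relies on {dep_name}, whose copy may be staler")
```

Each flagged pair is a prompt for a conversation, not an automatic verdict: sometimes the manual catch-up work is acceptable, and sometimes it means the supposedly low-priority system needs faster replication after all.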
Okay, I think that’s enough information for now. Next time, we’ll continue the analysis and drill down on the various components of a working disaster system.