Our disaster recovery project

At Trader Joe’s, we decided to create our disaster site in-house. We started the project in 1994 and expected it to take about six months. Boy, were we naive! We were so confident, so sure of ourselves – and we had absolutely no idea what we were getting ourselves into.

Purpose of the project

The first thing to do is decide what you are trying to accomplish. Since our company is located in Southern California, our primary purpose was to protect against earthquakes. Because of this, we had to plan for the worst-case scenario: not only a total site failure, but the complete destruction of the local city.

What To Protect

The next thing we did was take a good look at what we had to protect from a disaster. At the time (and this changed drastically before we were finished), we had two large mainframe systems running about 1,500 application programs with about 40 GB of data (which was a lot back then). Our systems were large VAX computers, and it just so happened that a product was available which supported something called remote shadowing. In theory, this would allow all disk writes occurring on our production systems to be mirrored to a disaster site.

Where to host the site

Before we could begin, we had to choose a place for the site. After some thought, we hired a group of students from Caltech to find us a location in a different earthquake zone from our main office. Their survey found a nice location about 45 miles away which was safely on a different set of plates, and thus would not be subject to an earthquake at the same time as our primary site.

As it happened, we also needed a warehouse, so we found space in the area and built one. We made sure it had a large computer room, some office space and everything else that we needed.

Communications

We looked at a lot of options for moving the data to the disaster site. I know it’s tough to imagine, but in those days the internet did not exist in the form it does now. Communications were far more complex than they are today, and far more expensive.

We purchased a T1 line, which we figured was more than fast enough for our needs. It took the phone company a full 18 months to get the thing installed (12 months more than promised).

The Plan Goes Wrong

Once we started to put all of the pieces together, we found that it was not working as well as we had expected. Well, actually, the plan didn’t work at all. The problem was our database and applications. They were not very well written, and they used huge records. What we hadn’t counted on was how much our applications wrote to disk: far more writes than we had ever dreamed of.

The shadowing software basically copied every changed block from the production system to the disaster system. So much data changed on a second-to-second basis that the T1 simply could not carry it all.
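To put rough numbers on that kind of mismatch, here is a small Python sketch. The line rates are the standard nominal T1 and T3 figures. Everything else is an illustrative assumption rather than a measurement from our project: the 80% efficiency factor, the 500 KB/s write rate, and the helper names are all made up for the example.

```python
# Nominal T-carrier line rates in bits per second.
T1_BPS = 1_544_000
T3_BPS = 44_736_000

# ASSUMPTION: fraction of the line rate left after protocol overhead.
EFFICIENCY = 0.8


def usable_bytes_per_second(line_rate_bps, efficiency=EFFICIENCY):
    """Convert a line rate in bits/s into usable bytes/s."""
    return line_rate_bps * efficiency / 8


def can_keep_up(write_bytes_per_second, line_rate_bps, efficiency=EFFICIENCY):
    """True if the link can carry a sustained write rate without falling behind."""
    return write_bytes_per_second <= usable_bytes_per_second(line_rate_bps, efficiency)


def hours_to_copy(total_bytes, line_rate_bps, efficiency=EFFICIENCY):
    """Hours needed to push total_bytes across the link once."""
    return total_bytes / usable_bytes_per_second(line_rate_bps, efficiency) / 3600


if __name__ == "__main__":
    hypothetical_write_rate = 500_000  # bytes/s, invented for illustration
    data_size = 40e9                   # the roughly 40 GB mentioned above
    for name, rate in (("T1", T1_BPS), ("T3", T3_BPS)):
        print(name,
              "keeps up:", can_keep_up(hypothetical_write_rate, rate),
              "| hours for initial copy:", round(hours_to_copy(data_size, rate), 1))
```

Under these assumptions a T1 moves roughly 150 KB per second, so even a one-time copy of 40 GB would take about three days, before counting a single changed block.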

Back to the Drawing Board

Smarting from our failure, we went back to the drawing board. We had learned a lot, most importantly to validate our assumptions before going forward. We installed performance monitoring tools and got a precise picture of exactly what was happening on our computer systems.

We knew this was now the key to success. Know before you go. We spent months mapping out everything. When was data moving around on our systems? Where was it going? On which days was the I/O heavier or lighter? What about the holiday peaks? There were many meetings and lots of long discussions, but this time we were not going to fail.
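Questions like these come down to grouping monitoring samples by time slot and looking at the peaks. The following Python sketch shows the general idea only; it is not the tooling we used. The sample format (a timestamp paired with a write rate in bytes per second), the function names, and the numbers are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def peak_write_rate_by_slot(samples):
    """Return the peak write rate seen in each (weekday, hour) slot.

    samples: iterable of (datetime, bytes_per_second) pairs.
    weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    """
    peaks = defaultdict(float)
    for timestamp, rate in samples:
        slot = (timestamp.weekday(), timestamp.hour)
        peaks[slot] = max(peaks[slot], rate)
    return dict(peaks)


def busiest_slot(samples):
    """Return ((weekday, hour), peak_rate) for the heaviest slot."""
    peaks = peak_write_rate_by_slot(samples)
    return max(peaks.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Invented samples around a holiday peak, purely for illustration.
    samples = [
        (datetime(1996, 12, 23, 10, 0), 120_000),
        (datetime(1996, 12, 23, 10, 30), 300_000),
        (datetime(1996, 12, 24, 2, 0), 50_000),
    ]
    print(busiest_slot(samples))
```

Sizing the link against the peak slot rather than the daily average is the point: the link has to survive the worst hour of the worst day, not the typical one.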

The World Changes

In the meantime, our needs had been changing. The amount of data was growing at an incredible rate, and before the end of the century we found ourselves with data needs approaching half a terabyte. We began migrating our applications to Windows NT (at first 3.51, followed by 4.0 and later Windows 2000) and a SQL database.

After our analysis we knew exactly what we needed to do. We installed a T3, which was very expensive, but we knew without doubt that it would handle our throughput needs. We hired some very competent consultants and wrote our own data copying routines. We wrote up detailed specifications, tightly managed the project, and tested every single assumption.

This time we were successful. In fact, the second disaster recovery project was completed on time to the day and exactly on budget, and it achieved all of its goals.

We later expanded our disaster recovery to include our other applications: SAP, a home grown forecasting system, and dozens of other databases and systems.

The Key To Success

We learned an incredible lesson – planning makes perfect. The first project failed because we did not test our assumptions, we did not plan well, and we did not control every single detail of the project.

The second project was completely and utterly successful because we learned our lesson. We thoroughly re-analyzed the project and tested all assumptions. Most of all, we controlled every phase from beginning to end.

The point of all of this is that a disaster recovery project is complex, and it is very easy to take too lightly. After all, these are not daily production systems, so it’s common to think of them as somehow less important.

The final lesson we learned was to include disaster recovery in all future projects. It is an essential part of every project of any size, and it’s best to consider it from the beginning. That reduces costs and increases the chances for success.
