Disaster Recovery

Disaster recovery is a fairly straightforward concept - in the event of a catastrophic system failure, significant hardware damage (fire, flood, etc.) or human error, you want to recover a reasonably recent version of the data and make it available again as quickly as possible. For a critical system or service, the goal is to get it back into operation as quickly as possible.

However, a backup only contains the data as it was when the backup was created, and we can't assume that the hardware needed to provide the data or service to the end user survived the disaster. It could take days or weeks to replace major equipment such as the SAN and restore data to it.

Business Continuity

The concept of business continuity attempts to ensure that an organization can keep functioning somewhat normally in the event of a major disaster. For example, in our environment, a fire in the Learning Center would destroy the SAN, the telephone switch, our outside phone and internet connections, one of the core network switches and the fiber optic connections to many buildings, as well as a significant amount of server hardware. A business continuity solution attempts to minimize or prevent that disruption by placing additional hardware and software in another building or location. In the event of a catastrophe, business can continue on this "redundant" hardware.
 
We have a partial example of this on campus - our NetWare cluster (Causeway). Either server in the cluster could suffer a failure, and the cluster's services would continue to run on the other server. However, the Causeway cluster relies on the SAN, so a SAN failure would cause the services to fail. Inside the SAN cabinet, there are two "controller" modules, two fibre channel switches, dual power modules and a disk array set up in a fault-tolerant configuration. We can lose a controller, a fibre channel switch, a power module and a few disks without interrupting operations. We have only had three significant SAN failures in three years: two caused by problems with backup power systems cutting all power to the hardware, and one (on our old SAN hardware) caused by a catastrophic software failure that we weren't entirely able to diagnose.
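
The redundancy described above can be sketched as a small model that asks whether the service survives a given set of failures. This is only an illustration: the group names, the unit counts and the single "power_feed" entry are assumptions made for the example, not an inventory of the actual hardware, and the disk array is left out because the number of disks it can lose isn't stated here.

```python
# Hypothetical model of the redundancy described above. All names and counts
# are illustrative assumptions, not a description of the real equipment.

# For each component group: (units installed, units needed to keep running).
COMPONENT_GROUPS = {
    "cluster_server": (2, 1),
    "san_controller": (2, 1),
    "fc_switch": (2, 1),
    "power_module": (2, 1),
    # Assumption: power into the cabinet is modelled as one shared dependency,
    # so losing it takes everything down regardless of the redundancy above.
    "power_feed": (1, 1),
}


def service_survives(failed):
    """Return True if every group still has enough working units.

    `failed` maps a group name to the number of failed units in that group.
    """
    for group, (installed, needed) in COMPONENT_GROUPS.items():
        working = installed - failed.get(group, 0)
        if working < needed:
            return False
    return True


def single_points_of_failure():
    """List the groups where losing a single unit stops the service."""
    return [
        group
        for group, (installed, needed) in COMPONENT_GROUPS.items()
        if installed - 1 < needed
    ]
```

In this sketch, losing one controller, one fibre channel switch and one power module at the same time still leaves the service running, while losing the shared power feed does not. That matches the pattern of the outages described above: redundancy inside the cabinet doesn't help when something it all depends on fails.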
