When Things Go Wrong

Servers, storage systems, backup systems, and the software running on them or using them can all suffer failures, no matter how well run they are. Sometimes it's a physical hardware fault; sometimes it's a bug. In either case, to fix it we have to know that a failure occurred. If we are on campus, we tend to notice a major problem quickly ourselves. But we still need some form of notification after hours, as well as for things like network switches in buildings.

An important part of this is active monitoring. We have set up software that periodically checks whether servers (or, in some cases, services) are still reachable. If a server can't be reached for a period of time, an alert is sent via email, generally to staff in the Systems and Networking and Enterprise Applications groups. The messages are also relayed, either through external email to a BlackBerry or to a cellphone via email to a carrier's SMS gateway. In some cases, the problem can be fixed remotely: we have made some investments over time that allow us to address certain kinds of failures without having to come to campus.
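As a rough illustration of the "unreachable for a period of time" rule, here is a minimal Python sketch. It is not the monitoring software we actually run. The names is_reachable and OutageTracker, the TCP-connect test, and the threshold of three consecutive failed checks are all assumptions made for the example.

```python
import socket
from dataclasses import dataclass, field


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


@dataclass
class OutageTracker:
    """Counts consecutive failed checks per host and says when to alert."""

    threshold: int = 3  # assumed value: failed checks in a row before alerting
    failures: dict = field(default_factory=dict)

    def record(self, host: str, reachable: bool) -> bool:
        """Record one check result; return True exactly when an alert is due."""
        if reachable:
            self.failures[host] = 0
            return False
        self.failures[host] = self.failures.get(host, 0) + 1
        # Alert once, when the threshold is first reached, not on every check.
        return self.failures[host] == self.threshold
```

A scheduler would call is_reachable for each host at a fixed interval and pass the result to record; a single missed check then does not generate an email, but a sustained outage does.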

In many cases, the pattern of an outage allows us to narrow the problem down before we arrive on-site (if that is needed). For example, when we had a power failure in the Learning Center (LC) followed by a generator failure, anything connected to the Storage Area Network (SAN) failed. The specific pattern of the outage (SAN + SAN-attached servers + other hardware) allowed us to quickly realize that something had gone wrong in LC, and that it was probably a power problem.
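The same kind of reasoning can be sketched in code. The example below is only an illustration under stated assumptions: the likely_causes function and the dependency map (host names, "lc-power", "san") are hypothetical and do not describe our real inventory. Given which hosts are down, it lists the shared dependencies whose dependents have all failed.

```python
def likely_causes(dependencies, failed):
    """Return shared dependencies all of whose dependents are down.

    dependencies maps each host to the set of things it relies on.
    Results are ordered with the widest-reaching dependency first.
    """
    dependents = {}
    for host, deps in dependencies.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(host)

    failed = set(failed)
    # A dependency is a candidate only if every host relying on it is down.
    causes = [dep for dep, hosts in dependents.items() if hosts <= failed]
    return sorted(causes, key=lambda dep: (-len(dependents[dep]), dep))


# Hypothetical inventory, for illustration only.
DEPENDENCIES = {
    "san": {"lc-power"},
    "db1": {"san", "lc-power"},
    "files1": {"san", "lc-power"},
    "lc-switch": {"lc-power"},
    "web-remote": {"other-building-power"},
}

print(likely_causes(DEPENDENCIES, {"san", "db1", "files1", "lc-switch"}))
```

With this made-up inventory, the power dependency is listed ahead of the SAN because more of the failed hosts rely on it, which mirrors the conclusion we reached by hand.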

There are limits to this. If our internet connection has failed, we aren't going to receive the alerts, so we won't know if there are other problems. Of course, we also won't be able to check our own Drew email from off-campus, reach other systems, or (for those of us who have them) use our BlackBerrys for data services. If the CNS Helpdesk is open, they have contact information to reach staff in case of a serious problem after normal business hours.

Also, a server can still be reachable, even if software running on it has suffered a failure and isn't working.

What More Can We Do?

There are some techniques that would allow us to engage in more active, service-level monitoring: actually making sure that pages can be read from web servers, print jobs can be handled by network print services, files can be read from network drives, etc. Designing and configuring these tests is time-consuming, however, and we haven't been able to devote the effort needed. Also, without a method of automated alerts that doesn't rely on the Learning Center or our internet connection to reach phones or BlackBerry handhelds, the investment may not make sense. We are investigating some "out-of-band" methods of notification that don't depend on the campus internet connection or the Learning Center, but that will not happen in the near future.
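To make the idea of a service-level check concrete, here is a minimal Python sketch of the web-server case described above. It is an assumption-laden example, not something we have deployed: the function name page_is_healthy, the URL, and the expected text are hypothetical. The point is that the check requests a page and inspects the reply, rather than only confirming that the server is reachable.

```python
import urllib.error
import urllib.request


def page_is_healthy(url: str, expected_text: str, timeout: float = 5.0) -> bool:
    """Return True if the page loads with status 200 and contains expected_text."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200 and expected_text in body
    except (urllib.error.URLError, OSError, ValueError):
        # URLError/OSError: connection or HTTP failure; ValueError: malformed URL.
        return False


if __name__ == "__main__":
    # Hypothetical URL and marker text, for illustration only.
    ok = page_is_healthy("http://www.example.edu/", "Welcome")
    print("web service OK" if ok else "web service FAILED")
```

Checking for known text in the page catches the case where the server still answers but the software behind it is returning an error page or empty content. Print and file services would need their own, more involved tests.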
