Redundancy in the data center is a relatively simple concept: resources are duplicated to provide fail-safe mechanisms. Cloud service providers have taken many traditional data center functions out of the hands of local IT managers, but the majority of companies still maintain local data centers for various reasons. CEOs and CIOs still like to know that they control their data and maintain custodianship. They rely on their data center engineers to ensure that services and data are highly available.
Data Center Failure Scenarios & How to Mitigate the Disruption
Component Failure – power supply or disk
The most common type of failure in the modern data center is the failure of a single component. A component can be any part of a system, such as a power supply, disk, or fan. The most common way to protect against this is the n+1 method, where n is the number of components required to keep the system running and one extra is kept available as a spare. If a server needs 1 power supply to keep running, you make sure 2 are available. This is typically enough for most needs. You could provide greater protection by adding more spare components. The trade-off is that more components mean higher cost.
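To make the n+1 arithmetic concrete, here is a minimal Python sketch. The function names and the unit cost are hypothetical and chosen only for illustration; the sketch simply counts components and is not a sizing tool.

```python
def components_to_provision(required: int, spares: int = 1) -> int:
    """Total components to install under an n+spares scheme (n+1 by default)."""
    if required < 1 or spares < 0:
        raise ValueError("required must be >= 1 and spares must be >= 0")
    return required + spares


def survives(installed: int, required: int, failed: int) -> bool:
    """True if enough components remain working after `failed` of them fail."""
    return installed - failed >= required


def provisioning_cost(required: int, spares: int, unit_cost: float) -> float:
    """Cost grows linearly with every spare added (unit_cost is an assumed figure)."""
    return components_to_provision(required, spares) * unit_cost


# The power supply example from the text: the server needs 1, so install 2.
installed = components_to_provision(required=1)
print(installed)                                       # 2
print(survives(installed, required=1, failed=1))       # True
print(survives(installed, required=1, failed=2))       # False
```

Raising `spares` from 1 to 2 lets the system ride out a second failure, at the price of one more unit per system, which is the trade-off described above.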
Assembly Failure – failure of a complete system such as server or storage array
The most effective method of protecting against server failure is virtualization. You can quickly restore a failed virtual server from a recent backup taken before the failure. Better yet, some virtualization platforms support live mirrored images. You should also continue to apply the n+1 approach to full systems: an engineer can ensure that mirrored storage arrays are available and use clustering technologies.
Room Failure, Building Failure, Site Failure
These types of failures include power outages, floods, fires, and ISP outages. In most cases, room failure is treated the same as a building or site failure, since providing a second mirrored room at an entirely different site is not much more expensive than providing a second room in the same building. The key strategy for avoiding room failure altogether goes back to adequate component failure protection, plus environmental monitoring that detects issues before they shut down the entire room. Implement multiple fire suppression systems, climate control units, and flood detection systems. You will need a long-distance mirror to properly handle site failure.
City or Regional Failure
A city or regional failure is generally due to a large event such as an earthquake, hurricane, or other major natural disaster. You should be looking at total mirroring of your data center over long distances. Your mirrored site may even have to be in a different country or continent.
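One way to sanity-check a candidate mirror site is to compute how far it is from the primary site. The Python sketch below uses the standard haversine great-circle formula. The function names are hypothetical, and the minimum separation is an assumption you would have to choose yourself, since this article does not prescribe a figure.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def far_enough(primary: tuple, mirror: tuple, min_km: float) -> bool:
    """True if the mirror site is at least min_km (an assumed threshold) away."""
    return haversine_km(*primary, *mirror) >= min_km


# Example coordinates (latitude, longitude) for London and Paris.
london = (51.5074, -0.1278)
paris = (48.8566, 2.3522)
print(round(haversine_km(*london, *paris)))
print(far_enough(london, paris, min_km=1000))
```

Distance is only a rough proxy. Two sites can be far apart and still share a fault line, a storm track, or a power grid, so treat the result as a first filter rather than a decision.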
Other Failures – Entire Country or World
A failure at this level will likely leave the organization in a position where data center redundancy becomes secondary to other issues. Individuals may be more focused on merely surviving. Events like this include wars or the zombie apocalypse. If you are still concerned about getting back up and running after such an event passes, consider robust long-term storage options so that your systems can be rebuilt when the time is right.
Your company or organization might not need to go full scale on redundancy, but knowing how it all works is important. Engineers should build even the smallest startup with redundancy in mind, so that if the company grows, the foundations for a modern data center are already in place.