Causes and costs of data center outages

Two related studies point to power infrastructure as a frequent cause, and put a significant price tag on downtime for many users.

By Patrick McLaughlin

Two research reports published just a few months apart look closely at the causes and the costs of unplanned data center outages. The reports, entitled “National Survey on Data Center Outages” and “Calculating the Cost of Data Center Outages,” were researched and written by Ponemon Institute (www.ponemon.org), and both were commissioned by Emerson Network Power (www.emersonnetworkpower.com). The “causes” report was published in September 2010 and the “cost” report in February 2011.

In an interview with Cabling Installation & Maintenance, Ponemon Institute founder and chairman Larry Ponemon, Ph.D. explained, “The two studies are related, and not tangentially. When we scoped out our original project, we needed to understand data center outages: causes and remedies available. For that we surveyed about 450 people with some responsibility for data center operation. We found that about 95 percent of the participating organizations had experienced one or more unplanned outages within the previous two years.” Therefore, he said, most of those surveyed for the “causes” report had firsthand experience of an unplanned outage.

The two reports are not based on the same set of data, as Dr. Ponemon noted. “The first is a study of individuals. The second is a study of organizations.” That first study, he said, revealed that the occasion—an unplanned data center outage—“is a pretty common set of events. Companies do have full or partial disruptions or shutdowns, for a lot of reasons.”

As the first study indicates, approximately 80 percent of respondents identified the root cause(s) of their shutdown(s). “Twenty percent either don’t know or don’t want to tell you,” Dr. Ponemon stated. Among those who identified the reasons for outages, the uninterruptible power supply (UPS) featured prominently, occupying three of the top four spots on that unenviable list. The most frequently cited cause was UPS battery failure (65 percent), followed by UPS capacity being exceeded (53 percent). Third was human error or accidental activation of the emergency power-off (EPO) switch (51 percent), and fourth was UPS equipment failure (49 percent). Trailing those four causes were water incursion (35 percent), heat-related/computer-room air-conditioner (CRAC) failure (33 percent) and power-distribution unit (PDU)/circuit-breaker failure (also 33 percent).

When a data center experiences an unplanned outage, the causes may be many but the costs almost assuredly will be steep.

Dr. Ponemon noted that the most common response to an outage was to repair, replace or add equipment or to contact the equipment vendor for support.

He added that among survey respondents, “fifty-seven percent believe all or most unplanned outages could be prevented.”

Trends take hold

The report states that the primary causes of data center downtime have a direct correlation to current trends and challenges that data centers face. The following four paragraphs are taken from the report, identifying and explaining what the report describes as “four key industry drivers that directly impact availability.”

Increasing data center capacity. As demand for IT applications grows and more servers and storage are added to the data center, the supporting IT infrastructure must grow as well. If the IT demand outgrows what the critical infrastructure can supply, downtime will occur. Downtime root cause correlation: UPS capacity exceeded and PDU/circuit breaker failure.

Rising rack densities. With the introduction of blade servers and other high-performance IT equipment, the typical server rack will contain well over 10 kW of IT. High heat densities will require precision cooling closer to the server. However, depending on the cooling design, this also could bring water closer to the server. Downtime root cause correlation: Water incursion and heat-related/CRAC failure.

Data center efficiency. Data centers consume a lot of electricity and many managers are evaluating high-efficiency power and cooling technologies such as transformerless UPS or air economizers that provide cost reductions but may not provide the highest reliability or ideal operating environment. Efficiency should not come at the expense of availability, especially in critical data centers. Downtime root cause correlation: UPS failure, heat-related/CRAC failure and IT equipment failure.

Need for infrastructure management and control. The data center manager’s requirements of improving availability, increasing efficiency, maximizing density and planning for capacity all can be managed through infrastructure management. From monitoring the float charge of a battery and knowing the optimal placement of a new server to even having a people-free facility with remote resolution, all are aspects of successful infrastructure management. Downtime root cause correlation: UPS battery failure and accidental EPO/human error.

Counting costs

As Dr. Ponemon stated, the two studies are related, and the information gathered for the “causes” study shaped some of the information sought in the “costs” study. In conducting the research, the institute used activity-based costing, a method it describes as capturing the core process-related activities that drive a range of expenditures associated with a company’s response to a data center outage. Specifically, Ponemon Institute defined the activities and costs as follows.

  • Detection cost: Activities associated with the initial discovery and subsequent investigation of the partial or complete outage incident.
  • Containment cost: Activities and associated costs that enable a company to reasonably prevent an outage from spreading, worsening or causing greater disruption.
  • Recovery cost: Activities and associated costs that relate to bringing the organization’s networks and core systems back to a state of readiness.
  • Ex-post response cost: All after-the-fact incidental costs associated with business disruption and recovery.
  • Equipment cost: The cost of new equipment purchases and repairs, including refurbishment.
  • IT productivity loss: The lost time and related expenses associated with IT personnel downtime.
  • User productivity loss: The lost time and related expenses associated with end-user downtime.
  • Third-party cost: The cost of contractors, consultants, auditors and other specialists engaged to help resolve unplanned outages.
  • Lost revenues: The total revenue loss from customers and potential customers because of their inability to access core systems during the outage period.
  • Business disruption (consequences): The total economic loss of the outage including reputational damages, customer churn and lost business opportunities.

In the executive summary of “Calculating the Cost of Data Center Outages,” the institute states there is evidence that IT administrators underestimate the economic impact unplanned outages have on their operations.

The analysis provided in the report focuses on representative samples of organizations in the United States that experienced at least one complete or partial unplanned outage during the 12 months prior to the survey being completed. Forty-one separate data centers participated.

Ranked by type and cost

“Our research,” the executive summary continues, “indicates data center outages have serious financial consequences for an organization. According to the study, the cost of a data center outage ranges from a minimum cost of $38,969 to a maximum of $1,017,746 per organization, with an overall average cost of $505,502 per incident.” Not surprisingly, the study found that the total cost of an outage is systematically related to the outage’s duration and to the size of the data center.

Some causes prove more costly than others. Here is a list of the average total cost of an outage classified by the outage’s root cause.

  • IT equipment failure: $750,326
  • UPS system failure: $687,700
  • Other root causes: $612,993
  • Water, heat or CRAC failure: $489,100
  • Generator failure: $463,890
  • Weather-related: $395,065
  • Accidental/human error: $298,099

The study also charted the average cost of an outage organized by nine categories of costs.

  • Business disruption: $179,827
  • Lost revenue: $118,080
  • End-user productivity: $96,226
  • IT productivity: $42,530
  • Detection: $22,347
  • Recovery: $20,884
  • Ex-post activities: $9,537
  • Equipment costs: $9,063
  • Third parties: $7,008
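
As a quick arithmetic check on the nine category figures above, the short Python sketch below sums them and expresses each as a share of the total. The dictionary and the cost_shares helper are illustrative names chosen for this example, not anything taken from the Ponemon Institute report; the dollar amounts are copied from the list above.

```python
# Average cost per outage by category, in US dollars, as listed in the article.
category_costs = {
    "Business disruption": 179_827,
    "Lost revenue": 118_080,
    "End-user productivity": 96_226,
    "IT productivity": 42_530,
    "Detection": 22_347,
    "Recovery": 20_884,
    "Ex-post activities": 9_537,
    "Equipment costs": 9_063,
    "Third parties": 7_008,
}


def cost_shares(costs):
    """Return (total, {category: percent of total}) for a cost breakdown.

    Hypothetical helper for illustration only.
    """
    total = sum(costs.values())
    shares = {name: 100 * value / total for name, value in costs.items()}
    return total, shares


total, shares = cost_shares(category_costs)

print(f"Total of all categories: ${total:,}")
# List categories from largest to smallest share of the total.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22} {pct:5.1f}%")
```

Summed this way, the nine categories come to $505,502, the same figure the study reports as the overall average cost per incident. The three largest categories (business disruption, lost revenue and end-user productivity) together account for roughly 78 percent of that total.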

The studies’ sponsor, Emerson Network Power, has also authored a paper providing its analysis of the Ponemon Institute cost study. That paper is entitled “Understanding the cost of data center downtime: An analysis of the financial impact on infrastructure vulnerability.”

Both of the Ponemon Institute reports as well as the Emerson-authored paper are available on Emerson Network Power’s Web site.

Patrick McLaughlin is chief editor of Cabling Installation & Maintenance.
