Server Reliability and Outside Air Cooling

Monday, July 11, 2011

As ASHRAE and the IT industry have been pushing the temperature and humidity boundaries for servers and IT equipment higher and higher, one of the inevitable questions from data center operators is “what about failures?” There is an assumption that these higher temperatures will lead to much greater downtime, a situation that data centers cannot afford.

So as part of their work prior to publication of the new standards later this year, ASHRAE TC 9.9 developed a simple tool for estimating the increase in failure rates from various inlet temperature strategies.  This tool was developed by Intel based upon their history with server applications of their chips.

The methodology will be quite familiar to most HVAC engineers as it relies upon ASHRAE Bin Hour weather data. Factors called “x-factors” (not related to the TV program) were developed to reflect the relative reliability in a particular temperature bin compared to the baseline server inlet temperature of 20 degrees C, or 68 degrees F. If the x-factor is less than 1.0, then reliability is considered to be better than at the baseline. If the x-factor is greater than 1.0, then reliability is considered to be worse.

However, the committee recognized that the latest data center design practice uses outside air economizers or evaporative cooling instead of mechanical cooling. The result of this practice is that the server inlet temperature will vary with outside air temperature. X-factors were developed for six temperature bins from 15 degrees C (59 degrees F) up to 45 degrees C (113 degrees F). Each bin is 5 degrees C wide (9 degrees F) to account for the temperature rise through the outside air handler or inefficiencies in the air distribution system.

The research behind the methodology indicated that the effect of operating at various temperatures is additive. In other words, operating for 100 hours above 20 degrees C (x-factor over 1.0) and then operating for 100 hours below 20 degrees C (x-factor below 1.0) can yield the same reliability as operating for 200 hours at a steady 20 degrees C.
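One way to read this additivity, as a rough sketch, is to treat the effective x-factor over a period as the time-weighted average of the x-factors for each interval. The symbols h_i (hours in interval i) and x_i (x-factor in interval i), and the values 1.2 and 0.8, are illustrative assumptions and are not taken from the white paper.

```latex
\[
x_{\text{composite}} = \frac{\sum_i h_i \, x_i}{\sum_i h_i},
\qquad
\frac{100 \times 1.2 + 100 \times 0.8}{100 + 100} = 1.0
\]
```

With these assumed values, the 100 warmer hours and the 100 cooler hours cancel out, which matches the 200 hours at a steady 20 degrees C.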

So the approach to calculating the reliability of the servers is:

1. Determine the total number of hours above 15 degrees C for a particular geographical location.

2. Divide the number of hours within each of the six bins by the total hours to get the percentage of operating hours in each bin.

3. Multiply the percentage of hours by the x-factor for that bin.

4. Add the results to get the composite x-factor for the location.

5. Multiply the current failure rate by the composite x-factor to get the new failure rate.
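The five steps can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the TC 9.9 tool itself. The function names, the bin labels, the bin hours and especially the x-factor values below are made-up placeholders; the real x-factors and bin hour data would have to come from the white paper and from the weather data for the location.

```python
def composite_x_factor(bin_hours, x_factors):
    """Time-weighted average x-factor across temperature bins (steps 1-4)."""
    if bin_hours.keys() != x_factors.keys():
        raise ValueError("bin_hours and x_factors must cover the same bins")

    # Step 1: total hours across all bins.
    total_hours = sum(bin_hours.values())
    if total_hours <= 0:
        raise ValueError("total hours must be positive")

    # Steps 2-4: fraction of hours in each bin, times that bin's x-factor,
    # summed over all bins.
    return sum(
        (hours / total_hours) * x_factors[bin_label]
        for bin_label, hours in bin_hours.items()
    )


def adjusted_failure_rate(baseline_rate, composite):
    """Step 5: scale the current failure rate by the composite x-factor."""
    return baseline_rate * composite


# PLACEHOLDER data for illustration only: these are NOT the white paper's
# x-factors and NOT real bin hours for any city.
example_x_factors = {
    "15-20 C": 0.8, "20-25 C": 1.0, "25-30 C": 1.2,
    "30-35 C": 1.4, "35-40 C": 1.6, "40-45 C": 1.8,
}
example_bin_hours = {
    "15-20 C": 2000, "20-25 C": 2000, "25-30 C": 1500,
    "30-35 C": 1000, "35-40 C": 500, "40-45 C": 0,
}

composite = composite_x_factor(example_bin_hours, example_x_factors)
new_rate = adjusted_failure_rate(0.002, composite)  # 0.2% baseline, as a fraction
print(f"composite x-factor: {composite:.3f}, new failure rate: {new_rate:.4%}")
```

Keeping the hours and the x-factors in separate tables means the same x-factor table can be reused with bin hour data for any number of locations.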

There are several examples in the Appendix of the TC 9.9 white paper that show that even very hot, or hot and humid, cities end up with reliability figures that are better than one might expect. For example, the composite x-factor for Phoenix in their examples ranged from 1.2 to 1.4 (depending upon cooling method). If the normal annual failure rate for data centers in Phoenix is 0.2%, then the outside air cooled data centers would have failure rates of 0.24% to 0.28%. Looking at this another way, if the normal data center in Phoenix had 1,000 servers, then it would normally lose 2 servers in a year. By switching to outside air or an evaporative cooling system such as the Aztec ASC product, the algorithm would predict losing 2.4 to 2.8 servers per year, which is less than one additional server per year.
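The Phoenix arithmetic can be checked with a short Python sketch. The server count, the 0.2% baseline rate and the 1.2 to 1.4 x-factor range come from the example above; the function name is a hypothetical helper used only for illustration.

```python
def expected_failures(server_count, baseline_rate, x_factor):
    """Expected server failures per year for a given composite x-factor."""
    return server_count * baseline_rate * x_factor


servers = 1000
baseline_rate = 0.002  # 0.2% per year, expressed as a fraction

baseline = expected_failures(servers, baseline_rate, 1.0)
for x_factor in (1.0, 1.2, 1.4):
    failures = expected_failures(servers, baseline_rate, x_factor)
    extra = failures - baseline
    print(f"x-factor {x_factor:.1f}: {failures:.1f} failures per year "
          f"({extra:.1f} more than baseline)")
```

For the two Phoenix x-factors, the difference from the baseline works out to 0.4 and 0.8 servers per year.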

More detail can be found in the TC 9.9 white paper.  The paper can be downloaded from http://tc99.ashraetcs.org/ .