Cooling Capacity as a Bottleneck
Welcome to Keep Your Cool - a blog about simple cooling optimization strategies for the busy data center operator.
One of the takeaways from the DCW conference last month was that cooling capacity is becoming a bottleneck. It was proposed that 64% of data centers are running out of capacity. This should come as no surprise to anyone in the industry, as the explosion in adoption of GPU-heavy servers has far exceeded the watts per square foot that many data centers were designed for when they were constructed.
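To put rough numbers on that mismatch, here is a quick back-of-the-envelope comparison in Python. The design density, rack footprint, and GPU rack load below are illustrative assumptions, not figures from any particular facility:

```python
# Back-of-the-envelope density comparison (all values are assumed for illustration).

legacy_design_w_per_sqft = 100   # assumed legacy design density, W per sq ft
rack_footprint_sqft = 25         # assumed rack footprint incl. its share of aisle space
gpu_rack_load_w = 40_000         # assumed load of a modern GPU-heavy rack, W

design_budget_w = legacy_design_w_per_sqft * rack_footprint_sqft
actual_density_w_per_sqft = gpu_rack_load_w / rack_footprint_sqft

print(f"Design power budget for that footprint: {design_budget_w:,} W")
print(f"GPU rack density: {actual_density_w_per_sqft:,.0f} W/sq ft "
      f"({gpu_rack_load_w / design_budget_w:.0f}x the design density)")
```

With these assumed numbers, a single GPU rack lands at roughly sixteen times the density the floor was originally engineered to cool.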
So why not just add more cooling? The answer is rarely that simple. Consider the following challenges:
Space: Is there room to add CRACs or in-row cooling units?
Integration: Can new units tie into existing heat rejection systems?
Capacity: Can chillers, towers, or dry coolers handle the added load?
Power: Is there enough electrical capacity to power more cooling—and still support additional IT load?
Water Use: Will it increase? Are there environmental or permitting limits?
We are just scratching the surface here, and we have not even considered the disruptions to service that implementing such a plan might cause.
Implementing liquid cooling is often proposed as a solution for the high-wattage GPU servers being deployed in support of AI. However, liquid cooling is best suited to greenfield installations that were engineered for this infrastructure from the planning stage onward. Retrofitting legacy data centers with liquid cooling is both cumbersome and costly, and the solution does not integrate easily with legacy cooling systems.
It was also proposed at the DCW conference that 37% of cooling capacity at data centers is stranded. There are a number of reasons this might be true; airflow management is the first that comes to mind.
Is the cooling located where it is needed?
Or think of it this way: has the IT load been placed where the cooling is optimal?
Is airflow mixing taking place, reducing the efficiency of the installed cooling?
Allowing hot and cold air to mix does not get you the best bang for your buck. A good containment solution will improve cooling efficiency and unlock some of the stranded capacity at the data center. We could go on and on about cable holes in the floor without air-blocking grommets, missing blanking plates in racks, and misplaced or excessive perforated tiles.
The practice that always makes me shake my head when I encounter it is the switch or appliance mounted backwards, for some sort of convenience, taking in air from the hot aisle and exhausting superheated air into the cold aisle.
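To see how mixing strands capacity, here is a rough sketch built on the standard sensible-heat rule of thumb for air: Q in BTU/hr is roughly 1.08 x CFM x delta-T in degrees F. The airflow and delta-T values are assumptions chosen only for illustration; the point is that when bypass air depresses the return-air delta-T, the unit delivers less of its rated capacity:

```python
# Sensible cooling delivered by a CRAC: Q (BTU/hr) ~= 1.08 * CFM * delta_T (deg F).
# All numbers below are assumed for illustration only.

def sensible_cooling_btuh(cfm: float, delta_t_f: float) -> float:
    """Approximate sensible cooling for air at standard conditions."""
    return 1.08 * cfm * delta_t_f

CRAC_AIRFLOW_CFM = 12_000   # assumed airflow of one CRAC unit

# Good containment: return air comes back hot, so the full delta-T is available.
contained = sensible_cooling_btuh(CRAC_AIRFLOW_CFM, delta_t_f=20)

# Hot/cold mixing: cold air short-circuits to the return and depresses delta-T.
mixed = sensible_cooling_btuh(CRAC_AIRFLOW_CFM, delta_t_f=12)

print(f"Contained:   {contained:,.0f} BTU/hr (~{contained / 12_000:.1f} tons)")
print(f"With mixing: {mixed:,.0f} BTU/hr (~{mixed / 12_000:.1f} tons)")
print(f"Stranded:    {1 - mixed / contained:.0%} of the unit's sensible capacity")
```

Under these assumptions, mixing leaves about 40% of the unit's sensible capacity on the table, which is the same order of magnitude as the stranded-capacity figure cited above.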
Conclusion
Every data center has a gating factor that ultimately determines what can be accomplished, and that is the maximum power capacity installed. How that power is budgeted across the different demands of the data center determines what is possible to accomplish.
IT Load should be the single greatest user of power from the capacity available.
Cooling Load is generally the next greatest user of power from the available capacity.
Overhead in the form of lights, security, monitoring, and sensors all takes a slice of the pie of available capacity.
UPS power conversion losses and floor-mounted PDU transformer inefficiencies can take a surprisingly large bite out of the available capacity in older systems.
Therefore, the solution to the cooling bottleneck does not lie solely in adding more units. You must look at the overall power budget and the efficiency of each power user, determine what can be improved, and establish how much capacity those improvements free up to support more cooling and IT load.
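As a minimal sketch of that accounting, with every figure assumed purely for illustration, consider how trimming cooling power and conversion losses frees capacity for additional load:

```python
# Minimal power-budget sketch (all figures in kW, assumed for illustration only).

SITE_CAPACITY_KW = 2_000  # assumed installed utility capacity

budget = {
    "IT load":                      1_100,
    "Cooling":                        700,
    "UPS/PDU conversion losses":      120,
    "Lights/security/monitoring":      40,
}

used = sum(budget.values())
print(f"Used {used} kW of {SITE_CAPACITY_KW} kW; headroom {SITE_CAPACITY_KW - used} kW")

# Suppose containment and setpoint tuning cut cooling power by 15%, and a UPS/PDU
# refresh halves conversion losses (both improvements are assumptions).
budget["Cooling"] *= 0.85
budget["UPS/PDU conversion losses"] *= 0.5

freed = used - sum(budget.values())
print(f"Efficiency gains free up ~{freed:.0f} kW for additional IT and cooling load")
```

The exact percentages will vary site by site, but the structure of the exercise is the same: every kilowatt recovered from cooling or conversion losses is a kilowatt available for the loads the data center exists to serve.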
About the Author
Gregg Haley is a data center and telecommunications executive with more than 30 years of leadership experience. He most recently served as the Senior Director of Data Center Operations - Global for Limelight Networks. Gregg provides data center assessment and optimization reviews, showing businesses how to reduce operating expenses by identifying energy conservation opportunities. Through infrastructure optimization, energy expenses can be reduced by 10% to 30%.
In addition to Gregg's data center efforts, he holds a certification from the Disaster Recovery Institute International (DRII) as a Business Continuity Planner. In November 2005, Gregg was a founding member and Treasurer of the Association of Contingency Planners - Greater Boston Chapter, a non-profit industry association dedicated to the promotion of, and education in, Business Continuity Planning. Gregg served on the chapter's Board of Directors for its first four years. Gregg is also a past member of the American Society for Industrial Security (ASIS).
Gregg currently serves as the Principal Consultant for Purkay Labs.
About Purkay Labs
Purkay Labs creates targeted tools and software to help data centers validate and optimize airflow at the rack level. Our flagship product, AUDIT-BUDDY, delivers fast, portable temperature and delta-T insights—no installation, no shutdown required. Whether you're planning an upgrade, validating containment, or troubleshooting a hot spot, our systems give you the data you need to make smarter cooling decisions.