Unlocking Efficiency: Top Ten Reasons You Need a Thermal Survey

Welcome to Keep Your Cool, a series tackling simple cooling optimization strategies for busy data center operators, by former busy data center operator Gregg Haley.

Efficient data center cooling is no longer a luxury but a necessity, impacting both energy costs and overall performance. Thermal surveys are a crucial tool for optimizing your data center because they evaluate the environment and identify problem areas. Today, we’ll explore what a thermal survey is and the top ten reasons to consider conducting one yourself or using a third-party service.

What is a Thermal Survey?

A thermal survey, a critical component of data center optimization, is a systematic examination that employs advanced thermal imaging technology to assess and analyze temperature variations within a data center environment. You can conduct a thermal survey by measuring and mapping temperature, humidity, and dew point at multiple elevations and across all aisles. This comprehensive evaluation provides a detailed visual representation of the thermal dynamics, identifying potential issues such as hot spots, over-cooling, or inefficient airflow. By utilizing purpose-built tools like Audit-Buddy, thermal surveys generate valuable insights, allowing data center operators to fine-tune cooling strategies, optimize energy efficiency, and ensure the seamless performance of critical IT infrastructure.
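
To make the output of such a survey concrete, here is a minimal sketch of how per-rack readings might be organized and collapsed into a heat-map grid. The record fields and helper function are illustrative assumptions, not Audit-Buddy's actual data format.

```python
# Illustrative only: a hypothetical layout for thermal-survey readings,
# not Audit-Buddy's actual data format.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Reading:
    aisle: str          # e.g. "A1"
    rack: int           # rack position within the aisle
    elevation: str      # "bottom", "middle", or "top"
    temp_f: float       # dry-bulb temperature, degrees Fahrenheit
    rh_pct: float       # relative humidity, percent
    dew_point_f: float  # dew point, degrees Fahrenheit

def static_map(readings: list[Reading]) -> dict[str, dict[int, float]]:
    """Collapse readings into one temperature per (elevation, rack position):
    the raw material of a static heat map for a single aisle."""
    grid: dict[str, dict[int, float]] = defaultdict(dict)
    for r in readings:
        grid[r.elevation][r.rack] = r.temp_f
    return dict(grid)

readings = [
    Reading("A1", 1, "top", 78.4, 42.0, 53.1),
    Reading("A1", 1, "middle", 73.2, 45.0, 51.0),
    Reading("A1", 1, "bottom", 68.9, 48.0, 48.6),
]
print(static_map(readings))
```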

Top 10 Reasons Why You Need a Thermal Survey

1. You don’t have enough sensors across the room

Ideally, each cabinet in your data center should be equipped with at least three temperature and humidity sensors at both the front and back. However, the sheer volume of data points generated by that many sensors can make continuous monitoring seem cost-prohibitive in terms of both time and resources. Often, data center operators resort to ad hoc sensor placement (maybe one sensor every couple of aisles), leaving gaps in coverage. A thermal survey addresses this challenge by collecting information at each cabinet across the floor. Since it is periodic rather than everyday monitoring, you can easily distill the data and determine whether you have the right sensor placement for continuous monitoring.

2. You don’t have enough sensors across the rack

Measuring at multiple heights on a cabinet within a data center is crucial for obtaining a comprehensive understanding of the thermal dynamics and ensuring effective airflow management. Different heights on a cabinet may experience variations in temperature and humidity due to factors such as equipment placement, rack design, and overall airflow patterns. By capturing data at various elevations, you gain insights into potential hot spots, airflow obstructions, or inefficient cooling distribution within the cabinet. This detailed information allows data center operators to make informed decisions about optimizing the environment, identifying and rectifying any disparities in temperature, and ultimately enhancing the overall efficiency and performance of the critical IT infrastructure. Utilizing tools like Purkay Labs' thermal surveys, which measure at multiple heights on every energized rack, provides a comprehensive view that goes beyond a single reference point, ensuring a more accurate representation of the thermal landscape within the data center.

3. You want to avoid SLA disputes

If you are in the colocation business, you most likely have contract language with your customers concerning thermal Service Level Agreements (SLAs) and financial remedies when they are breached. Servers today typically report overheating through the client's own network monitoring, but as a colo provider you might not have access to that information. A hot spot in a data hall can occur for many reasons, but if it goes undetected it can expose your business to contractual disputes over SLAs. Identifying potential and actual hot spots before a complaint is filed allows you to take remedial action and correct the situation before it becomes a problem you must deal with. The best part of our tool is that you can collect this information without touching the customer's equipment.


4. You’re overcooling your Data Center

The overcooling of data center aisles is probably the most common waste of energy associated with the cooling infrastructure in a data center. I have been in data centers where poor airflow management resulted in three times the required cooling in operation to maintain temperatures. Certain areas were grossly overcooled. Every aisle in a data center is a unique environment whose needs are dictated by the kilowatts the equipment in that aisle is consuming and the CFM of air needed to attain a target Delta-T through the server. A Delta-T of 10 to 20 degrees Fahrenheit is considered the good range. Knowing this target number and the kilowatt draw of the aisle, one can calculate the CFM needed to attain that target. Comparing that figure to the CFM actually being supplied to the aisle quickly reveals areas of potential overcooling, which stand out clearly in the Static Heat Maps the Purkay Labs assessment provides.
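
The calculation described above takes only a few lines. Here is a minimal sketch using the standard sensible-heat approximation CFM = BTU/hr ÷ (1.08 × ΔT°F), where 1 kW equals roughly 3,412 BTU/hr; the load and airflow figures are assumed for illustration.

```python
# Back-of-the-envelope airflow check for one aisle, using the standard
# sensible-heat formula CFM = BTU/hr / (1.08 * delta_T_F).
# The sample load and supply numbers are assumptions, not measured values.

BTU_PER_KW_HR = 3412         # 1 kW of IT load rejects ~3,412 BTU/hr of heat
SENSIBLE_HEAT_FACTOR = 1.08  # common approximation for air at sea level

def required_cfm(aisle_kw: float, target_delta_t_f: float) -> float:
    """CFM of supply air needed to carry aisle_kw of heat at the target Delta-T."""
    return aisle_kw * BTU_PER_KW_HR / (SENSIBLE_HEAT_FACTOR * target_delta_t_f)

aisle_kw = 50.0          # assumed IT load in the aisle
target_dt = 15.0         # target Delta-T, middle of the 10-20 F range
supplied_cfm = 25_000.0  # assumed airflow actually delivered to the aisle

needed = required_cfm(aisle_kw, target_dt)
print(f"Required: {needed:,.0f} CFM, supplied: {supplied_cfm:,.0f} CFM")
print(f"Oversupply factor: {supplied_cfm / needed:.1f}x")
```

In this made-up example the aisle receives about 2.4 times the airflow it needs, exactly the kind of overcooling a static heat map makes visible.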


5. You want to baseline your current environment before making a big change

Whenever you are embarking on a program to make changes in the data center environment, it is important to have a baseline reading of the starting point, so that when the work is completed and a second set of readings is taken, the changes and improvements can be easily quantified. It also verifies that the changes did not produce unintended results. Planning a journey always begins with knowing your starting point.


6. You’ve made changes to the Data Center layout, and your CFD is no longer accurate.

[Image: Static Heat Map]

When you first design a data center, you rely on the CFD model, with those captivating flowing lines of color, to show what MIGHT happen based on the initial site criteria. However, once you become operational and start making changes to the data center layout (adding and removing servers, for example), the environment changes. Thermal surveys, like the ones Purkay Labs provides, take real-time measurements of your current environment. We generate temperature, humidity, and dew point Static Maps so you can view the thermal stratification and temperature changes across the face plane of the aisle, regardless of whether it is a hot or cold aisle. This allows you to SEE what is taking place rather than relying on one or a few reference points. Three elevation readings taken at every energized rack paint a more complete picture of what is taking place.


7. You want to see how the temperatures in a specific area change over the course of a day or week as server workloads change.

Going back to point one, you may not have consistent coverage across every cabinet, but there are going to be times when you want to know how an aisle is performing in different operational conditions. Monitoring temperature changes across a specific area over the course of hours, days, or weeks provides valuable insights into the dynamic nature of the environment. This analysis enables data center operators to identify patterns, trends, and fluctuations in temperature that may coincide with varying server workloads, scheduled changes in cooling infrastructure, or external environmental factors. By understanding how temperature evolves over time, operators can make informed decisions about adjusting cooling strategies, optimizing airflow, and ensuring consistent performance. Tools like Purkay Labs' thermal surveys, capable of placing stands across the data center floor to record temperature samples at predetermined intervals, offer a comprehensive view of the temporal dynamics, empowering operators to proactively address any temperature-related inefficiencies and enhance overall data center efficiency.
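
As a rough sketch of what that interval data looks like once collected, the snippet below summarizes inlet-temperature samples by hour of day; the sampling scheme and values are made up for illustration and are not Purkay Labs' actual output format.

```python
# Summarize interval samples from a temporary sensor stand by hour of day.
# The (hour, inlet_temp_f) pairs are made-up illustrative data.
from statistics import mean

samples = [(9, 71.2), (9, 71.8), (13, 76.4), (13, 77.1), (21, 69.5), (21, 69.9)]

by_hour: dict[int, list[float]] = {}
for hour, temp in samples:
    by_hour.setdefault(hour, []).append(temp)

for hour in sorted(by_hour):
    temps = by_hour[hour]
    print(f"{hour:02d}:00  mean {mean(temps):.1f}F  "
          f"min {min(temps):.1f}F  max {max(temps):.1f}F")
```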


8. You want to understand your cooling efficiency

Server Delta-T is an indicator of cooling efficiency in the data center. Is the supplied cooling air entering the server, absorbing heat (thus cooling the server), and being exhausted into the hot aisle for return to the cooling infrastructure? The Delta-T is the temperature difference between the inlet and outlet temperatures. Any reading between 10 and 20 degrees Fahrenheit is deemed good. When used in conjunction with the supply and return temperatures at the CRAH, it further helps define efficiency. Example: 70-degree air is measured at the server inlet; 90-degree air is measured at the server exhaust port; 80-degree air is measured at the CRAH return. What caused the temperature to drop between the exhaust port and the CRAH return? The mixing of cold air that bypassed the server and leaked into the return airflow is the culprit. It could be a missing blanking plate, an open wiring hole in the floor at the back of the rack, or an oversupplied cold aisle without containment. All conspire to reduce efficiency.
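
The mixing arithmetic behind that example can be made explicit. A minimal sketch, assuming the leaked cold air is at the server-inlet temperature and the return stream is a simple mass-weighted mix of exhaust and bypass air:

```python
# Estimate what share of the return airstream bypassed the servers,
# assuming a simple mass-weighted mix of exhaust air and bypass air
# at the supply (server-inlet) temperature.

def bypass_fraction(t_supply_f: float, t_exhaust_f: float, t_return_f: float) -> float:
    """Fraction of return airflow that never passed through a server.
    Solves t_return = f * t_supply + (1 - f) * t_exhaust for f."""
    return (t_exhaust_f - t_return_f) / (t_exhaust_f - t_supply_f)

f = bypass_fraction(t_supply_f=70.0, t_exhaust_f=90.0, t_return_f=80.0)
print(f"Estimated bypass share of return air: {f:.0%}")  # 50%
```

Under those assumptions, fully half of the air returning to the CRAH in the example never cooled a server.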


9. You want to calculate your cooling waste

There are multiple important Delta-Ts in the data center, but by comparing the server ΔT to the ΔT of the air moving through the CRAC units (the CRAC ΔT), you can calculate the amount of cooling waste. Ideally, the CRAC ΔT and the server/cabinet ΔT should be the same. This is rarely the case; a mismatch indicates that some air is not reaching the server intake (bypass airflow), or that some exhaust air is returning to the server intake (recirculation airflow). By measuring the server ΔT and the CRAC ΔT, you can evaluate your air management and determine what is causing hot spots or other airflow issues. With Purkay Labs’ software, you can quickly visualize your cold air performance and effectiveness.
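
Under the simplifying assumption that the mismatch is all bypass (no recirculation), the two ΔTs translate directly into a waste estimate: the CRAC and the servers move the same heat, so airflow scales inversely with ΔT. The sample ΔT values below are assumptions for illustration.

```python
# Rough cooling-waste estimate from the two Delta-Ts. Because the CRAC
# and the servers carry the same heat load, airflow scales inversely
# with Delta-T: crac_cfm / server_cfm = server_dt / crac_dt.
# Sample Delta-T values are illustrative assumptions.

def bypass_share_of_crac_air(server_dt_f: float, crac_dt_f: float) -> float:
    """Fraction of CRAC airflow that never passes through a server,
    in a simple model with bypass airflow but no recirculation."""
    return 1.0 - crac_dt_f / server_dt_f

share = bypass_share_of_crac_air(server_dt_f=20.0, crac_dt_f=12.0)
print(f"~{share:.0%} of CRAC airflow is bypass (wasted fan energy)")
```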


10. You are interested in reducing your overall carbon footprint.

Everyone is aware of their carbon footprint these days. Reducing your carbon footprint usually comes with the benefit of a better PUE, and it can also reduce the Scope 2 emissions you report. This is usually accompanied by reduced OPEX as well. Optimizing the cooling in a data center has many benefits for the business. A Purkay Labs assessment can start you on the right track to improving efficiency. Do you know where you stand on efficiency today? Are there opportunities for additional improvements? As everyone knows, “You can’t manage what you can’t measure.”

About the Author

Gregg Haley is a data center and telecommunications executive with more than 30 years of leadership experience. He most recently served as the Senior Director of Data Center Operations - Global for Limelight Networks. Gregg provides data center assessment and optimization reviews, showing businesses how to reduce operating expenses by identifying energy conservation opportunities. Through infrastructure optimization, energy expenses can be reduced by 10% to 30%.

In addition to his data center efforts, Gregg holds a certification from the Disaster Recovery Institute International (DRII) as a Business Continuity Planner. In November 2005, Gregg was a founding member and Treasurer of the Association of Contingency Planners - Greater Boston Chapter, a non-profit industry association dedicated to the promotion and education of business continuity planning. Gregg served on the chapter's Board of Directors for its first four years. He is also a past member of the American Society for Industrial Security (ASIS).

Gregg currently serves as the Principal Consultant for Purkay Labs.
