Mastering the Heat: Cooling & Power Solutions for a 50kW Rack Density AI Data Center

by Sean Murphy on 2/27/24 11:42 AM

As artificial intelligence (AI) continues to reshape industries and drive innovation, the demand for high-performance computing in data centers has reached unprecedented levels. Managing the cooling and power requirements of a 50kW rack density AI data center presents a unique set of challenges. In this blog post, we will explore effective strategies and cutting-edge solutions to ensure optimal performance and efficiency in such a demanding environment. 


Precision Cooling Systems

The heart of any high-density data center is its cooling system. For a 50kW rack density AI data center, precision cooling is non-negotiable. Invest in advanced cooling solutions such as in-row or overhead cooling units that can precisely target and remove heat generated by high-density servers. These systems offer greater control and efficiency compared to traditional perimeter cooling methods.

Liquid Cooling Technologies

Liquid cooling has emerged as a game-changer for high-density computing environments. Immersion cooling systems or direct-to-chip solutions can effectively dissipate the heat generated by AI processors, allowing for higher power densities without compromising reliability. Explore liquid cooling options to optimize temperature control in your data center.

High-Efficiency Power Distribution

To meet the power demands of a 50kW rack density, efficient power distribution is paramount. Implementing high-voltage power distribution systems and exploring alternative power architectures, such as busway systems, can enhance energy efficiency and reduce power losses. This not only ensures reliability but also contributes to sustainability efforts.
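To see why higher-voltage distribution cuts losses, remember that conductor loss is resistive (P_loss = I²R) and that, for the same power delivered, raising the voltage lowers the current. Here is a back-of-envelope sketch in Python; the voltages and feeder resistance are illustrative assumptions, and the single-phase simplification ignores three-phase details:

    # Illustrative I^2 * R feeder-loss comparison for one 50kW rack feed.
    def feeder_loss_watts(power_w, voltage_v, resistance_ohm):
        current_a = power_w / voltage_v           # I = P / V (unity power factor assumed)
        return current_a ** 2 * resistance_ohm    # P_loss = I^2 * R

    RACK_POWER_W = 50_000
    FEEDER_RESISTANCE_OHM = 0.02  # assumed round-trip conductor resistance

    for voltage_v in (208, 415):  # lower- vs. higher-voltage distribution
        loss_w = feeder_loss_watts(RACK_POWER_W, voltage_v, FEEDER_RESISTANCE_OHM)
        print(f"{voltage_v} V feed: ~{loss_w:,.0f} W lost in the feeder")

Roughly doubling the voltage quarters the resistive loss, which is the arithmetic behind higher-voltage busway distribution.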

Redundancy and Resilience

A high-density AI data center demands a robust power and cooling infrastructure with built-in redundancy. Incorporate N+1 or 2N redundancy models for both cooling and power systems to mitigate the impact of potential failures. Redundancy not only enhances reliability but also allows for maintenance without disrupting critical operations.
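The reliability benefit of redundancy can be estimated with basic probability: a system of independent units survives as long as enough of them remain healthy. A minimal sketch, assuming 0.99 availability per unit (an illustrative figure, not a vendor specification):

    # P(at least `required` of `units` independent units are up).
    from math import comb

    def system_availability(units, required, unit_avail):
        return sum(
            comb(units, k) * unit_avail**k * (1 - unit_avail)**(units - k)
            for k in range(required, units + 1)
        )

    a = 0.99  # assumed availability of one cooling or power unit
    print(f"N   (2 of 2 needed): {system_availability(2, 2, a):.6f}")
    print(f"N+1 (2 of 3 needed): {system_availability(3, 2, a):.6f}")
    print(f"2N  (2 of 4 needed): {system_availability(4, 2, a):.6f}")

Even one extra unit moves availability from roughly 0.980 to better than 0.9997 in this toy model, which is why N+1 is the usual starting point.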

Dynamic Thermal Management

Utilize intelligent thermal management systems that adapt to the dynamic workload of AI applications. These systems can adjust cooling resources in real-time, ensuring that the infrastructure is optimized for varying loads. Dynamic thermal management contributes to energy efficiency by only using the necessary resources when and where they are needed.
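As a concrete illustration of how such a system behaves, here is a minimal proportional control loop that nudges cooling output toward a supply-air setpoint as sensor readings change. The setpoint, gain, and readings are all assumed example values, not vendor recommendations:

    SETPOINT_C = 24.0  # assumed target inlet temperature
    GAIN = 0.08        # assumed fraction of capacity added per deg C of error

    def cooling_command(measured_temp_c, current_output):
        """Return a new cooling output (0.0 to 1.0), nudged toward the setpoint."""
        error = measured_temp_c - SETPOINT_C
        return min(1.0, max(0.0, current_output + GAIN * error))

    output = 0.5
    for temp_c in (26.5, 25.8, 24.9, 24.2):  # example readings over time
        output = cooling_command(temp_c, output)
        print(f"inlet {temp_c:.1f} C -> cooling output {output:.2f}")

Production systems use far more sophisticated control than this sketch, but the principle is the same: spend cooling capacity only where the sensors say it is needed.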

Energy-Efficient Hardware

Opt for energy-efficient server hardware designed for high-density environments. AI-optimized processors often come with advanced power management features that can significantly reduce energy consumption. Choosing hardware that aligns with your data center's efficiency goals is a key factor in managing power and cooling requirements effectively.

Monitoring and Analytics

Implement comprehensive monitoring and analytics tools to gain insights into the performance of your AI data center. Real-time data on temperature, power consumption, and system health can help identify potential issues before they escalate. Proactive monitoring allows for predictive maintenance and ensures optimal conditions for your high-density racks.
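A simple version of this idea is a watcher that raises a warning on a sustained upward temperature trend before the hard alarm limit is ever crossed. The thresholds and readings below are illustrative assumptions:

    from statistics import mean

    ALARM_C = 32.0      # assumed hard alarm limit for inlet temperature
    WARN_SLOPE_C = 0.5  # assumed warning level: average rise per reading

    def check_inlet(readings):
        if readings[-1] >= ALARM_C:
            return "ALARM: inlet temperature over limit"
        deltas = [b - a for a, b in zip(readings, readings[1:])]
        if mean(deltas) >= WARN_SLOPE_C:
            return "WARN: sustained upward trend, inspect before it escalates"
        return "OK"

    print(check_inlet([27.0, 27.6, 28.3, 29.1]))  # trending up -> WARN

Real monitoring platforms layer analytics and machine learning on top of this, but the payoff is the same: catching the trend buys you a maintenance window instead of an outage.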

Successfully cooling and powering a 50kW rack density AI data center requires a holistic and forward-thinking approach. By investing in precision cooling, liquid cooling technologies, high-efficiency power distribution, redundancy, dynamic thermal management, energy-efficient hardware, and robust monitoring tools, you can create a resilient and high-performing infrastructure. Embrace the technological advancements available in the market to not only meet the challenges posed by high-density AI computing, but to excel in this dynamic and transformative era of data center management.

Author's Note:

Not a bad blog post, right? I was tasked with writing a blog post on how to power and cool high-density racks for AI applications. So, I had ChatGPT write my blog post in 15 seconds, saving me a ton of time and allowing me to enjoy watching my kid’s athletic events this weekend. As end users embrace AI technology, it is imperative that we understand how to support the hardware and software that enables us to achieve these time-saving technologies. Over the past 6 months, about 20% of my time has been spent discussing how to support customers’ 35kW to 75kW rack densities.

Additionally, another key to understand is the balance between AI and the end user’s ability to recognize its limitations and areas for improvement. AI taps into the database of information that is the Internet. Powerful, but it does so (at least currently) in a fashion that makes it appear to be two years behind. For example, this blog post was written to reflect a 35kW rack density, and accordingly ChatGPT cited 35kW. Today, however, I’m regularly working with racks supporting AI that average 50kW, have seen them go up to 75kW, and know that applications can hit upwards of 300kW per rack. So, please note: anywhere the blog says 50kW, human intervention made the necessary edits to AI’s outdated “35kW”.

Also, just for reference, a 75kW application requires 21 tons of cooling for one IT rack! So, these new high-density technologies require the equivalent of one traditional perimeter CRAC to cool one AI IT Rack. DVL is here to help provide engineering and manufacturing support to design your Cooling, PLC Switchgear, Busway Distribution, Rack Power Distribution, Cloud Monitoring, and other critical infrastructure to support your efficient AI Technology.
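The 21-ton figure is easy to verify: one ton of refrigeration removes about 3.517 kW of heat, so dividing rack power by that constant gives the cooling load. A quick sketch:

    # 1 ton of refrigeration removes roughly 3.517 kW of heat.
    KW_PER_TON = 3.517

    for rack_kw in (35, 50, 75):
        tons = rack_kw / KW_PER_TON
        print(f"{rack_kw} kW rack -> ~{tons:.1f} tons of cooling")

At 75 kW per rack that works out to about 21.3 tons, matching the figure above.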


Topics: Data Center, Thermal Management, Data Center efficiency, beyond the product, artificial intelligence

The Sustainability and Efficiency of Our Data Centers

by Jodi Holland on 8/30/21 11:32 AM

As Dave Rubcich (Vertiv’s VP, Key Accounts- Multi-Tenant) puts it, “you can’t be sustainable without being efficient,” and “if you’re going to have a sustainable data center you’re certainly going to be efficient—but you can be efficient without being sustainable.” He cautions they are two different terms not to be confused with one another.


Data center energy efficiency delivers a range of benefits. When infrastructure and equipment work more efficiently, one of the most favorable results is that operating costs go down. Fewer repairs are needed, and less equipment too, which results in more open space in your data center. Last, but perhaps most important, the less energy you use, the smaller your impact on the environment and its natural resources. That’s where sustainability comes into the picture.

Sustainability is becoming a priority for more and more companies, but it can mean different things depending on how you look at the issue. Overall, we are trying to sustain the planet’s levels of natural resources so as not to contribute to global warming, or even, by some miracle, make a dent in efforts to reduce it. To work toward this, data centers are striving to have no impact on the planet at all.

It is the dream of an ideal scenario. The way Vertiv sees it, sustainability means zero losses, zero carbon, zero water, and zero waste. “We’re nowhere near there today,” Rubcich admits, “but if we don’t start thinking about it, we can never get there.” So, is it plausible to truly use no natural resources? Not today. It remains the long-term goal, reachable only once real efforts have been made to chip away at the issue. Rubcich adds, “If you’re going to be carbon neutral or carbon negative, you’re not going to be using generators that are running on diesel fuels.” Alternative energy sources will be a must going into the future.

Elsewhere, in the case of cooling equipment that relies on water, where the equipment’s WUE (water usage effectiveness) is measured, there has been considerable movement away from technologies that consume a large supply of water. Total water usage is becoming a leading factor in companies’ decision-making criteria for new equipment.
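WUE, as defined by The Green Grid, is simply annual site water usage in liters divided by annual IT energy in kWh, so lower is better and water-free cooling scores zero. A quick sketch with assumed illustrative figures (not measurements from any product):

    annual_water_liters = 150_000_000  # assumed annual site water use
    annual_it_energy_kwh = 90_000_000  # assumed annual IT energy

    wue = annual_water_liters / annual_it_energy_kwh
    print(f"WUE = {wue:.2f} L/kWh")  # here ~1.67; 0 for water-free cooling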

So, how else can end users start to include sustainability strategies in the present-day operation of their data centers? Rubcich notes that a number of products already available on the market today can help improve the overall efficiency of the data center and drive some of these sustainability goals; for example, pumped refrigerant used as an economizer, as in the Vertiv DX system, which doesn’t use any water.

Vertiv, along with many other companies, is ramping up its efforts to innovate with all types of technologies. Companies like Microsoft, Google, and Amazon are able to make commitments to sustainability milestones well into the future. For example, Microsoft is committed to using 100% renewables by 2025 and to being carbon-negative by 2030. While some companies’ sustainability goals may seem like far-off pipedreams, they are on the right path, as they have brought on C-suite-level sustainability officers to create and implement strategies to attain these results. As Rubcich points out, “when you’re hiring [someone to focus on sustainability] at that level, you’re committed to it.” And it is that commitment that will make it a reality.

To explore more of this subject with Dave Rubcich, we invite you to listen to our recent Podcast, The Cooler Side of Data Center Sustainability.



Topics: efficient data center, Thermal Management, sustainability

To Replace or Not to Replace?

by Jodi Holland on 1/7/19 10:49 AM

Another year has come and gone. For some of you, this means your data center cooling equipment is another year older, and it may not be running as efficiently as you'd like.

So, the question becomes: to replace or not to replace? If you choose to replace the old equipment with the latest and greatest Vertiv has to offer, a whole lot of comfort and reliability will come with your purchase. If you’re on a much tighter budget, however, retrofitting your current equipment could be a viable option that helps you save energy today while allowing the equipment to last a few more years before you take the big leap and buy all new. Possible upgrades to consider include Liebert® iCOM™ controls and Liebert® EC fans, while a full replacement might be a Liebert® DSE™.

 


Before you make up your mind, though, it’s best to consider your different options and how they would affect your annual energy costs in the coming years. Would you be saving big with one option over the other? And, if so, how long will it take for these energy savings to cover the cost of the project? For a breakdown of these possible costs (and more), take a look at the two options that Vertiv presents in a retrofit payback scenario.
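The core arithmetic is a simple payback calculation: project cost divided by annual energy savings gives the years needed to break even. The dollar figures below are placeholder assumptions; substitute the numbers from your own quote:

    # Simple payback comparison; all figures are placeholder assumptions.
    options = {
        "retrofit (iCOM + EC fans)": {"cost": 40_000, "annual_savings": 12_000},
        "replace (new DSE unit)":    {"cost": 150_000, "annual_savings": 30_000},
    }

    for name, o in options.items():
        payback_years = o["cost"] / o["annual_savings"]
        print(f"{name}: payback in ~{payback_years:.1f} years")

A retrofit often pays back sooner on a smaller outlay, while a replacement buys larger absolute savings over a longer horizon; which wins depends entirely on your numbers.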

Need help with your decision? Connect with your DVL data center engineer.

 


Topics: Thermal Management, cooling, data center temperature control, data center cooling, vertiv, retrofit

Vertiv Expands Thermal Management Portfolio With Acquisition of Energy Labs

by Vertiv on 1/18/18 8:36 AM


COLUMBUS, Ohio -- Vertiv, formerly Emerson Network Power, has acquired Energy Labs, a privately owned, U.S.-based manufacturer of custom air handling systems. The acquisition strengthens Vertiv’s leading position in the data center thermal management space and enables expansion into commercial and industrial segments with industry-leading cooling solutions.

Read full press release here.


Topics: Thermal Management, vertiv, energy labs

Finding the right architecture for power protection in hospitals

by Emerson Network Power on 3/30/16 8:42 AM


If you’ve read the post about distributed and centralized bypass architectures, you’re probably evaluating the right architecture for a new data center, or maybe you’re re-designing the one you’re currently using. The decision is not easy, and it will often impact the operation and performance of the power protection system in your data center or the connected loads. Unfortunately, in technology there is rarely a simple “yes or no” or “black or white” answer, and this holds true for power distribution as well. Instead, in technology and science there’s a “grey area,” in which the right decision is strongly influenced by the specific context and case, and is dependent on many parameters. Luckily, there are paths to find the best solution as a trade-off among the multiple parameters involved.

If you’re considering the use of an Uninterruptible Power Supply (UPS), it means you are worried about the possibility of utility power failures and the downtime problems that follow. Given this, the selection of the appropriate configuration or architecture for power distribution is one of the first topics of discussion, and the choice among centralized, parallel, distributed, redundant, hot-standby and other available configurations becomes an important part of it. While there are numerous architectures to choose from, there are also several internal variables that will require your attention. Fortunately, a few elementary decisions will make the selection easier. Even if not all parameters can be matched, it’s important to at least begin the conversation and explore trade-offs and other considerations. Without trying to be exhaustive (which would require a dedicated white paper), you should consider at least the following:

a) Cost: more complex architectures will increase both your initial investment and your ongoing costs, not only at the initial design stage but throughout the entire life of your power system, especially with regard to efficiency. In other words, complex architectures will increase your TCO.

b) Availability and reliability: how reliable should your power system be? And what about single or multiple points of failure? Would you need any type of redundancy?

c) Plans for growth: Do you expect your power demand or capacity to increase in the future? Will you re-configure your load distribution?

d) Modularity: related to the previous point, but highlighted separately because of its importance for UPS. Do you need a modular solution for future expansion or redundancy?

e) Bypass architecture: an important point, as explained in a separate post.

f) Need for monitoring of the complete UPS power system, also considering any shutdown of loads, and in combination with other systems like thermal management.

g) Service and maintenance: once the initial investment in power protection has been made, please do not forget to keep it in optimum condition. Regular maintenance should be secured through service contracts; also consider spares availability if multiple types of UPS are used, the capability to isolate a subset, and remote diagnostic and preventive monitoring services such as Emerson Network Power’s LIFE for maximum availability.

h) Profile of the loads: especially whether you have a few large loads or many “small” loads (perhaps distributed across several buildings or a wide area such as a wind farm), the autonomy required for each load, peak power demands, etc.

In addition, the decision is not only related to the internal requirements of the power system; it is also linked to the type of load or application to be protected, as requirements may vary depending on whether the application is industrial, education, government, banking, healthcare, or data center. For example, an application where the loads are servers managing printers in a bank is by no means the same as a hospital where the power protection systems may support several surgery rooms. In the worst case, the bank printers can simply be shut down, while shutting down a surgery room is not an option outside of scheduled maintenance: a non-scheduled shutdown of the medical equipment would have a serious impact on the people in that room during a surgical operation.

Let’s take the hospital example further and consider a particular case. In order to do a quick exercise and simplify, we can use a scenario with several surgery rooms as a reference (for example 5 to 20 rooms, each one with a 5-10 kVA UPS for individual protection), plus a small data center (for example with 30 kVA power consumption) and finally, other critical installations in the facility (let’s assume 300 kVA for offices, laboratories, elevators, etc.).

In this scenario, initially, the architectures that could be envisaged as a first step are:

1. Fully distributed: for simplicity’s sake, a hospital with 10 surgery rooms is assumed here, with a 10 kVA UPS for each surgery room plus a centralized UPS (>330 kVA) for the remaining loads.

2. A fully redundant solution based on a centralized UPS protecting all the loads (this UPS being in a parallel redundant configuration). The power rating for each of these UPS units would be 300 kVA + 30 kVA + (10 x 10 kVA) = 430 kVA.

3. An intermediate solution, referred to as “redundant hot standby”, so that this redundant UPS is sized only for the surgery rooms (10 surgery rooms x 10 kVA), and with a bypass line connected to the large centralized UPS (>430 kVA). This solution shows the advantage of a smaller capacity required for this redundant hot standby UPS.
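To make the sizing concrete, here is a small sketch that recomputes the UPS capacities for the three architectures from the scenario’s stated loads (the doubling in option 2 reflects the parallel redundant pair):

    # Scenario loads from the hospital example above.
    ROOMS, ROOM_KVA = 10, 10
    DATA_CENTER_KVA = 30
    OTHER_KVA = 300

    surgery_kva = ROOMS * ROOM_KVA                  # 100 kVA
    central_rest_kva = DATA_CENTER_KVA + OTHER_KVA  # 330 kVA
    total_kva = surgery_kva + central_rest_kva      # 430 kVA

    print(f"1) distributed: {ROOMS} x {ROOM_KVA} kVA + central >{central_rest_kva} kVA")
    print(f"2) centralized parallel redundant: 2 x {total_kva} kVA")
    print(f"3) hot standby: {surgery_kva} kVA standby + central >{total_kva} kVA")

Option 3’s appeal is visible immediately: the redundant unit only has to carry 100 kVA instead of 430 kVA.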

Emerson Network Power has run several simulations based on typical scenarios like the hospital described above, considering optimization factors a), b), e) and h). Weighing the energy savings (power consumption and heat dissipation), initial investment (CAPEX) and maintenance costs (OPEX), the solution based on the “redundant hot standby” appears to be the most convenient.

Moreover, the advantage of architecture 3 over architecture 1 grows with the number of surgery rooms and with the period covered by the cost simulation (from 1 year up to 10 years).

This points us in the right direction for selecting the best distribution architecture for this application in hospitals using these parameters for optimization. The analysis can clearly be enriched with the other parameters listed above, or adapted to the particular case (quantity of surgery rooms, autonomy for each load, power demanded by the data center room, reliability, …), which could lead to a different choice, but overall this redundant hot standby has proven a good trade-off.

As said at the beginning, there is no magic solution for the optimum selection, but we have sought to explore several guidelines and checkpoints that will help drive you toward the best solution for your case. Of course, any additional variables and the reader’s experience are welcome and can only serve to enrich the discussion.

Read More

Topics: Data Center, PUE, energy, UPS, Efficiency, Thermal Management, DCIM, Uptime, sustainability, energy efficiency, preventative maintenance, power system, healthcare, hospitals
