NVIDIA Drives Liquid Cooling Adoption to Over 20% by 2025

Oct 30, 2024


 

The penetration rate of liquid cooling solutions is set to roughly double, from around 10% in 2024 to over 20% by 2025. According to the latest TrendForce survey, NVIDIA's Blackwell platform is expected to begin shipping in the fourth quarter of 2024, which will boost the adoption of liquid cooling solutions. Growing global awareness of ESG, coupled with cloud service providers (CSPs) accelerating their deployment of AI servers, is driving a shift from air cooling to liquid cooling.

 

NVIDIA's Blackwell platform

 

In the global AI server market, NVIDIA remains the dominant supplier this year. In the GPU-based AI server segment, it holds a commanding lead with a market share nearing 90%, while AMD trails at about 8%. TrendForce notes that Blackwell shipments are currently small because the supply chain is still in testing, but the new platform's high power consumption, particularly in the GB200 rack-mounted solution, demands more efficient cooling and is likely to increase liquid cooling adoption. However, liquid cooling still accounts for only a small share of the existing server ecosystem, so ODMs face a learning curve in dealing with leakage and cooling efficiency problems. TrendForce anticipates that by 2025, over 80% of GPUs on the Blackwell platform will be high-end models, prompting power supply and cooling companies to compete in the AI liquid cooling market and reshaping industry dynamics.

 

 

I Google Aggressively Expands Liquid Cooling Solutions

 

In recent years, major U.S. cloud companies like Google, AWS, and Microsoft have rapidly built AI servers primarily powered by NVIDIA GPUs and proprietary ASICs. TrendForce reports that NVIDIA's GB200 NVL72 cabinet has a thermal design power (TDP) of approximately 140 kW, necessitating a liquid cooling solution, predominantly liquid-to-air (L2A). Other architectures, such as HGX and MGX Blackwell servers, primarily use air cooling due to lower density.
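To see why a rack at this power level pushes past air cooling, a rough heat-balance estimate helps. The sketch below applies the steady-state relation P = m_dot * c_p * delta_T to the 140 kW figure above. The temperature rises (10 K for water, 15 K for air) and the fluid properties (plain water, air near room temperature) are illustrative assumptions, not values from TrendForce or NVIDIA, and the function name is hypothetical.

```python
# Rough steady-state heat balance: P = m_dot * c_p * delta_T.
# All fluid properties and temperature rises are illustrative assumptions.

RACK_POWER_W = 140_000  # approximate GB200 NVL72 TDP cited in the article


def volumetric_flow_m3_per_s(power_w, density_kg_m3, cp_j_per_kg_k, delta_t_k):
    """Volume flow needed to carry power_w away with a delta_t_k temperature rise."""
    mass_flow_kg_s = power_w / (cp_j_per_kg_k * delta_t_k)
    return mass_flow_kg_s / density_kg_m3


# Assumed: plain water, 10 K rise across the rack.
water_flow = volumetric_flow_m3_per_s(RACK_POWER_W, 1000.0, 4186.0, 10.0)
# Assumed: air at about 1.2 kg/m^3, 15 K rise across the rack.
air_flow = volumetric_flow_m3_per_s(RACK_POWER_W, 1.2, 1005.0, 15.0)

print(f"water: {water_flow * 60_000:.0f} L/min")
print(f"air:   {air_flow:.2f} m^3/s")
print(f"air needs about {air_flow / water_flow:.0f}x the volume flow of water")
```

Under these assumptions, the rack needs roughly 200 L/min of water but nearly 8 cubic meters of air per second, a gap of more than three orders of magnitude in volume flow. Real deployments use different coolants and temperature rises, so the numbers should be read as an order-of-magnitude illustration only.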

 

Among cloud companies developing their own AI ASICs, Google has adopted both air and liquid cooling for its TPU, making it a leader in liquid cooling among U.S. enterprises. Boyd and Cooler Master are its key cold plate suppliers. In China, Alibaba is the most aggressive in expanding liquid-cooled data centers, while other cloud companies continue to favor air cooling for their AI ASICs.

 

TrendForce indicates that cloud companies will specify key component suppliers for the GB200 cabinet's liquid cooling solution. The primary suppliers of cold plates are Qihong (AVC) and Cooler Master, manifolds come from Cooler Master and Shuanghong (Auras), and coolant distribution units (CDUs) are supplied by Vertiv and Delta. Procurement of critical leak-proof components, such as quick disconnects (QDs), remains dominated by international manufacturers like CPC, Parker Hannifin, Danfoss, and Staubli.

 


▲ AI Server Key Component suppliers for Liquid Cooling Solutions

 

 

II How to Address AI Chip Overheating? Explore 3 Server Cooling Methods

 

Before delving deeper into the cooling competition, it's essential to understand the primary cooling methods, which can be categorized into three types: air cooling, liquid cooling, and immersion cooling.

 

1. Air Cooling: Still in High Demand

Air cooling is the most widely used cooling method in data centers and enterprise server rooms. It removes heat from servers by moving cool air across them with fans, heat sinks, and heat pipes. Achieving the best cooling performance requires advanced air cooling technology, such as 3D vapor chambers (3D VC) combined with heat pipes and numerous fans. However, while higher airflow volume and speed improve heat convection, the resulting noise and vibration can harm the server environment. According to Wu Junying, Deputy General Manager, air cooling still enjoys significant market demand because H100 chips can be adequately cooled with air. With the shipment of GB series chips, however, the pace of liquid cooling adoption will accelerate.

 

2. Liquid Cooling: The Major Market Pursued by All Vendors

Liquid cooling, also known as direct liquid cooling (DLC), can be further divided into liquid-to-air and liquid-to-liquid.

Liquid-to-Air: This method uses water cooling pipes to carry heat away from the chips, with the heated water then sent to fans at the back of the cabinet, which disperse the heat into the air. Liquid-to-air cooling is a response to the physical limits of air cooling in existing data centers, since it requires minimal modification to server room infrastructure: adding a fan-equipped rear door is enough to improve cooling. Currently, about 60-70% of data centers still employ this cooling method. However, while liquid-to-air is a viable solution, it is not optimal. The added fan wall can raise noise levels to 90-100 decibels (for comparison, a busy street is around 80 decibels), making it difficult for staff to work in the room for extended periods.
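The gap between the fan-wall noise and the street reference is larger than the raw numbers suggest, because the decibel scale is logarithmic: every 10 dB corresponds to a tenfold increase in sound intensity. A minimal sketch of that conversion, using the 80 dB street figure and the 90-100 dB fan-wall range from the paragraph above (the function name is illustrative):

```python
def intensity_ratio(level_db, reference_db):
    """How many times more intense a sound at level_db is than one at reference_db."""
    return 10 ** ((level_db - reference_db) / 10)


STREET_DB = 80  # busy street, per the article
for fan_wall_db in (90, 100):  # fan-wall range, per the article
    ratio = intensity_ratio(fan_wall_db, STREET_DB)
    print(f"{fan_wall_db} dB is {ratio:.0f}x the sound intensity of {STREET_DB} dB")
```

In other words, a fan wall at the top of the quoted range carries about 100 times the sound intensity of a busy street. Perceived loudness grows more slowly than intensity, but the difference is still large enough to explain why extended work in such a room is difficult.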

 

Liquid-to-Liquid: This method routes sealed, coolant-filled pipelines around the server's heat-generating components. Heat passes through thermally conductive copper plates into the coolant, which circulates in a loop that continuously exchanges hot liquid for cold. Unlike liquid-to-air, this method does not require fan walls behind the server cabinets, which significantly improves space utilization and reduces noise levels. NVIDIA's high-end GB200 NVL72 uses liquid-to-liquid cooling.

 

3. Immersion Cooling: The Future Cooling Holy Grail?

Immersion cooling involves submerging entire servers in a non-conductive liquid, much like placing them in a bath, which cools not just the AI chips but also CPUs, memory, and other electronic components within the servers. However, environmental concerns about immersion liquids, their long-term effects on electronic components, and ongoing maintenance all present significant challenges. Data centers considering immersion solutions must also evaluate the load-bearing capacity of building floors and the underlying electrical and water infrastructure. Implementing immersion cooling therefore requires extensive facility redesign, resulting in substantial costs.

 

 

 
