The Rise of Liquid Cooling Solutions in AI Servers: A TrendForce Analysis

Oct 03, 2024


 

Ⅰ Introduction

 

According to the latest survey by TrendForce, the penetration rate of liquid cooling solutions is expected to climb sharply once the NVIDIA Blackwell platform begins shipping in the fourth quarter of 2024, rising from around 10% in 2024 to over 20% in 2025. Growing global awareness of Environmental, Social, and Governance (ESG) factors, together with the accelerated build-out of AI servers by Cloud Service Providers (CSPs), is driving a significant shift from traditional air cooling to liquid cooling.

 

Liquid cooling offers several advantages over air cooling, including improved thermal management, reduced noise levels, and increased energy efficiency. As the demand for AI capabilities rises, particularly in data-intensive applications, the transition to liquid cooling systems becomes increasingly critical for maintaining optimal server performance.

 


▲ Liquid cooling system used in AI server environments

 

 

Ⅱ NVIDIA's Dominance in the AI Server Market

 

In the global AI server market, NVIDIA continues to reign supreme, holding a market share close to 90% in the GPU AI server segment as of 2024. AMD trails significantly with approximately 8% market share. This dominance is largely due to NVIDIA's cutting-edge technology and its robust ecosystem, which supports a wide range of applications in AI, machine learning, and data analysis.

 

TrendForce notes that NVIDIA Blackwell shipments in 2024 are relatively small, primarily because the platform is still undergoing final testing and validation within the supply chain, with ongoing optimization needed in areas such as high-speed data transmission and cooling design. The Blackwell platform's higher power consumption, particularly in the GB200 rack solution, demands more efficient cooling, which further accelerates the adoption of liquid cooling solutions.

 


▲ NVIDIA Blackwell platform for AI servers

 

Despite its advantages, liquid cooling still has a low adoption rate in the current server ecosystem. Original Design Manufacturers (ODMs) face a learning curve in addressing leakage and cooling efficiency. With the Blackwell platform expected to account for more than 80% of high-end GPUs in 2025, power supply manufacturers and cooling vendors are expected to compete for position in the emerging AI liquid cooling market, creating a new competitive landscape.

 

 

 

Ⅲ Accelerated Deployment by Large CSPs

 

Major cloud service providers, including Google, AWS, and Microsoft, have rapidly accelerated their AI server deployments in recent years, predominantly utilizing NVIDIA GPUs and self-developed ASICs. The thermal design power (TDP) of NVIDIA's GB200 NVL72 cabinet is approximately 140 kW, underscoring the urgent need for liquid cooling solutions to effectively manage heat dissipation. Liquid-to-Air (L2A) cooling methods are anticipated to become the mainstream approach in this context.
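To give a sense of scale for the roughly 140 kW figure above, the following is a minimal sketch of the standard heat balance Q = m·c_p·ΔT, used here to estimate how much coolant flow a rack of that power would need. It rests on assumptions that are not from TrendForce: the coolant is treated as plain water (real loops often use a water-glycol mix with somewhat lower heat capacity), all of the heat is assumed to go into the liquid loop, and the 10 K temperature rise is an illustrative value. The function name and parameters are hypothetical.

```python
def coolant_flow_lpm(heat_kw, delta_t_k, cp_j_per_kg_k=4186.0, density_kg_per_l=1.0):
    """Estimate the coolant flow (litres per minute) needed to carry away heat_kw
    when the coolant warms by delta_t_k between inlet and outlet.

    Defaults are approximate properties of plain water (an assumption).
    """
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    # Heat balance: Q [W] = mass_flow [kg/s] * c_p [J/(kg*K)] * delta_T [K]
    mass_flow_kg_s = heat_kw * 1000.0 / (cp_j_per_kg_k * delta_t_k)
    # Convert mass flow to volumetric flow, then seconds to minutes.
    return mass_flow_kg_s / density_kg_per_l * 60.0


# Illustrative case: 140 kW rack, assumed 10 K coolant temperature rise.
flow = coolant_flow_lpm(heat_kw=140.0, delta_t_k=10.0)
print(f"Required flow: {flow:.1f} L/min")
```

Under these assumptions the estimate comes out at roughly 200 L/min for a single rack. Halving the allowed temperature rise doubles the required flow, which is one reason pump and CDU sizing matters at this power density.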

 


▲ AI server deployment in data centers

 

While NVIDIA's GPUs dominate, Google has also been proactive in adopting liquid cooling for its Tensor Processing Units (TPUs), which makes it the most proactive U.S. company in liquid cooling adoption. BOYD and Cooler Master are the primary suppliers of Google's cold plates, which are critical for maintaining optimal temperatures in high-performance computing environments.

 

In mainland China, Alibaba is aggressively expanding its liquid cooling data centers, further emphasizing the global shift towards this advanced cooling technology. Other cloud service providers primarily continue to rely on air cooling solutions for their self-developed AI ASICs, which may hinder performance compared to liquid-cooled systems.

 

 

Ⅳ Key Suppliers and Component Designation

 

As the shift to liquid cooling gains momentum, cloud service providers are designating key component suppliers for the GB200 cabinet's liquid cooling solutions. Currently, Qihong (AVC) and Cooler Master are the leading suppliers of cold plates, while manifolds are sourced from Cooler Master and Shuanghong (Auras). Coolant distribution units (CDUs) are supplied by Vertiv and Delta.

 

For quick disconnects (QDs), which are critical for preventing leaks, procurement currently centers on manufacturers such as CPC, Parker Hannifin, Danfoss, and Staubli. Additional suppliers such as Jiazhe and Fushida have entered the validation stage and may begin supplying quick disconnects in the first half of 2025, which would help ease the current supply-demand imbalance.

 


▲ Key suppliers for liquid cooling components

 

 

Ⅴ The Advantages of Liquid Cooling

 

Enhanced Cooling Efficiency

Liquid cooling systems are designed to remove heat more efficiently than air cooling systems. This is particularly important in AI applications where processors can generate significant heat due to high computational demands. By utilizing liquid to absorb and transfer heat away from components, servers can operate at lower temperatures, reducing the risk of thermal throttling and improving performance.
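The efficiency claim above can be made concrete by comparing how much heat a given volume of each fluid can absorb per degree of temperature rise. The sketch below uses approximate textbook properties for water and for air at room conditions; these values and the helper name are illustrative assumptions, and real coolants and operating temperatures will shift the numbers somewhat.

```python
# Approximate fluid properties near room temperature (illustrative assumptions).
WATER = {"density_kg_m3": 997.0, "cp_j_kg_k": 4186.0}
AIR = {"density_kg_m3": 1.2, "cp_j_kg_k": 1005.0}


def volumetric_heat_capacity(fluid):
    """Heat absorbed per cubic metre per kelvin of temperature rise, in J/(m^3*K)."""
    return fluid["density_kg_m3"] * fluid["cp_j_kg_k"]


ratio = volumetric_heat_capacity(WATER) / volumetric_heat_capacity(AIR)
print(f"Water absorbs about {ratio:.0f}x more heat per unit volume than air")
```

With these values the ratio is on the order of a few thousand, which is why a modest liquid flow through a cold plate can replace a very large volume of moving air.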

 

Space Optimization

Liquid cooling systems often occupy less space than traditional air cooling solutions, allowing for more efficient data center design. This space-saving attribute is particularly beneficial for organizations looking to maximize their server capacity without expanding their physical footprint.

 

Energy Efficiency

With increasing focus on sustainability and energy efficiency, liquid cooling solutions can help reduce the overall energy consumption of data centers. By minimizing the reliance on fans and air conditioning units, liquid cooling can lower energy costs and carbon footprints, aligning with the ESG goals many companies are striving to achieve.

 

Noise Reduction

Liquid cooling systems operate more quietly than traditional air-cooled systems, leading to a more pleasant working environment in data centers. This noise reduction is an important consideration for facilities located near populated areas or within office buildings.

 

 

Ⅵ Addressing Common Challenges

 

Despite the benefits, the transition to liquid cooling does come with challenges. These include:

 

Initial Costs

The upfront investment for liquid cooling systems can be higher than traditional air cooling solutions. Organizations must weigh these costs against the long-term benefits and savings in energy efficiency and maintenance.

 

Leakage Concerns

One of the most significant challenges associated with liquid cooling is the risk of leaks. Proper design, materials selection, and maintenance protocols are essential to mitigate this risk and ensure system reliability.

 

Maintenance Complexity

Liquid cooling systems require more complex maintenance compared to air-cooled systems. Organizations must train their staff or engage specialized service providers to ensure that liquid cooling solutions remain effective and free of issues.

 

 

Ⅶ Future Trends in Liquid Cooling

 

As the demand for AI computing continues to rise, several future trends are expected to shape the liquid cooling landscape:

 

Adoption of Hybrid Solutions

Hybrid cooling systems that combine both air and liquid cooling technologies are likely to gain traction. These systems can provide flexibility and efficiency, adapting to different workloads and operational needs.

 

Advanced Materials

The development of advanced materials for liquid cooling components can enhance performance and reliability. Innovations in materials science may lead to lighter, more durable, and more effective cooling solutions.

 

Integration with AI and IoT

The integration of AI and IoT technologies into cooling systems can optimize performance by enabling real-time monitoring and automated adjustments based on environmental conditions and server workloads.
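As a rough illustration of what "automated adjustments" could look like, here is a minimal sketch of a proportional control rule that raises coolant pump speed as the measured outlet temperature rises above a target. Everything in it is hypothetical: the function name, the 45 °C target, the gain, and the speed limits are placeholders, not values from TrendForce or from any vendor's controller, and a production system would likely use more sophisticated control plus safety interlocks.

```python
def pump_speed_percent(outlet_temp_c, target_temp_c=45.0, gain=5.0,
                       min_speed=30.0, max_speed=100.0):
    """Return a pump speed (percent) from a simple proportional rule.

    All default values are hypothetical placeholders for illustration.
    """
    # Positive error means the coolant is leaving hotter than desired.
    error = outlet_temp_c - target_temp_c
    speed = min_speed + gain * error
    # Clamp to the pump's allowed operating range.
    return max(min_speed, min(max_speed, speed))


# Simulated sensor readings (degrees Celsius) standing in for live telemetry.
for reading in (40.0, 45.0, 50.0, 70.0):
    print(reading, pump_speed_percent(reading))
```

The clamp keeps the pump within a safe range regardless of sensor spikes, and the single gain parameter makes the trade-off between responsiveness and stability explicit.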

 

Sustainability Focus

As companies increasingly prioritize sustainability, liquid cooling solutions using eco-friendly coolants and materials will become more prevalent. The industry may see a shift towards closed-loop systems to minimize waste and environmental impact.

 


▲ Future trends in liquid cooling technology

 

 

 

Ⅷ Conclusion

 

The transition from air to liquid cooling solutions in AI servers represents a significant evolution in the industry, driven by advancements in technology and the increasing demand for efficient thermal management. With NVIDIA leading the charge in the AI server market and large cloud service providers like Google actively exploring liquid cooling options, the landscape is rapidly changing.

 

By understanding the benefits and challenges of liquid cooling, as well as the key players involved in its implementation, organizations can better position themselves to leverage this technology for improved performance, sustainability, and competitiveness in the AI landscape. As the industry moves forward, liquid cooling solutions will play a crucial role in shaping the future of data centers, ensuring they can meet the demands of next-generation AI applications.

 

 

 

 
