Analysts - Perkins Liu

Introduction

The Open Compute Project Foundation held its 2024 Global Summit from October 15-17 in San Jose, California. The summit set new records for the third consecutive year, with 7,047 people attending, a 60% increase over 2023 and the biggest turnout yet. Artificial intelligence further fortified its position as the biggest application on the Open Compute Project platform, demonstrated by the dominance of the NVIDIA GB200 NVL72 architecture and the infrastructure solutions built around it.
The Take
NVIDIA Corp.'s GB200 NVL72 architecture emerged as the standout highlight at the 2024 Open Compute Project (OCP) Global Summit, capturing significant attention from attendees. Server vendors prominently featured NVL72 architecture rack solutions at the center of their booths, while infrastructure manufacturers showcased complementary offerings throughout the expo floor. This convergence of focus reflects an unprecedented collaborative effort within the industry to address the increasing rack density challenge driven by AI advances, underscoring high expectations for the GB200 NVL72. As excitement builds around this innovative architecture, the sustainability of the frenzy remains uncertain; only time and the market will tell.
Context

The OCP Foundation was initiated in 2011 by Facebook (now Meta Platforms Inc.) with the mission of using open source and open collaboration to speed up and foster hardware innovation, from the core of computing (servers, storage and networking equipment) to the supporting racks and the entire datacenter infrastructure. The inspiration came from Facebook's design and construction of its hyperscale datacenter in Prineville, Oregon, where a small group of engineers gathered in 2009 and spent the next two years designing and building the datacenter from the ground up: software, servers, racks, power supplies and cooling.
The datacenter was 38% more energy efficient and 24% less expensive to run than the company's previous facilities. Although it started with hyperscale datacenters, OCP has expanded, bringing the collaboration model to non-hyperscale and edge datacenters and further into the telecom industry.
OCP has more than 400 member companies and roughly 7,000 people actively participating in its discussions, which are open to all. The OCP marketplace lists more than 270 products and more than 400 approved member contributions, including specifications, designs and documents (best-practice recommendations, reference architectures, etc.), which numbered roughly 300 as the Summit opened.
The Summit featured 23 dynamic content tracks (19 project-focused and four special-focus), with more than 610 speakers and more than 425 sessions. The Innovation Village hosted eight OCP project-related stations and seven emerging-technology demonstrations. OCP and its partners held three colocated events: the Software for Open Networking in the Cloud (SONiC) Workshop, the Memory Fabric Forum and the DMTF Manageability Workshop. The event attracted 121 sponsors, whose 100 booths filled two halls at the San Jose Convention Center.
The foundation itself has 15 full-time staff, three more than a year ago, with more than 250 volunteers taking leadership roles across the many OCP projects, including server, networking, storage, rack and cooling in the datacenter environment and beyond, as well as regional communities.
Major announcements

As usual, a number of announcements were made at the Summit. None, however, was more significant than NVIDIA's contribution to the OCP of foundational elements of its Blackwell accelerated computing platform design. This includes sharing critical aspects of the NVIDIA GB200 NVL72 system's electro-mechanical design at the OCP Global Summit. The contributions encompass specifications for rack architecture, compute and switch tray mechanics, liquid cooling, thermal environments and NVIDIA NVLink cable cartridge volumetrics. These technical details aim to enhance compute density and networking bandwidth in datacenters. The system is engineered to deliver impressive computational power, boasting 720 petaflops for training and 1.4 exaflops for inference tasks. To efficiently manage high-density workloads, the GB200 NVL72 employs a fully liquid-cooled design, with coolant temperatures of 45°C (113°F) at the inlet and 65°C (149°F) at the outlet.
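Those inlet and outlet figures imply a 20 K temperature rise across the rack. As a rough, hedged sketch (assuming a water-like coolant with a specific heat of about 4,186 J/kg·K and a density of about 1 kg/L, and using the roughly 132 kW rack power cited later in this report; real facility coolants and loads will differ), the coolant flow needed to carry that heat can be estimated as follows:

```python
# Back-of-the-envelope coolant flow estimate for a fully liquid-cooled rack.
# Assumptions (not NVIDIA figures): water-like coolant with cp ~ 4186 J/(kg*K)
# and density ~ 1.0 kg/L; the rack heat load and the 45 C -> 65 C window come
# from the report above.

RACK_POWER_W = 132_000          # ~132 kW rack density cited for the GB200 NVL72
T_INLET_C, T_OUTLET_C = 45, 65  # coolant inlet/outlet temperatures
CP_J_PER_KG_K = 4186            # assumed specific heat (water-like coolant)
DENSITY_KG_PER_L = 1.0          # assumed coolant density

delta_t = T_OUTLET_C - T_INLET_C                        # 20 K rise across the rack
mass_flow = RACK_POWER_W / (CP_J_PER_KG_K * delta_t)    # kg/s, from Q = m * cp * dT
volume_flow_lpm = mass_flow / DENSITY_KG_PER_L * 60     # litres per minute

print(f"Required coolant flow: {mass_flow:.2f} kg/s (~{volume_flow_lpm:.0f} L/min)")
# -> roughly 1.6 kg/s, i.e., on the order of 95 L/min for the whole rack
```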
NVIDIA had previously made several contributions to the OCP across hardware generations, including the NVIDIA HGX H100 baseboard design specification. These efforts are intended to give computer manufacturers a wider array of options and to facilitate the broader adoption of AI technologies. In addition, NVIDIA has expanded the alignment of its Spectrum-X Ethernet networking platform with OCP-developed specifications. This alignment allows organizations to optimize the performance of AI infrastructure that uses OCP-recognized equipment, preserving existing investments while maintaining software consistency.
Furthermore, the next-generation NVIDIA ConnectX-8 SuperNIC, part of the Spectrum-X platform, supports the OCP community's Switch Abstraction Interface (SAI) and SONiC standards. This enables adaptive routing and telemetry-based congestion control, enhancing Ethernet performance for large-scale AI infrastructure. The ConnectX-8 SuperNICs, capable of delivering up to 800 Gb/s, will be available next year, equipping organizations to develop highly adaptable networking solutions.
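To make the idea of telemetry-based congestion control concrete, the sketch below shows a generic AIMD-style rate controller that backs off when switch telemetry indicates congestion and ramps up otherwise. This is only an illustration of the general technique, not the Spectrum-X or SONiC implementation, and every field name and constant is a hypothetical placeholder:

```python
# Conceptual sketch of telemetry-driven congestion control (AIMD-style).
# A generic illustration only; NOT the Spectrum-X / ConnectX-8 algorithm.
# The telemetry fields and constants are hypothetical.

from dataclasses import dataclass

LINK_SPEED_GBPS = 800.0   # ConnectX-8 SuperNIC headline rate from the report
MIN_RATE_GBPS = 10.0
ADDITIVE_STEP_GBPS = 5.0  # hypothetical ramp-up step
BACKOFF_FACTOR = 0.5      # hypothetical multiplicative decrease

@dataclass
class Telemetry:
    queue_depth_frac: float   # switch egress queue occupancy, 0.0 - 1.0
    ecn_marked: bool          # whether returning packets carried ECN marks

def next_rate(current_gbps: float, t: Telemetry) -> float:
    """Return the sender's next transmit rate given fresh telemetry."""
    congested = t.ecn_marked or t.queue_depth_frac > 0.8
    if congested:
        return max(MIN_RATE_GBPS, current_gbps * BACKOFF_FACTOR)
    return min(LINK_SPEED_GBPS, current_gbps + ADDITIVE_STEP_GBPS)

# Example: a flow at 400 Gb/s sees a congested queue and halves its rate.
print(next_rate(400.0, Telemetry(queue_depth_frac=0.9, ecn_marked=False)))  # 200.0
```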
Meta is in the process of contributing Catalina, its high-powered rack designed for AI workloads, to the OCP. Built on the NVIDIA Blackwell platform, Catalina emphasizes modularity and flexibility while supporting the NVIDIA GB200 Grace Blackwell Superchip to meet modern AI infrastructure demands. It features the Open Rack v3 (Orv3) high-power rack (HPR), capable of supporting up to 140 kW to address the increasing power needs of GPUs. The liquid-cooled solution includes a power shelf, compute tray, switch tray, Orv3 HPR, Wedge 400 fabric switch, management switch, battery backup unit and rack management controller. Catalina's modular design allows users to customize the rack for specific AI workloads while adhering to both existing and emerging industry standards.
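To illustrate what working inside that 140 kW Orv3 HPR envelope might look like, here is a minimal sketch of a rack power-budget check. The component names follow the Catalina description above, but every wattage is a hypothetical placeholder rather than a Meta or NVIDIA specification:

```python
# Illustrative rack power-budget check against the Orv3 HPR 140 kW limit.
# Component names follow the Catalina description above, but every wattage
# below is a hypothetical placeholder, not a Meta or NVIDIA specification.

ORV3_HPR_LIMIT_KW = 140.0

rack_config_kw = {
    "compute trays (GB200)": 120.0,     # hypothetical aggregate draw
    "switch trays": 8.0,                # hypothetical
    "management switch": 0.5,           # hypothetical
    "rack management controller": 0.2,  # hypothetical
    "battery backup unit": 1.0,         # hypothetical standby overhead
}

total_kw = sum(rack_config_kw.values())
headroom_kw = ORV3_HPR_LIMIT_KW - total_kw

print(f"Total draw: {total_kw:.1f} kW, headroom: {headroom_kw:.1f} kW")
if headroom_kw < 0:
    raise ValueError("Configuration exceeds the Orv3 HPR power envelope")
```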
In 2022, Meta introduced Grand Teton, a next-generation AI platform succeeding the Zion-EX platform, to handle the demands of memory-bandwidth-bound workloads. The company has expanded the Grand Teton platform to support the AMD Instinct MI300X and is also contributing this updated version to the OCP.
Other announcements included a strategic alliance between OCP and Ecma International, a leading global standards-developing organization dedicated to the open standardization of information and communication systems; a strategic alliance between OCP and the Net Zero Innovation Hub for Data Centers; and the opening of the OCP Chiplet Marketplace to help establish an open chiplet economy.
Ecosystem collaboration

The GB200 NVL72 features a rack density of 132 kW and requires direct-to-chip liquid cooling as its standard design, highlighting the rapid increase in rack density driven by AI advances. From 2010 to 2022, datacenter rack density rose from 4 kW to 12 kW, a steady increase over more than a decade. While the ecosystem responded with marginal improvements, the recent surge in AI has caused a dramatic spike in rack density, projected to exceed 130 kW in just two years. This rapid change has left the ecosystem unprepared, underscoring the need for collaboration at every level, from chips to servers, racks and the overall datacenter infrastructure. NVIDIA has evidently put a great deal of effort into this.
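A quick worked comparison, using only the figures above, shows how sharp the break in trend is (the 130 kW endpoint is the projection cited above, not a measurement):

```python
# Implied compound annual growth rate (CAGR) of average rack density,
# using only the figures cited in the report: 4 kW (2010) -> 12 kW (2022),
# versus a projected >130 kW roughly two years after 2022.

cagr_2010_2022 = (12 / 4) ** (1 / 12) - 1    # twelve-year span
cagr_2022_2024 = (130 / 12) ** (1 / 2) - 1   # two-year span (projection)

print(f"2010-2022: ~{cagr_2010_2022:.1%} per year")   # ~9.6% per year
print(f"2022-2024: ~{cagr_2022_2024:.0%} per year")   # ~229% per year
```

That jump of more than an order of magnitude in the annual growth rate is what the collaborative push described below is responding to.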
At the 2023 summit, server vendors universally showcased liquid-cooling-ready racks capable of handling 70 kW to 100 kW densities with direct-to-chip cooling, each with its own design variations. This year, however, on the same floor, vendors such as Super Micro Computer Inc., Asus, Wiwynn Corp., Pegatron Corp., Inventec Corp., Ampere, Aivres, QCT and GIGA-BYTE Technology Co. Ltd. prominently featured GB200 NVL72 solutions at the center of their booths, reflecting strong industry unity. Datacenter infrastructure manufacturers, including Shenzhen Envicool Technology Co. Ltd., Delta Electronics Inc., Rittal, Lite-On Technology Corp., Auras Technology Co. Ltd., CoolIT, JETCOOL Technologies and LiquidStack, also centered their products on the GB200 NVL72 architecture. Vertiv announced its reference design for the GB200 NVL72 platform.
This collective focus indicates high expectations for the GB200 NVL72 and has generated significant market excitement. How long will the frenzy last? Only the market will reveal the outcome.