[Revealed] Over RMB 20 million per rack! A detailed look inside Nvidia's most powerful AI server
可鑒智庫 · 2024-06-03 20:01 · Guangdong
Follow 可鑒智庫 for more server-industry insights.
In March 2024, Nvidia unveiled its most powerful DGX server to date.
The DGX GB200 line reportedly spans three rack-level configurations: DGX NVL72, NVL32, and HGX B200. Of these, the DGX NVL72 is the most expensive and most capable AI system in the series, packing 72 B200 GPUs, 36 Grace CPUs, and nine switch trays. The overall design is dictated by Nvidia and cannot be modified, although ODM partners are free to design their own I/O and Ethernet connectivity.
Volume production of the DGX GB200 is slated to begin in the second half of this year, with output expected to reach as many as 40,000 units in 2025. On pricing, a single NVL72 rack can run as high as $3 million (roughly RMB 21.66 million).
So what are the key details inside this AI server?
▲Nvidia's DGX GB200 NVL72 is a rack-scale system that uses NVLink to mesh 72 Blackwell accelerators into one big GPU
Dubbed the DGX GB200 NVL72, the system is an evolution of the Grace-Hopper Superchip-based rack systems Nvidia showed off back in November, but this one packs more than twice as many GPUs. Nvidia claims the rack-scale system can support large training workloads as well as inference on models of up to 27 trillion parameters.
1. Compute stacks
While the 1.36-metric-ton (3,000 lb) rack system is marketed as one big GPU, it is assembled from 18 1U compute nodes, each equipped with two of Nvidia's 2,700W Grace-Blackwell Superchips (GB200).
▲Two GB200 Superchips, shown without heat spreaders and cold plates, in a 1U liquid-cooled chassis
Each of these massive parts uses Nvidia's 900GBps NVLink-C2C interconnect to mesh a 72-core Grace CPU with a pair of top-specced Blackwell GPUs. In total, each Superchip comes with 864GB of memory (480GB of LPDDR5x and 384GB of HBM3e) and, according to Nvidia, can push 40 petaFLOPS of sparse FP4 compute. That means each compute node is good for 80 petaFLOPS of AI compute, and the entire rack can deliver 1.44 exaFLOPS of super-low-precision floating-point math.
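Those headline figures are straightforward multiplication. As a quick check, here is a minimal Python sketch that uses only the per-Superchip numbers quoted above (the rack-level HBM3e total is derived rather than quoted):

```python
# Back-of-the-envelope check of the NVL72 figures quoted above.
SUPERCHIPS_PER_NODE = 2          # two GB200 Superchips per 1U node
NODES_PER_RACK = 18              # 18 compute nodes per NVL72 rack
GPUS_PER_SUPERCHIP = 2           # one Grace CPU plus two Blackwell GPUs
FP4_PFLOPS_PER_SUPERCHIP = 40    # sparse FP4, per Nvidia
HBM3E_GB_PER_SUPERCHIP = 384     # HBM3e per Superchip

node_pflops = FP4_PFLOPS_PER_SUPERCHIP * SUPERCHIPS_PER_NODE              # 80
rack_pflops = node_pflops * NODES_PER_RACK                                # 1,440
rack_gpus = GPUS_PER_SUPERCHIP * SUPERCHIPS_PER_NODE * NODES_PER_RACK     # 72
rack_hbm_tib = HBM3E_GB_PER_SUPERCHIP * SUPERCHIPS_PER_NODE * NODES_PER_RACK / 1024

print(f"{node_pflops} petaFLOPS per node, {rack_pflops / 1000:.2f} exaFLOPS per rack")
print(f"{rack_gpus} GPUs and {rack_hbm_tib:.1f} TiB of HBM3e per rack")
```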
▲Two B200 GPUs paired with a Grace CPU form the GB200 Superchip, linked by the 900GB/s ultra-low-power NVLink chip-to-chip interconnect
▲Nvidia's Grace-Blackwell Superchip, GB200 for short, combines one CPU with a pair of 1,200W GPUs
At the front of the system are four InfiniBand NICs (note the four QSFP-DD cages on the left and center of the chassis faceplate), which form the compute network. The system is also fitted with a BlueField-3 DPU, which we're told handles communications with the storage network.
In addition to a couple of management ports, the chassis also houses four small-form-factor NVMe storage caddies.
▲Eighteen of these compute nodes, 36 CPUs and 72 GPUs in total, make up the larger "virtual GPU"
▲The NVL72's 18 compute nodes come as standard with four InfiniBand NICs and a BlueField-3 DPU
With two GB200 Superchips and five NICs, we estimate each node draws somewhere between 5.4kW and 5.7kW. The vast majority of that heat will be carried away by direct-to-chip (DTC) liquid cooling. The DGX systems Nvidia showed off didn't have cold plates fitted, but we did get a look at a few prototype systems from partner vendors, such as this one from Lenovo.
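Before moving on to cooling, here is a rough sketch of where that 5.4kW to 5.7kW per-node estimate could come from. Only the 2,700W Superchip figure is quoted above; the peripheral allowances are illustrative assumptions, not published specs:

```python
# Rough per-node power estimate. The 2,700W Superchip figure is quoted in the
# article; the peripheral allowances below are assumptions only.
SUPERCHIP_W = 2_700
SUPERCHIPS_PER_NODE = 2
NIC_W = 25          # assumed draw per NIC/DPU (five in total)
NICS = 5
MISC_W = 175        # assumed budget for NVMe drives, 40mm fans, losses

low_w = SUPERCHIP_W * SUPERCHIPS_PER_NODE       # 5,400 W
high_w = low_w + NIC_W * NICS + MISC_W          # 5,700 W
print(f"Per node: {low_w / 1000:.1f} to {high_w / 1000:.1f} kW")

# Eighteen such nodes account for roughly 97 to 103 kW of the rack's 120 kW
# budget; the remainder goes to NVLink switch sleds, management switches,
# and conversion losses.
print(f"18 nodes: {low_w * 18 / 1000:.1f} to {high_w * 18 / 1000:.1f} kW")
```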
▲While the GB200 systems Nvidia had on display didn't have cold plates installed, this Lenovo prototype shows what one might look like in production
However, unlike the HPC-centric nodes we've seen in HPE's Cray line or Lenovo's Neptune line, which liquid-cool everything, Nvidia has opted to cool low-power peripherals such as the NICs and system storage with conventional 40mm fans.
2. Stitching it all together
In his keynote, Jensen Huang described the NVL72 as one big GPU. That's because all 18 of those super-dense compute nodes are interconnected through a stack of nine NVLink switches sitting in the middle of the rack.
▲A building-block approach to assembly
▲Between the NVL72's compute nodes sits a stack of nine NVLink switches, providing 1.8 TBps of bidirectional bandwidth to each of the system's 72 GPUs
This is the same technology Nvidia's HGX nodes use to make their eight GPUs behave as one. But rather than baking the NVLink switches onto the carrier board, as in the Blackwell HGX shown below, in the NVL72 the switch is a standalone appliance.
▲The NVLink switch has traditionally been integrated into Nvidia's SXM carrier boards, as on the Blackwell HGX board shown here
Inside each of these switch appliances is a pair of Nvidia's NVLink 7.2T ASICs, providing a total of 144 links at 100 GBps apiece. With nine NVLink switches per rack, that works out to 1.8 TBps of bidirectional bandwidth, spread across 18 links, for each of the rack's 72 GPUs.
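The per-GPU bandwidth follows directly from the link counts quoted above; a minimal sketch of the arithmetic:

```python
# NVLink fabric arithmetic for the NVL72, using the figures quoted above.
LINKS_PER_SWITCH = 144     # two NVLink 7.2T ASICs per switch appliance
LINK_GBPS = 100            # bidirectional bandwidth per link
SWITCHES = 9
GPUS = 72

total_links = LINKS_PER_SWITCH * SWITCHES      # 1,296 links in the rack
links_per_gpu = total_links // GPUS            # 18 links per GPU
tbps_per_gpu = links_per_gpu * LINK_GBPS / 1000

print(f"{total_links} links, {links_per_gpu} per GPU, "
      f"{tbps_per_gpu:.1f} TBps bidirectional per GPU")
```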
▲Shown here are the two 5th-gen NVLink ASICs found in each of the switch sleds
Both the NVLink switch sleds and the compute sleds slot into a blind-mate backplane wired with more than 2 miles (3.2 km) of copper cabling. Peering through the back of the rack, you can just make out the huge bundle of cables that meshes the GPUs together so they can operate as one.
▲If you look closely, you can see the massive cable bundle that forms the rack's NVLink backplane
Sticking with copper rather than optics might seem an odd choice, especially given the amount of bandwidth involved, but apparently all the retimers and transceivers needed to support optics would have added another 20kW to the system's already prodigious power draw.
That may also explain why the NVLink switch sleds sit between the two banks of compute nodes: doing so keeps cable lengths to a minimum.
3. Power, cooling, and management
At the very top of the rack sit a couple of 52-port Spectrum switches, each with 48 gigabit RJ45 ports and four QSFP28 100Gbps aggregation ports. As far as we can tell, these are used for management and for streaming telemetry from the compute nodes, NVLink switch sleds, and power shelves that make up the system.
▲At the top of the NVL72 are a couple of switches and three of its six power shelves
Directly below those switches is the first of the six power shelves visible from the front of the NVL72: three toward the top of the rack and three at the bottom. We don't know much about them beyond the fact that they keep the 120kW rack fed with power.
By our estimate, six 415V, 60A supplies would be enough to cover that. Nvidia or its hardware partners have presumably built some degree of redundancy into the design, however, which leads us to believe the shelves may be rated for more than 60A.
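For context, here is the arithmetic behind that estimate, treating each shelf as a single 415V, 60A feed (a simplification; the actual shelf design and conversion losses have not been disclosed):

```python
# Power-shelf sanity check. Treating each shelf as a plain 415V x 60A feed
# is a simplifying assumption; real shelves may be three-phase rectifiers.
VOLTS, AMPS, SHELVES = 415, 60, 6
RACK_KW = 120

per_shelf_kw = VOLTS * AMPS / 1000            # ~24.9 kW
total_kw = per_shelf_kw * SHELVES             # ~149 kW
n_minus_1_kw = per_shelf_kw * (SHELVES - 1)   # ~125 kW with one shelf down

print(f"{per_shelf_kw:.1f} kW per shelf, {total_kw:.0f} kW total, "
      f"{n_minus_1_kw:.0f} kW with one shelf failed")
```

At those ratings, six shelves cover 120kW with room to spare but only barely with one shelf lost, which is consistent with the guess that the shelves run above 60A if redundancy is a requirement.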
However the shelves are configured, power is delivered over a hyperscale-style DC bus bar that runs down the back of the rack. Look closely and you can just make out the bus bar running down the middle of the rack.
▲According to Jensen Huang, coolant is designed to be pumped through the rack at two liters per second
Of course, cooling 120kW of compute is no small task. But with chips running hotter and compute demands growing, a rising number of datacenter operators, including Digital Realty and Equinix, are expanding their support for high-density HPC and AI deployments.
In the NVL72's case, both the compute sleds and the NVLink switches are liquid-cooled. According to Huang, coolant enters the rack at 25°C at two liters per second and leaves 20 degrees warmer.
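Those two numbers are consistent with a 120kW rack. A minimal heat-balance sketch, assuming water-like coolant properties (the actual coolant chemistry isn't stated, and a glycol mix would carry somewhat less heat):

```python
# Heat removed by the coolant loop: Q = mass_flow * specific_heat * delta_T.
# Density and specific heat below assume plain water.
FLOW_L_PER_S = 2.0        # quoted flow rate
DELTA_T_K = 20.0          # quoted temperature rise (25C in, ~45C out)
DENSITY_KG_PER_L = 1.0    # assumption: water-like coolant
SPECIFIC_HEAT_J = 4186.0  # J/(kg*K), water

heat_kw = FLOW_L_PER_S * DENSITY_KG_PER_L * SPECIFIC_HEAT_J * DELTA_T_K / 1000
print(f"Heat carried away: ~{heat_kw:.0f} kW")   # ~167 kW, above the 120 kW load
```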
4. Scaling out
If the DGX GB200 NVL72's 1.44 exaFLOPS of compute isn't enough, eight of them can be networked together to form one big DGX SuperPod with 576 GPUs.
▲Linked by Quantum InfiniBand switches and paired with the cooling plant, these racks form the next-generation DGX SuperPod cluster. The DGX GB200 SuperPod uses a new, highly efficient liquid-cooled rack-scale architecture and, in its standard configuration, delivers 11.5 exaFLOPS of FP4 compute and 240TB of fast memory.
▲Eight DGX NVL72 racks can be strung together to form Nvidia's liquid-cooled DGX GB200 SuperPod
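The SuperPod figures scale linearly from a single rack. A quick sketch using the per-rack numbers above; the roughly 249TB derived here counts all LPDDR5x plus HBM3e and lands close to the 240TB of fast memory Nvidia quotes:

```python
# Scaling one NVL72 rack up to an eight-rack DGX GB200 SuperPod.
RACKS = 8
GPUS_PER_RACK = 72
RACK_EXAFLOPS_FP4 = 1.44
SUPERCHIPS_PER_RACK = 36
MEM_GB_PER_SUPERCHIP = 480 + 384     # LPDDR5x + HBM3e

gpus = RACKS * GPUS_PER_RACK                                        # 576
exaflops = RACKS * RACK_EXAFLOPS_FP4                                # ~11.5
mem_tb = RACKS * SUPERCHIPS_PER_RACK * MEM_GB_PER_SUPERCHIP / 1000  # ~249

print(f"{gpus} GPUs, ~{exaflops:.1f} exaFLOPS FP4, ~{mem_tb:.0f} TB memory")
```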
If still more compute is needed for large training workloads, additional SuperPods can be added to scale the system out further. That is exactly what Amazon Web Services is doing with Project Ceiba.
First announced in November, the AI supercomputer now uses Nvidia's DGX GB200 NVL72 as its template. When complete, the machine will reportedly pack 20,736 GB200 accelerators. What sets the system apart is that Ceiba will use AWS's homegrown Elastic Fabric Adapter (EFA) networking rather than Nvidia's InfiniBand or Ethernet kit.
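Assuming the 20,736 figure counts individual Blackwell GPUs (the wording leaves some room for interpretation), Ceiba would map onto NVL72 building blocks roughly like this:

```python
# How Project Ceiba could break down into NVL72 racks and SuperPods,
# assuming the 20,736 figure refers to individual Blackwell GPUs.
CEIBA_GPUS = 20_736
GPUS_PER_RACK = 72
RACKS_PER_SUPERPOD = 8

racks = CEIBA_GPUS // GPUS_PER_RACK           # 288 NVL72 racks
superpods = racks // RACKS_PER_SUPERPOD       # 36 SuperPods
print(f"{racks} racks, or {superpods} DGX GB200 SuperPods")
```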
Nvidia says its Blackwell parts, including the rack-scale systems, should start hitting the market later this year.
Source: The Register, "One rack. 120kW of compute. Taking a closer look at Nvidia's DGX GB200 NVL72 beast," March 21, 2024.
About us
Beijing Hansen Fluid Technology Co., Ltd. (北京漢深流體技術有限公司) is a contracted distributor for Danfoss datacenter products in China. Its offerings include the FD83 full-flow self-locking ball-valve coupling, UQD-series liquid-cooling quick disconnects, EHW194 EPDM liquid-cooling hose, solenoid valves, pressure and temperature sensors, and manifold production and integration services. Sitting at the intersection of China's digital-economy, East-Data-West-Compute, dual-carbon, and new-infrastructure strategies, the company focuses on building a highly skilled, experienced team of liquid-cooling engineers to deliver strong engineering design and customer service.
Product range: Danfoss liquid-cooling fluid connectors, EPDM hose, solenoid valves, pressure and temperature sensors, and manifolds.
Development plan: to become a datacenter liquid-cooling infrastructure solutions provider with in-house R&D, design, and manufacturing capability for coolant distribution units (CDU), secondary-side piping (SFN), and manifolds.
- For rack-server applications such as manifold-to-node and CDU-to-primary-loop connections, it offers manual and fully automatic quick disconnects in a range of bore sizes and locking mechanisms.
- For blade racks with high-availability and high-density requirements, it offers floating blind-mate connectors that automatically compensate for misalignment, enabling precise mating in confined spaces.
- UQD/UQDB universal quick disconnects newly built to the OCP standard will also make their debut, with support for high-volume delivery worldwide.