From Zero to Hero: Mastering High-Speed Networking with NVIDIA Networking

The Value of Infiniband

Infiniband and Related Technical Background

  • Ethernet: Started in 1973, the most popular and successful network ecosystem to date, widely used for interconnecting computers.
  • Infiniband: Started in 1999, it is the best performing network system, and this technology is designed for high-performance application scenarios.
  • RoCE: Started in 2010, the transplantation of IB RDMA features on Ethernet technology

NVIDIA Infiniband Switch

Image Source: Pexels

Infiniband Core Value

Highest application performance

  • High bandwidth (Gbps)
  • Low latency (ns/p2p, ns/hop)
  • High packet throughput (Mpps)
  • Network computing-SHARP (4x better in MPI latency)

Maximized CPU utilization

Unload CPU workload, no need for CPU to intervene in data plane processing-Native RDMA

Optimal network utilization

Availability (HA) = SHIELD/adaptive routing/congestion control

Scalability-native software-defined network, centralized management, network scale expansion without additional overhead

 

Infiniband Performance Advantages

Hardware Dimensions

lantencyIB (ConnectX-6 HDR)RoCE (ConnectX-6 Dx 200GE)TCP/IP (ConnectX-6 Dx 200GE)
Back-to-back interconnection~700ns~900ns4000-5000ns
 Through the switch700 + ~130ns900+ ~300ns5000+ ~300ns
Bandwidth vs CPU usageIB (ConnectX-6 HDR)RoCE (ConnectX-6 Dx 200GE)TCP/IP(ConnectX-6 Dx 200GE)
 CPU cores occupied1 core1 core24 cores
Bandwidth197Gbps189Gbps190Gbps

Software Dimension

  • In-Network Computing-SHARP feature reduces latency brought by distributed computing
  • SHIELD-fast network link self-repair, improve network availability
  • Adaptive Routing-optimal network routing selection, reduce latency and congestion

MPI AllReduce Latency Performance (128 nodes)

Image Source: NVIDIA

Ecological Dimension

  • HPC TOP500 ranking leading network technology supplier
  • Open MPI based HPC-X/UCX, a large cluster communication framework widely verified in the HPC/AI field
  • Practices on 6000*32(N*PPN) cluster, based on HPC-X 2.7/UCX 1.9/Slurm 20.11 in China domestic customer

Infiniband Products

IB card/nvidia infiniband switches/router/gateway

NVIDIA Infiniband Products

Image Source: NVIDIA

(Read “NVIDIA Networking – InfiniBand Solutions” for additional information)

Network Management Tool – UFM (Unified Fabric Manager)

UFM Telemetry

A component of NVIDIA’s Unified Fabric Manager (UFM) that specializes in detailed network monitoring and performance data collection.

  • Real-time Data Collection: Gathers real-time metrics such as bandwidth, latency, and packet loss.
  • Visualization: Offers tools for visualizing network performance through charts and reports.
  • Historical Data Storage: Stores historical performance data for trend analysis.

Alert Mechanism: Sends automated alerts for anomalies or performance bottlenecks.

UFM Enterprise

Designed for large-scale enterprises and data centers, providing comprehensive network management and optimization features.

  • Centralized Management: Unified platform for monitoring and managing the entire network infrastructure.
  • Automated Configuration: Reduces complexity and human error through automated network configuration.
  • High Scalability: Capable of managing thousands of nodes in large network environments.
  • Performance Optimization: Continuously monitors and optimizes network performance.
  • Comprehensive Reporting: Generates detailed reports on network health and performance.

UFM Cyber AI

An advanced security module within UFM that leverages artificial intelligence to enhance network security and reliability.

  • Anomaly Detection: Uses AI and machine learning to identify abnormal behaviors and potential security threats.
  • Automated Protection: Automatically implements countermeasures like blocking suspicious traffic or isolating compromised nodes.
  • Threat Intelligence: Integrates up-to-date threat intelligence to combat new types of attacks.
  • Security Reporting: Provides detailed reports on security incidents for analysis.
  • Continuous Learning: AI models continuously learn and adapt to evolving threats. (Read “NVIDIA Unified Fabric Manager (UFM)for additional information)

 

Infiniband key features:

Lossless

  • Physical: Lower bit error rate target 10e-15 (1000 times that of Ethernet)
  • Link: Credit-based port-to-port flow control to ensure sufficient buffer on the receiving side and no active packet loss

NVIDIA Infiniband Feature of Lossless

Image Source: NVIDIA

SHARP (Scalable Hierarchical Aggregation and Reduction Protocol)

It is a technology used to improve communication efficiency in high-performance computing (HPC) and artificial intelligence (AI) applications. It mainly improves the overall performance of computing clusters by optimizing data aggregation and reducing communication overhead.

Specifically, SHARP technology plays a key role in the following aspects:

  • Data aggregation: In a distributed computing environment, many operations require data from multiple nodes to be aggregated together, such as gradient aggregation in deep learning. SHARP reduces the transmission time and bandwidth consumption of data in the network by implementing these aggregation operations in the network layer.
  • Hierarchical communication: SHARP adopts a hierarchical communication strategy to effectively utilize the network topology. For example, in a multi-level switch structure, SHARP can perform partial data processing on the switch node, reducing the amount of data transmitted to the computing node.
  • Hardware acceleration: SHARP implements some key operations on the network hardware, such as directly supporting these aggregation and reduction operations on nvidia InfiniBand switches, which avoids multiple transmissions of data in the network and improves overall communication efficiency.
  • Scalability: SHARP was designed to work in large clusters, and by optimizing communication patterns, it can remain efficient when scaled to thousands or even tens of thousands of nodes.

Adaptive routing

NVIDIA Adaptive Routing is an advanced networking technology designed to optimize data flow in high-performance computing (HPC) and data center environments.

  • Dynamic Path Selection:

It dynamically selects the most efficient path for data packets to travel through the network. This selection is based on real-time traffic conditions and network congestion.

  • Real-Time Monitoring:

The technology continuously monitors the status of the network, allowing it to make informed decisions on the best paths for data transmission.

  • Avoiding Bottlenecks:

By rerouting data to avoid congested paths, Adaptive Routing helps prevent bottlenecks, ensuring smoother data flow.

Buy High-end NVIDIA Infiniband Products @Router-swith.com

SHIELD ( Self-Healing Interconnect)

It is designed to provide high-performance entertainment experiences, including streaming video, gaming, and smart home integration. Below are the key features and components of NVIDIA SHIELD:

Streaming Media Player:

  • 4K HDR Streaming: SHIELD supports 4K HDR content, providing stunning video quality for streaming services like Netflix, Amazon Prime Video, Disney+, and more.
  • AI Upscaling: It features AI-enhanced upscaling technology that can improve the quality of HD video content to near 4K resolution.

Gaming:

  • Cloud Gaming: NVIDIA SHIELD supports GeForce NOW, NVIDIA’s cloud gaming service that allows users to stream a vast library of PC games directly to the device.
  • Android Gaming: It also supports Android games, giving users access to a wide variety of titles from the Google Play Store.
  • GameStream: Users can stream games from their GeForce-powered PC to their SHIELD device with low latency, effectively turning their SHIELD into a powerful gaming console.

Smart Home Integration:

  • Google Assistant: SHIELD comes with built-in Google Assistant, enabling voice control for searching content, controlling smart home devices, and accessing various services.
  • SmartThings Hub: It can act as a SmartThings hub, allowing users to control compatible smart home devices directly from the SHIELD.

GPUDirect RDMA (Remote Direct Memory Access)

GPUDirect RDMA is an NVIDIA technology that enables direct communication between GPU memory and third-party devices (e.g., network adapters, storage devices) without involving the CPU.

Key Features:

  • Direct Data Transfer: Enables direct transfers between GPU memory and remote devices, bypassing the CPU.
  • Reduced Latency: Minimizes data transfer latency, crucial for real-time applications.
  • Improved Bandwidth: Leverages high-speed interconnects for higher throughput.
  • Enhanced Scalability: Supports multi-GPU and multi-node configurations for large-scale parallel computing.
  • Compatibility: Works with technologies like InfiniBand and RoCE.

Network topology design

Support multiple network topologies

● Fat-Tree/DragonFly+

Topology: FatTree and Dragonfly

 

Image Source:NVIDIA

● Fat-Tree vs DragonFly+

Fat-TreeDragonfly+
Applicable cluster size (non-blocking)Unlimited>800 nodes (For HDR)

>2K nodes (For NDR)

Network costHighModerate
Advantagesa.Simple networking

b.Network balance, simple routing algorithm

c.No risk of Credit loop deadlock

a. Low cost (saving switches and cables)

b.Flexible networking, easy to expand, suitable for phased construction

Applicable scenarios Unlimiteda.Large clusters built in phasesb.

b.Supercomputing cloud, multi-tenant sublease model cluster

Current status of industry applicationsMost widely used, with the highest market shareDomestic supercomputing customers have just started to practice DF+

Conclusion

NVIDIA InfiniBand has emerged as a critical technology for high-performance computing, offering unparalleled speed, low latency, and efficient data transfer. Understanding the basics of InfiniBand enables organizations to optimize their data centers and achieve superior network performance.

For those looking to invest in high-quality networking equipment, Router-switch.com is a trusted resource. Renowned for offering top-notch products at competitive prices for 22 years, like nvidia infiniband switches. We ensure to offer reliable and cost-effective solutions for your networking needs.

Share This Post

Post Comment