The annual OpenFabrics Alliance (OFA) Workshop is a premier means of fostering collaboration among those in the OpenFabrics community and advanced networking industry as a whole. Known for being the only event of its kind, the OFA Workshop allows attendees to discuss emerging fabric technologies, collaborate on future industry requirements, and address remaining challenges. The week-long event is made up of sessions covering a wide range of pressing topics, including talks related to InfiniBand and RDMA over Converged Ethernet (RoCE).
This year’s agenda featured sessions highlighting a variety of InfiniBand and RoCE updates and emerging applications. Below is a list of all OFA Workshop 2018 sessions covering RDMA technologies and the associated presentations.
Parav Pandit, Mellanox Technologies
Containerized environments need to use RDMA securely. To support this, RDMA over Converged Ethernet (RoCE) must operate in, and honor, network namespaces other than the default init_net. This session focused on recent and upcoming functionality and security enhancements for RoCE. The key areas to address for supporting RoCE devices in container environments span several modules of the InfiniBand stack, at minimum: the connection manager, user verbs, core, statistics, resource tracking, device discovery and visibility to applications, and net device migration across namespaces.
Xiaoyi Lu, The Ohio State University
Single Root I/O Virtualization (SR-IOV) technology has been steadily gaining momentum for high-performance interconnects such as InfiniBand. SR-IOV can deliver near-native performance but lacks locality-aware communication support. This talk presented an efficient approach to building HPC clouds based on MVAPICH2 and RDMA-Hadoop with SR-IOV. The talk highlighted high-performance designs of the virtual machine and container aware MVAPICH2 library over SR-IOV enabled HPC clouds. It also presented a high-performance virtual machine migration framework for MPI applications on SR-IOV enabled InfiniBand clouds. The presenter discussed how to leverage high-performance networking features (e.g., RDMA, SR-IOV) in cloud environments to accelerate data processing through the RDMA-Hadoop package. To show the performance benefits of the proposed designs, the presenters co-designed a scalable, distributed tool with MVAPICH2 for statistical evaluation of brain connectomes in the neuroscience domain. The tool runs on top of container-based cloud environments, uses RDMA interconnects natively, and delivers near-native performance.
Tzahi Oved, Mellanox Technologies
Memory registration enables contiguous memory regions to be accessed with RDMA. In this talk, they showed how this could be extended beyond access rights, for describing complex memory layouts. Many HPC applications receive regular structured data, such as a column of a matrix. In this case, the application would typically receive a chunk of data and scatter it by the CPU, or use multiple RDMA writes to transfer each element in-place. Both options introduce significant overhead. By using a memory region that specifies strided access, this overhead could be completely eliminated: the initiator posts a single RDMA write and the target HCA scatters each element into place. Similarly, standard memory regions cannot describe non-contiguous memory allocations, forcing applications to generate remote keys for each buffer. However, by allowing a non-contiguous memory region to span multiple address ranges, an application may scatter remote data with a single remote key. Using non-contiguous memory registration, such memory layouts may be created, accessed, and invalidated using efficient, non-privileged, user-level interfaces.
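The strided case above can be pictured with a small software model. This is only a sketch of the address arithmetic a strided memory region describes, run on the CPU for illustration. The function name `scatter_strided` and its parameters are hypothetical and are not part of any verbs API.

```python
def scatter_strided(memory, base, stride, elem_size, payload):
    """Model of a strided scatter: element i of payload lands at base + i * stride.

    memory is a bytearray standing in for the target buffer. The name and
    parameters are hypothetical; a real HCA would do this placement in hardware.
    """
    if elem_size <= 0 or stride < elem_size:
        raise ValueError("stride must be at least one element wide")
    if len(payload) % elem_size != 0:
        raise ValueError("payload must hold a whole number of elements")
    count = len(payload) // elem_size
    if count and base + (count - 1) * stride + elem_size > len(memory):
        raise ValueError("strided region does not fit in target memory")
    for i in range(count):
        dst = base + i * stride
        memory[dst:dst + elem_size] = payload[i * elem_size:(i + 1) * elem_size]
    return count


# A 3x4 row-major matrix of 1-byte elements; write column 1 in one operation.
matrix = bytearray(12)
scatter_strided(matrix, base=1, stride=4, elem_size=1, payload=b"\x0a\x0b\x0c")
```

Here the initiator supplies three contiguous bytes, and the stride of 4 (the row length) places them into one column, which is the work that would otherwise need a CPU scatter or three separate RDMA writes.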
Alex Rosenbaum, Mellanox Technologies
Dynamically-Connected (DC) transport is a combination of features from the existing UD and RC transports: DC can send every message to a different destination, as UD does, and it is also a reliable transport that supports RDMA and Atomic operations, as RC does. The crux of the transport is dynamically connecting and disconnecting on-the-fly in hardware when changing destinations. As a result, a DC endpoint may communicate with any peer, providing the full RC feature set, and maintain a fixed memory footprint regardless of the size of the network. In this talk, we present the unique characteristics of this new transport, and show how it could be leveraged to reach peak all-to-all communication performance. We will review the DC transport objects and their semantics, the Linux upstream DC API and its usage.
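The fixed-footprint point can be illustrated with simple counting. The sketch below assumes a full mesh in which every process communicates with every other process. The function names, and the default of one DC initiator plus one DC target per process, are hypothetical choices for illustration and are not figures from the talk.

```python
def rc_qps_per_process(nodes, procs_per_node):
    """RC needs one connected queue pair per remote peer in a full mesh."""
    total_procs = nodes * procs_per_node
    return total_procs - 1


def dc_objects_per_process(dc_initiators=1, dc_targets=1):
    """DC endpoint count is chosen by the application, independent of job size.

    The defaults are assumptions for illustration only.
    """
    return dc_initiators + dc_targets


for nodes in (16, 256, 4096):
    rc = rc_qps_per_process(nodes, procs_per_node=32)
    dc = dc_objects_per_process()
    print(f"{nodes} nodes: RC queue pairs per process = {rc}, DC objects = {dc}")
```

The RC count grows linearly with the number of processes in the job, while the DC count stays constant, which is the scaling property the abstract describes.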
Tzahi Oved, Mellanox Technologies
T10-DIF is a standard that defines how to protect the integrity of storage data blocks. Every storage block is followed by a Data Integrity Field (DIF). This field contains a CRC of the preceding block, the LBA (block number within the storage device) and an application tag. Normally the DIF is saved in the storage device along with the data block itself, so that it can later be used to verify data integrity.
Modern storage systems and adapters allow creating, verifying and stripping those DIFs while reading and writing data to the storage device, as requested by the user and supported by the OS. The T10-DIF offload RDMA feature brings this capability to RDMA-based storage protocols. Using this feature, RDMA-based protocols can request the RDMA device to generate, strip and/or verify the DIF while sending or receiving a message. The DIF operation is configured in a new Signature Memory Region. Every memory access using this MR (local or remote) results in a DIF operation on the data as it moves between wire and memory. This session will describe how to configure and operate this feature using the verbs API.
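To make the DIF contents concrete, here is a minimal software sketch of generating and verifying the 8-byte field for one block. It assumes the common layout of a 16-bit guard tag (CRC-16 with polynomial 0x8BB7), a 16-bit application tag and a 32-bit reference tag holding the low 32 bits of the LBA, all big-endian. The function names are hypothetical, and this models in software what the offload would do in hardware; it is not the verbs interface.

```python
import struct


def crc16_t10dif(data):
    """Bitwise CRC-16 with polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def make_dif(block, lba, app_tag=0):
    """Build the 8-byte DIF: guard tag, application tag, reference tag."""
    guard = crc16_t10dif(block)
    ref_tag = lba & 0xFFFFFFFF  # assumption: reference tag = low 32 bits of LBA
    return struct.pack(">HHI", guard, app_tag, ref_tag)


def verify_dif(block, dif, lba):
    """Return True if the guard CRC and reference tag match the block and LBA."""
    guard, _app_tag, ref_tag = struct.unpack(">HHI", dif)
    return guard == crc16_t10dif(block) and ref_tag == (lba & 0xFFFFFFFF)


block = bytes(range(256)) * 2          # one 512-byte data block
dif = make_dif(block, lba=42)          # "generate" on write
ok = verify_dif(block, dif, lba=42)    # "verify" on read
```

Stripping corresponds to dropping the 8 trailing bytes after a successful verify, so that the consumer sees only the 512-byte data block.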
Liran Liss, Mellanox Technologies
NVMe is a standard that defines how to access a solid-state storage device over PCI in a very efficient way. It defines how to create and use multiple submission and completion queues between software and the device over which storage operations are carried and completed.
NVMe over Fabrics (NVMe-oF) is a newer standard that maps NVMe to RDMA to allow remote access to storage devices over an RDMA fabric using the same NVMe language. Since NVMe queues look and act very much like RDMA queues, bridging between the two is a natural application. In fact, a couple of software packages today implement a target that bridges NVMe-oF to a local NVMe device.
The NVMe-oF Target Offload feature is such an implementation done in hardware. A supporting RDMA device is configured with the details of the queues of an NVMe device. An incoming client RDMA connection (QP) is then bound to those NVMe queues. From that point on, every IO request arriving over the network from the client is submitted to the respective NVMe queue without any software intervention, using PCI peer-to-peer access. This session will describe how to configure and operate this feature using verbs.
Xiaoyi Lu, The Ohio State University
The convergence of Big Data and HPC has been pushing innovation in accelerating Big Data analytics and management on modern HPC clusters. Recent studies have shown that the performance of Apache Hadoop, Spark, and Memcached can be significantly improved by leveraging high-performance networking technologies, such as Remote Direct Memory Access (RDMA). Most of these studies are based on 'DRAM+RDMA' schemes. On the other hand, Non-Volatile Memory (NVM) and NVMe-SSD technologies can support RDMA access with low latency, high throughput, and persistence on HPC clusters. NVMs and NVMe-SSDs provide the opportunity to build novel high-performance and QoS-aware communication and I/O subsystems for data-intensive applications. In this talk, we proposed new communication and I/O schemes for these data analytics stacks, which are designed with RDMA over NVM and NVMe-SSD. Our studies show that the proposed designs can significantly improve the communication, I/O, and application performance for Big Data analytics and management middleware, such as Hadoop, Spark, and Memcached. In addition, we will also discuss how to design QoS-aware schemes in these frameworks with NVMe-SSD.
Michael Aguilar, Sandia National Laboratories
In this presentation, we showed InfiniBand performance information gathered from a large Sandia HPC system, Skybridge. We showed detection of network hot spots that may affect data exchanges for tightly coupled parallel threads. We quantified the overhead cost (application impact) when data is being collected.
At Sandia Labs, we are continuing to develop an InfiniBand fabric switch port sampler that can be used to gather remote data from InfiniBand switches. Using coordinated InfiniBand switch and HCA port samplers, a real-time snapshot of InfiniBand traffic can be retrieved from the fabric on a large-scale HPC computing platform. Because data retrieval with LDMS is time-stamped and lightweight, production job runs can be instrumented to provide research data that can be used to specify computing platforms with improved data performance.
Our implementation of synchronous monitoring of large-scale HPC systems provides insights into how to improve computing performance. Our sampler takes advantage of the OpenFabrics software stack for metric gathering. The OFED stack supports a common inter-operable software stack that provides the inherent ability to gather traffic metrics from selected connection points within a network fabric. We use OFED MAD and UMAD to collect the remote switch port traffic metrics.
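As a rough illustration of how time-stamped port counter samples can expose hot spots, the sketch below turns two samples of a transmit-data counter into a throughput figure and flags heavily used ports. It assumes the counter counts 4-octet units, as InfiniBand's PortXmitData does, and that it does not wrap between samples. The function names, the sample layout and the 90% threshold are hypothetical and are not taken from the Sandia sampler.

```python
def port_throughput_bytes_per_s(first, second):
    """Throughput from two (timestamp_seconds, xmit_data_counter) samples.

    Assumes the counter is in units of 4 octets and did not wrap.
    """
    (t0, c0), (t1, c1) = first, second
    if t1 <= t0:
        raise ValueError("second sample must be later than the first")
    return (c1 - c0) * 4 / (t1 - t0)


def hot_ports(samples, link_bytes_per_s, threshold=0.9):
    """Return the sorted names of ports at or above the utilization threshold.

    samples maps a port name to a pair of (timestamp, counter) samples.
    """
    hot = []
    for port, (first, second) in samples.items():
        utilization = port_throughput_bytes_per_s(first, second) / link_bytes_per_s
        if utilization >= threshold:
            hot.append(port)
    return sorted(hot)


# Hypothetical one-second samples from two switch ports on a 1 GB/s link.
samples = {
    "switch1/port1": ((0.0, 0), (1.0, 250_000_000)),
    "switch1/port2": ((0.0, 0), (1.0, 1_000_000)),
}
busy = hot_ports(samples, link_bytes_per_s=1e9)
```

Synchronized timestamps across switch and HCA samplers matter here because the deltas from different ports are only comparable when they cover the same interval.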
The OFA Workshop is extremely valuable to InfiniBand Trade Association members and the fabrics community as a whole with an aim to identify, discuss and overcome the industry’s most significant challenges. We look forward to participating again next year. Videos of each presentation from the OFA Workshop 2018 are now available online on insideHPC.com.