We run our experiments on a 16-node SGI cluster equipped with Mellanox ConnectX-4 VPI adapter cards (EDR InfiniBand at 100 Gb/s and 100 GbE), a single QSFP port, and PCIe 3.0 x16. Each node comprises two NUMA nodes with two sockets each and 10 cores per socket. Each of the 40 CPUs is an Intel Xeon E5-2660 v3 core operating at 2.6 GHz.
Application Kernels and Benchmarks
For evaluation, we use micro-benchmarks and application kernels. The details of the application kernels are provided in Sect. 5.3. Here we present the experimental results and discuss them in detail in Sect. 6.
To evaluate latency, bandwidth, and message rate, we modify benchmarks from the OSU micro-benchmark suite. The modifications replace the blocking SHMEM interfaces with non-blocking implicit and explicit RMA operations.
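For illustration, the interface change can be sketched as follows in C, assuming the OpenSHMEM 1.3 API (buffer names and the message size are our illustrative choices, and building the sketch requires an OpenSHMEM implementation):

```c
#include <shmem.h>

#define NBYTES 8                        /* illustrative message size */

static char src[NBYTES], dst[NBYTES];   /* symmetric buffers */

int main(void) {
    shmem_init();
    int peer = (shmem_my_pe() + 1) % shmem_n_pes();

    /* blocking form used by the unmodified benchmark: returns only
       after the source buffer may be reused */
    shmem_putmem(dst, src, NBYTES, peer);

    /* non-blocking explicit form: the call returns immediately, and a
       later shmem_quiet() forces completion of all outstanding puts */
    shmem_putmem_nbi(dst, src, NBYTES, peer);
    /* ... computation can overlap with the transfer here ... */
    shmem_quiet();

    shmem_finalize();
    return 0;
}
```

The explicit variant separates initiation from completion, which is what allows the modified benchmarks to measure overlap and message rate rather than per-call blocking latency.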
We modify the latency benchmark to perform a ping-pong exchange. The ping-pong benchmark first sends the data from the origin PE to the remote PE. The remote PE waits for the data by calling shmem_wait on the last byte of the message, then sends a response back to the origin PE. Although waiting on the last byte may not reflect arrival of the complete message on networks that do not guarantee in-order delivery, Mellanox's InfiniBand network with the Reliable Connection (RC) transport protocol does guarantee in-order delivery.
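The modified ping-pong exchange can be sketched as follows, again assuming the OpenSHMEM 1.3 API (the timing loop is omitted; variable names, the two-PE layout, and the use of a long-typed wait on the final element, standing in for the last byte, are our illustrative choices, and an OpenSHMEM runtime is required):

```c
#include <shmem.h>
#include <string.h>

#define NELEMS 128                      /* illustrative message size */

static long msg[NELEMS], ack[NELEMS];   /* symmetric buffers */

int main(void) {
    shmem_init();
    int me = shmem_my_pe();             /* assume exactly two PEs: 0 and 1 */

    memset(ack, 0, sizeof ack);
    long payload[NELEMS];
    for (int i = 0; i < NELEMS; i++) payload[i] = 1;
    shmem_barrier_all();

    if (me == 0) {
        /* origin PE: non-blocking put of the data to the remote PE */
        shmem_putmem_nbi(msg, payload, sizeof msg, 1);
        /* wait for the response; polling only the last element is safe
           solely because RC transport delivers in order */
        shmem_long_wait_until(&ack[NELEMS - 1], SHMEM_CMP_NE, 0);
    } else {
        /* remote PE: wait on the tail of the message, then respond */
        shmem_long_wait_until(&msg[NELEMS - 1], SHMEM_CMP_NE, 0);
        shmem_putmem_nbi(ack, payload, sizeof ack, 0);
        shmem_quiet();                  /* complete the response put */
    }

    shmem_finalize();
    return 0;
}
```

Half of the measured round-trip time then gives the one-way latency, as in the standard OSU methodology.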