Figures 4 and 5 compare the round-trip time of shmem_put, shmem_put_nbi, and shmem_put_nbe for small and large messages, respectively. The origin PE sends a ping using shmem_put, shmem_put_nbi, or shmem_put_nbe to the destination PE and then waits on a corresponding pong using shmem_int_wait_until. On receiving the ping, the destination PE responds with a pong through a Put. The target PE waits on the last byte of the message.

Fig. 2. Comparing performance of shmem_getmem, shmem_getmem_nbi, and shmem_getmem_nbe

Fig. 3. Comparing performance of the OpenSHMEM OSU shmem get many benchmark using 64 PEs

Fig. 4. Roundtrip latency using put-based ping-pong benchmark for small messages
Fig. 5. Roundtrip latency using put-based ping-pong benchmark for large messages
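The ping-pong exchange described above can be sketched as follows. This is a minimal illustration using the standard blocking shmem_putmem path only; the message size, buffer names, and the volatile spin on the last byte (in place of a typed shmem_wait_until call) are our assumptions, and the last-byte poll is valid only under the in-order delivery the RC transport provides.

```c
/* Hedged sketch of the put-based ping-pong benchmark; requires an
 * OpenSHMEM implementation (compile with oshcc, launch with oshrun). */
#include <shmem.h>
#include <string.h>

#define MSG_SIZE 4096           /* illustrative message size */

static char ping[MSG_SIZE];     /* symmetric (statically allocated) buffers */
static char pong[MSG_SIZE];

int main(void) {
    shmem_init();
    int me = shmem_my_pe();
    shmem_barrier_all();

    if (me == 0) {
        static char payload[MSG_SIZE];
        memset(payload, 1, MSG_SIZE);
        shmem_putmem(ping, payload, MSG_SIZE, 1); /* send ping to PE 1 */
        shmem_quiet();                            /* remote completion */
        /* Poll the last byte of the pong; correct only because the RC
         * transport guarantees in-order delivery of the message body. */
        volatile char *last = &pong[MSG_SIZE - 1];
        while (*last != 1)
            ;
    } else if (me == 1) {
        volatile char *last = &ping[MSG_SIZE - 1];
        while (*last != 1)
            ;                                     /* wait for the ping */
        static char payload[MSG_SIZE];
        memset(payload, 1, MSG_SIZE);
        shmem_putmem(pong, payload, MSG_SIZE, 0); /* respond with a pong */
        shmem_quiet();
    }

    shmem_barrier_all();
    shmem_finalize();
    return 0;
}
```

Timing the loop around PE 0's put-and-wait and halving the result yields the one-way latency numbers reported in Figs. 4 and 5.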
For our experiments we use a Mellanox InfiniBand HCA as the network and the RC (Reliable Connection) protocol for data transfer, which guarantees in-order delivery of messages. For this setup, polling on the last byte of the data to detect completion is a reasonable approach, although it may be incorrect on networks and memory architectures that do not guarantee in-order delivery of messages. For completion, the shmem_put and shmem_put_nbi calls require a shmem_quiet, while shmem_put_nbe requires a shmem_wait_req on the request.
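The differing completion requirements can be summarized in a short fragment. Note that shmem_put_nbe and shmem_wait_req are the extensions evaluated here rather than standard OpenSHMEM, so the exact signatures and the request-handle type shown below are assumptions for illustration:

```c
/* Completion semantics (illustrative fragment; the *_nbe/_wait_req
 * signatures and shmem_request_t are assumed, not standard API). */
shmem_putmem(dst, src, nbytes, peer);      /* blocking put: source buffer
                                              reusable on return */
shmem_quiet();                             /* remote completion of puts */

shmem_putmem_nbi(dst, src, nbytes, peer);  /* non-blocking, implicit */
shmem_quiet();                             /* completes ALL outstanding
                                              non-blocking operations */

shmem_request_t req;                       /* assumed handle type */
shmem_putmem_nbe(dst, src, nbytes, peer, &req);
shmem_wait_req(&req);                      /* completes only this transfer */
```

The explicit handle is what lets shmem_put_nbe avoid the all-or-nothing cost of shmem_quiet when only one transfer needs to finish.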
From the graphs, one can observe some performance differences. For a one-byte message, the round-trip latencies of shmem_put, shmem_put_nbi, and shmem_put_nbe are 1.58 µsec, 1.54 µsec, and 1.52 µsec, respectively. For a 4 MB message, the latencies are 753.29 µsec, 704.54 µsec, and 685.65 µsec, respectively. The performance difference for small messages is negligible.