Performance Evaluation of RMA Operations with Requests and Merged Requests Using Micro-Benchmarks
Latency of Get Operations
In this experiment, we compare the performance of the shmem_getmem, shmem_getmem_nbi, and shmem_getmem_nbe operations. The origin PE issues the Get operation and waits for it to complete. With shmem_getmem, the data is available when the call returns. With shmem_getmem_nbi and shmem_getmem_nbe, the origin PE waits for shmem_quiet and shmem_wait_req to complete, respectively. Figure 2 shows that the latency of all three Get operations is similar.
To understand the performance impact of the global completion routines (shmem_quiet and shmem_barrier) used to complete implicit operations, we modify the Get benchmark to issue multiple Get operations. The origin PE issues Get operations to multiple PEs and waits for completion on only one PE. From Fig. 3 we observe that, as expected, RMA operations with requests outperform both implicit non-blocking RMA operations and blocking RMA operations.
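The intuition behind this ordering can be illustrated with a toy cost model. The sketch below is not the benchmark itself and uses no measured data. It assumes that every Get to target PE i takes a fixed time lat[i], that non-blocking Gets overlap perfectly, and that issue overhead and network contention are negligible. The function names cost_blocking, cost_implicit, and cost_request are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cost model, not measured data: lat[i] is the assumed
 * time for a Get from target PE i to complete once it is issued. */

/* Blocking Gets: each call returns only after its data has arrived,
 * so the transfers are serialized and their costs add up. */
double cost_blocking(const double *lat, size_t n) {
    assert(lat != NULL && n > 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += lat[i];
    }
    return total;
}

/* Implicit non-blocking Gets followed by a global completion call:
 * the origin waits for every outstanding Get, so (assuming perfect
 * overlap) the cost is that of the slowest transfer. */
double cost_implicit(const double *lat, size_t n) {
    assert(lat != NULL && n > 0);
    double slowest = lat[0];
    for (size_t i = 1; i < n; i++) {
        if (lat[i] > slowest) {
            slowest = lat[i];
        }
    }
    return slowest;
}

/* Request-based Gets: the origin waits only on the request for the
 * single target it needs, given by wait_idx. */
double cost_request(const double *lat, size_t n, size_t wait_idx) {
    assert(lat != NULL && wait_idx < n);
    return lat[wait_idx];
}
```

Under these assumptions the request-based cost never exceeds the implicit cost, which never exceeds the blocking cost, and all three coincide when only one Get is issued. This matches the qualitative behavior described for the two experiments, but the model says nothing about the actual magnitudes.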