Guidelines for Developing Efficient BDMPI Programs

In this section we provide some guidelines for developing parallel MPI programs that leverage BDMPI's execution and memory model in order to achieve good out-of-core execution.

Use collective communication operations

BDMPI's implementation of MPI's collective communication operations has been optimized so that a process becomes runnable only when all the data that it requires is locally available. As a result, when the process is scheduled for execution, it retrieves that data and resumes execution without needing to block again within the same collective operation. This minimizes the time spent saving/restoring the active parts of each process's address space to/from disk, resulting in fast out-of-core execution. For this reason, the application should structure its communication patterns so that they use collective communication operations whenever possible.
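
As an illustration, consider summing per-process partial results at a root process. The sketch below contrasts a hand-rolled point-to-point version with a single collective; it is a minimal example that assumes the standard MPI calls shown are among the subset that BDMPI implements, and the function names are illustrative only.

    #include <mpi.h>
    #include <stdlib.h>

    /* Preferred: a single collective. Each process becomes runnable only once
       all of the data it needs is locally available, so it does not block
       again inside the operation. */
    void sum_to_root(double *local, double *global, int n, MPI_Comm comm)
    {
      MPI_Reduce(local, global, n, MPI_DOUBLE, MPI_SUM, 0, comm);
    }

    /* Avoid: hand-rolled accumulation with point-to-point messages; each
       MPI_Recv is a separate blocking point that can trigger another
       save/restore cycle of the process's address space. */
    void sum_to_root_p2p(double *local, double *global, int n, MPI_Comm comm)
    {
      int rank, npes;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &npes);

      if (rank == 0) {
        double *tmp = (double *)malloc(n*sizeof(double));
        for (int i=0; i<n; i++)
          global[i] = local[i];
        for (int p=1; p<npes; p++) {
          MPI_Recv(tmp, n, MPI_DOUBLE, p, 0, comm, MPI_STATUS_IGNORE);
          for (int i=0; i<n; i++)
            global[i] += tmp[i];
        }
        free(tmp);
      }
      else {
        MPI_Send(local, n, MPI_DOUBLE, 0, 0, comm);
      }
    }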

Group blocking communication operations together

When a running process blocks on a blocking MPI operation, the parts of the address space that it accessed since the last time it was scheduled will most likely be unmapped from physical memory. When the process's blocking condition is lifted (e.g., the data it was waiting for has been received) and it is scheduled for execution, it will load the parts of the address space associated with the computations that it will perform until the next time it blocks. Note that even if the process needs to access the same parts of the address space as in the previous step, they still have to be remapped from disk into physical memory because of the earlier unmapping.

The cost of these successive unmapping/remapping operations can potentially be reduced by restructuring the computations so that, when an application needs to perform multiple blocking communication operations, it performs them one after the other with little computation between them. Such restructuring may not always be possible, but when it can be done it leads to considerable performance improvements.
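
As a minimal sketch of this restructuring, the fragment below performs two blocking broadcasts back to back and defers all of the computation that uses the received data until after the second call; the buffer and function names are illustrative, and the MPI calls are assumed to be among those BDMPI supports.

    #include <mpi.h>

    void exchange_then_compute(double *a, double *b, int n, int root, MPI_Comm comm)
    {
      /* Perform the two blocking operations one after the other, so the pages
         touched by the surrounding computation are saved/restored only once
         instead of once per blocking call. */
      MPI_Bcast(a, n, MPI_DOUBLE, root, comm);
      MPI_Bcast(b, n, MPI_DOUBLE, root, comm);

      /* ... all computation that uses a and b goes here, rather than being
         interleaved between the two broadcasts ... */
    }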

Freeing scratch memory

If a process, after returning from a blocking communication operation, proceeds to overwrite some of the memory that it allocated previously without first reading from it, then the cost of saving that memory to disk and restoring it from disk is entirely wasted. In such cases, it is better to free that memory prior to performing the blocking communication operation and to re-allocate it after returning from it. Alternatively, if the allocation is handled by the sbmalloc subsystem (Efficient loading & saving of a process's address space), an application can use the BDMPI_sbdiscard() function (Storage-backed memory allocations) to inform the sbmalloc subsystem that the memory associated with the provided allocation does not need to be saved and restored during the next block/resume cycle.
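
The fragment below sketches the free/re-allocate pattern using standard C allocation calls; it assumes the scratch allocation is large enough to be handled by the sbmalloc subsystem, and it does not show BDMPI_sbdiscard(), whose exact prototype is given in Storage-backed memory allocations.

    #include <mpi.h>
    #include <stdlib.h>

    void iteration_step(double *local, double *global, int n, MPI_Comm comm)
    {
      /* Scratch buffer whose contents are not needed across the blocking call. */
      double *scratch = (double *)malloc(n*sizeof(double));

      /* ... fill scratch and use it to compute the local contribution ... */

      /* Free the scratch space first, so its pages are not saved to disk when
         this process blocks inside the collective. */
      free(scratch);

      MPI_Allreduce(local, global, n, MPI_DOUBLE, MPI_SUM, comm);

      /* Re-allocate the scratch space after the process resumes execution. */
      scratch = (double *)malloc(n*sizeof(double));

      /* ... continue using scratch ... */

      free(scratch);
    }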

Structure memory allocations in terms of active blocks

As discussed in Efficient loading & saving of a process's address space, when the application accesses an address within a memory area that was allocated by the sbmalloc subsystem, BDMPI loads into physical memory the entire allocation that contains that address (i.e., all the memory that was allocated by the malloc() call that returned the memory containing that address). Given this, the application should structure its computations and memory allocations so that it accesses most of the loaded data prior to performing a blocking operation. This will often involve breaking the memory allocations into smaller segments that include just the elements that will be accessed, and potentially restructuring the computations so that they exhibit segment-based spatial locality (i.e., if they access some data in an allocated segment, then they will most likely access all or most of the data in that segment).
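
A sketch of this idea follows: a large matrix is allocated as a set of per-block segments rather than one contiguous region, and each block is processed in its entirety before moving to the next, so that every segment loaded by sbmalloc is used in full before it is unmapped. The block count, structure, and function names are illustrative, and it is assumed (per Efficient loading & saving of a process's address space) that these malloc() calls are serviced by the sbmalloc subsystem.

    #include <stdlib.h>

    #define NBLOCKS 64   /* illustrative number of row blocks */

    typedef struct {
      size_t nrows, ncols;
      double *blocks[NBLOCKS];   /* one allocation (segment) per row block */
    } blocked_matrix_t;

    blocked_matrix_t *alloc_blocked(size_t nrows, size_t ncols)
    {
      blocked_matrix_t *mat = (blocked_matrix_t *)malloc(sizeof(*mat));
      size_t rpb = (nrows + NBLOCKS - 1)/NBLOCKS;   /* rows per block */

      mat->nrows = nrows;
      mat->ncols = ncols;
      for (int b=0; b<NBLOCKS; b++)
        mat->blocks[b] = (double *)malloc(rpb*ncols*sizeof(double));
      return mat;
    }

    /* Process one block at a time, touching all of its data before moving on,
       so each loaded segment is used in full before it is unmapped. */
    void scale_blocked(blocked_matrix_t *mat, double alpha)
    {
      size_t rpb = (mat->nrows + NBLOCKS - 1)/NBLOCKS;

      for (int b=0; b<NBLOCKS; b++) {
        size_t first = b*rpb;
        size_t last  = (first + rpb < mat->nrows ? first + rpb : mat->nrows);
        for (size_t i=first; i<last; i++)
          for (size_t j=0; j<mat->ncols; j++)
            mat->blocks[b][(i-first)*mat->ncols + j] *= alpha;
      }
    }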