BDMPI - Big Data Message Passing Interface  Release 0.1
Running BDMPI Programs

For the purpose of the discussion in this section, we will use the simple "Hello world" BDMPI program that is located in the file test/helloworld.c of BDMPI's source distribution. This program is listed below:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <bdmpi.h>

int main(int argc, char *argv[])
{
  int i, myrank, npes;

  MPI_Init(&argc, &argv);

  MPI_Comm_size(MPI_COMM_WORLD, &npes);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

  for (i=0; i<npes; i++) {
    MPI_Barrier(MPI_COMM_WORLD);
    if (myrank == i) {
      printf("[%5d:%5d] Hello from rank: %2d out of %2d\n", 
          (int)getppid(), (int)getpid(), myrank, npes);
      fflush(stdout);
    }
  }
  MPI_Barrier(MPI_COMM_WORLD);

  MPI_Finalize();

  /* It is important to exit with an EXIT_SUCCESS status */
  return EXIT_SUCCESS; 
}

This program has each process print its parent and its own process ID, along with its rank in MPI_COMM_WORLD and the size of that communicator. Note that the calls to MPI_Barrier() are used to ensure that the output from the different processes is displayed in an orderly fashion.

This program can be compiled by simply executing

bdmpicc -o helloworld test/helloworld.c

and is also built automatically during BDMPI's build process, after which it resides in the build/Linux-x86_64/test directory.


Running on a single node

BDMPI provides the bdmprun command to run BDMPI programs on a single node. For example, the following command

bdmprun -ns 4 build/Linux-x86_64/test/helloworld

will create a single master process that will spawn four slave processes (due to the -ns 4 option), each executing the helloworld program. Here is a sample output of the above execution:

[28933:28936] Hello from rank:  0 out of  4
[28933:28937] Hello from rank:  1 out of  4
[28933:28938] Hello from rank:  2 out of  4
[28933:28939] Hello from rank:  3 out of  4
[ bd4-umh: 28933]------------------------------------------------
[ bd4-umh: 28933]Master 0 is done.
[ bd4-umh: 28933]------------------------------------------------
[ bd4-umh: 28933]Master timings
[ bd4-umh: 28933]    totalTmr:      0.022s
[ bd4-umh: 28933]    routeTmr:      0.001s
[ bd4-umh: 28933]------------------------------------------------

The first four lines came from the helloworld program itself, whereas the remaining output lines came from bdmprun. Looking at this output, we can see that all four processes share the same parent process, which corresponds to the master process, bdmprun. The master process reports the timing statistics at the end of the execution of helloworld.


Running on multiple nodes

In order for a program to run on multiple nodes, bdmprun needs to be combined with the mpiexec command that is provided by MPI. For example, the following command

mpiexec -hostfile ~/machines -np 3 bdmprun -ns 4 build/Linux-x86_64/test/helloworld

will start three master processes (due to the -np 3 option) and each master process will spawn four slaves, resulting in a total of 12 processes executing the helloworld program. Here is a sample output of the above execution:

[  331:  334] Hello from rank:  0 out of 12
[  331:  335] Hello from rank:  1 out of 12
[  331:  336] Hello from rank:  2 out of 12
[  331:  337] Hello from rank:  3 out of 12
[27162:27165] Hello from rank:  4 out of 12
[27162:27166] Hello from rank:  5 out of 12
[27162:27167] Hello from rank:  6 out of 12
[27162:27168] Hello from rank:  7 out of 12
[23530:23533] Hello from rank:  8 out of 12
[23530:23534] Hello from rank:  9 out of 12
[23530:23535] Hello from rank: 10 out of 12
[23530:23536] Hello from rank: 11 out of 12
[ bd1-umh:   331]------------------------------------------------
[ bd1-umh:   331]Master 0 is done.
[ bd2-umh: 27162]------------------------------------------------
[ bd2-umh: 27162]Master 1 is done.
[ bd3-umh: 23530]------------------------------------------------
[ bd3-umh: 23530]Master 2 is done.
[ bd1-umh:   331]------------------------------------------------
[ bd1-umh:   331]Master timings
[ bd1-umh:   331]    totalTmr:      0.058s
[ bd1-umh:   331]    routeTmr:      0.001s
[ bd1-umh:   331]------------------------------------------------

Notice that the parent process IDs of the different BDMPI processes reveal the master-slave relationship (e.g., ranks 0–3, 4–7, and 8–11 each share the same parent process). Also, the output lines generated by bdmprun indicate the machine on which each master process was executed (e.g., bd1-umh, bd2-umh, and bd3-umh) and provide overall timing information. The names of these machines are specified in the machines file that was provided as the -hostfile of mpiexec, following mpiexec's host machine specification guidelines (see MPICH documentation).

Do not configure your hostfile to have mpiexec start multiple processes on the same node. If you do, then each node will have multiple master processes running on it (i.e., multiple instances of bdmprun), each of which will have its own set of slave processes. If you wish to have multiple slave processes run at once on the same node, use the -nr option instead.
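
For example, assuming a hostfile that lists each of three nodes exactly once, an invocation along the following lines (the slave and concurrency counts here are illustrative and should be chosen based on each node's cores and memory) keeps a single master per node while still allowing several slaves to run at once:

mpiexec -hostfile ~/machines -np 3 bdmprun -ns 8 -nr 4 build/Linux-x86_64/test/helloworld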


Options of bdmprun

There are a number of optional parameters that control bdmprun's execution. You can get a list of these options by simply executing bdmprun -h, which will display the following:

 
Usage: bdmprun [options] exefile [options for the exe-file]
 
 Required parameters
    exefile     The program to be executed.
 
 Optional parameters
  -ns=int [Default: 1]
     Specifies the number of slave processes on each node.
 
  -nr=int [Default: 1]
     Specifies the maximum number of concurrently running slaves.
 
  -nc=int [Default: 1]
     Specifies the maximum number of slaves in a critical section.
 
  -sm=int [Default: 20]
     Specifies the number of shared memory pages allocated for each slave.
 
  -im=int [Default: 4]
     Specifies the maximum size of a message that will be buffered
     in the memory of the master. Messages longer than that are buffered
     on disk. The size is in terms of memory pages.
 
  -mm=int [Default: 32]
     Specifies the maximum size of the buffer to be used by MPI for 
     inter-node communication. The size is in terms of memory pages.
 
  -sb=int [Default: 32]
     Specifies the size of allocations for which the explicit storage backed
     subsystem should be used. The size is in terms of number of pages and a
     value of 0 turns it off.
 
  -wd=string [Default: /tmp/bdmpi]
     Specifies where working files will be stored.
 
  -dl=int [Default: 0]
     Selects the dbglvl.
 
  -h
     Prints this message.

Options related to the execution environment

The -ns, -nr, -nc, and -wd options are used to control the execution environment of a BDMPI job.

The -ns option specifies the number of slaves that bdmprun will spawn on each node. These slaves then proceed to execute the provided BDMPI program (i.e., exefile). If multiple instances of bdmprun are started via mpiexec, then the total number of processes that are involved in the parallel execution of exefile is np*ns, where np is the number of processes specified in mpiexec and ns is the number of slaves supplied via the -ns option. In other words, the size of the MPI_COMM_WORLD communicator is np*ns.
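
For example, assuming a hostfile listing two nodes, the following (illustrative) invocation starts two masters with eight slaves each, so MPI_Comm_size() on MPI_COMM_WORLD reports 2*8 = 16 processes:

mpiexec -hostfile ~/machines -np 2 bdmprun -ns 8 build/Linux-x86_64/test/helloworld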

The -nr option specifies the number of slave processes spawned by a single bdmprun process that are allowed to execute concurrently. This option enables BDMPI's node-level cooperative multitasking execution model, which is designed to ensure that the aggregate memory requirements of the concurrently executing processes do not overwhelm the amount of physical memory available on the system. The default value for -nr is one, allowing only one slave per master process to execute at a time. However, on multi-core systems, -nr can be increased as long as the aggregate amount of memory required by the concurrently running processes fits comfortably within the available memory of the system. For example,

mpiexec -hostfile ~/machines -np 3 bdmprun -ns 4 -nr 2 build/Linux-x86_64/test/helloworld

will allow a maximum of two slave processes on each of the three nodes to execute concurrently. The value specified for -nr should be less than or equal to the value specified for -ns.

Besides the aggregate memory requirements, another issue that needs to be considered when the number of running processes is increased is the ability of the underlying I/O subsystem to handle concurrent I/O operations. On some systems, faster execution can be obtained by allowing multiple slave processes to run concurrently while reducing the concurrency allowed during I/O operations. To facilitate this, BDMPI provides a pair of API functions that implement critical sections, which limit the number of processes that can be inside a critical section at any given time. These are described in Intra-slave synchronization. The number of slave processes that are allowed to be running inside a critical section is controlled by bdmprun's -nc option, whose value should be less than or equal to the value specified for -nr.
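
The sketch below illustrates how a slave might protect an I/O-heavy phase with such a critical section. It assumes the entry/exit calls are the BDMPI_Entercritical()/BDMPI_Exitcritical() pair described in Intra-slave synchronization; consult that section for the exact interface.

#include <stdio.h>
#include <bdmpi.h>

/* Sketch: at most -nc slaves of a master can be inside the critical
   section at any given time, which limits the number of slaves that
   concurrently stress the I/O subsystem. */
void write_partial_results(const char *fname, const double *buf, size_t n)
{
  BDMPI_Entercritical();   /* assumed entry call; see Intra-slave synchronization */

  FILE *fp = fopen(fname, "wb");
  if (fp != NULL) {
    fwrite(buf, sizeof(double), n, fp);
    fclose(fp);
  }

  BDMPI_Exitcritical();    /* assumed exit call; lets another slave start its I/O */
}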

Finally, the -wd option specifies the name of the directory that BDMPI will use to store the various intermediate files that it generates. Since BDMPI's execution is primarily out-of-core, this directory should be on a drive/volume that has a sufficient amount of available storage and whose underlying hardware preferably supports fast I/O.
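
For instance, on a node with a large local scratch volume mounted at (a hypothetical) /scratch, the working directory could be redirected as follows:

bdmprun -ns 4 -wd /scratch/bdmpi build/Linux-x86_64/test/helloworld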

Options related to memory resources

The -sm, -im, -mm, and -sb options are used to control various aspects of the execution of a BDMPI program as they relate to how it uses memory. Among them, -im and -sb are the most important, whereas -sm and -mm are provided for fine-tuning and most users will not need to modify them. Note that for all memory-related options, the size is specified in terms of memory pages and not bytes. On Linux, a memory page is typically 4096 bytes.

Since processes in BDMPI can be suspended, and as such are not available to receive data sent to them, the master processes buffer data at the message's destination node. The -im parameter controls whether a message is buffered in DRAM or on disk. Messages whose size is smaller than the value specified by -im are buffered in memory; the rest are buffered on disk. The only exceptions to this rule are the broadcast and reduction operations, for which the buffering is always done in memory. Also note that in the case of collective communication operations, the size of the message is defined to be the amount of data that any processor will need to send to or receive from any other processor.
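
As an illustrative calculation, with the default -im 4 and an assumed 4096-byte page size, the in-memory threshold is 4*4096 = 16KB, so point-to-point messages larger than that are buffered on disk. To keep messages of up to, say, 1MB in memory, the threshold would have to be raised to 256 pages:

bdmprun -ns 4 -im 256 build/Linux-x86_64/test/helloworld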

As discussed in Efficient loading & saving of a process's address space, BDMPI uses its own dynamic memory allocation library that bypasses the system's swap file and preserves the data and virtual memory mappings throughout the execution of the program. The -sb parameter controls the size of the allocations that will be handled by the sbmalloc subsystem. Any allocation that is smaller than the value specified by -sb is handled by the standard malloc() library, whereas the remaining allocations are handled by sbmalloc. To disable sbmalloc entirely, specify 0 as the value for this option.
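
The fragment below illustrates the effect of the default setting (an assumed 4096-byte page size gives a 32*4096 = 128KB threshold); note that the application calls malloc() as usual and the routing to sbmalloc happens transparently:

#include <stdlib.h>

/* Sketch: with the default -sb 32 and 4096-byte pages, allocations smaller
   than 128KB are served by the standard malloc() library, while larger ones
   are handled by the sbmalloc subsystem. The calling code is unchanged. */
void allocate_buffers(void)
{
  int    *small = malloc(1024*sizeof(int));        /* 4KB: standard malloc() */
  double *large = malloc(1048576*sizeof(double));  /* 8MB: handled by sbmalloc */

  /* ... use the buffers ... */

  free(large);
  free(small);
}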

The -sm option specifies the amount of shared memory to allocate for each slave process. This memory is used to facilitate fast communication between the master and the slave processes via memcpy(). The -mm option specifies the size of the buffer that a master process will allocate for inter-node communication. Note that if the data that needs to be communicated between the master and the slaves or other masters exceeds what is specified by these options, the transfer is done in multiple steps. Thus, the values of these options do not affect the correctness of the execution, but if they are relatively small, they may lead to (somewhat) higher communication cost.

The remaining options

The last two options, -dl and -h, are used to turn on various reporting messages and to display bdmprun's help page, respectively. Setting -dl to anything other than 0 can potentially generate a lot of reporting messages and is not encouraged; it is there primarily for debugging BDMPI during its development.