MPI Apps

Note

Parsl support for MPI Apps is being re-engineered. Here we describe best practices with today’s Parsl. Join our Slack if you want to help steer Parsl’s future.

MPI applications run multiple copies of a program that coordinate, via messages passed within or across nodes, to complete a single task. Starting an MPI application requires invoking a “launcher” (e.g., mpiexec) from one node with options that define how the copies of the program should be distributed to the others. Because the launcher must be called from a specific node, running Apps which use MPI requires a special configuration of Parsl Apps and execution environments.

HTEx and MPI Tasks

The HighThroughputExecutor (HTEx) is the default executor available through Parsl. Parsl Apps which invoke MPI code require a dedicated HTEx configured so that every App has access to all of the nodes available within a block and invokes the MPI launcher with the correct settings.

Configuring the Provider

Parsl must be configured to deploy workers on exactly one node per block. This part is simple: instead of defining a launcher which will place a worker pool on each node in the block, simply use the SimpleLauncher.

It is also necessary to specify the desired number of blocks for the executor. Parsl cannot determine the number of blocks needed to run a set of MPI tasks, so they must be set explicitly (see Issue #1647). The easiest route is to set both the max_blocks and min_blocks of the provider to the desired number of blocks.
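
For example, a provider configured this way for a PBSPro system might look like the following sketch. The account name, node count, and walltime are placeholders; adapt them to your system.

from parsl.launchers import SimpleLauncher
from parsl.providers import PBSProProvider

# Sketch: a single block of 8 nodes, started as one worker pool on one node
provider = PBSProProvider(
    account="ACCT",             # placeholder allocation name
    nodes_per_block=8,          # total nodes available to MPI Apps in the block
    min_blocks=1,               # pin the block count explicitly...
    max_blocks=1,               # ...since Parsl cannot infer it for MPI tasks
    walltime="1:00:00",
    launcher=SimpleLauncher(),  # deploy workers on only one node per block
)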

Configuring the Executor

Configure the executor to launch a number of workers equal to the number of MPI tasks per block. First set max_workers to the number of MPI Apps per block. Then set cores_per_worker=1e-6 to prevent HTEx from reducing the number of workers when you request more workers than cores.
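
Concretely, with 16 MPI Apps per block the executor settings might look like the following sketch (the provider is omitted here; see the full example below, and note that the label is arbitrary).

from parsl.executors import HighThroughputExecutor

tasks_per_block = 16  # number of MPI Apps that run concurrently in one block

executor = HighThroughputExecutor(
    label='mpiapps',
    max_workers=tasks_per_block,  # one worker slot per MPI App
    cores_per_worker=1e-6,        # stop HTEx from capping workers at the core count
)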

If you plan to only launch one App per block, you are done!

If not, the executor may need to first partition nodes into distinct groups for each MPI task. Partitioning is necessary if your MPI launcher does not automatically ensure that each MPI task gets exclusive nodes. Systems vary in how the list of available nodes is provided, but it is typically found in a “hostfile” referenced by an environment variable (e.g., PBS_NODEFILE). We recommend splitting the block’s hostfile into one hostfile per worker by adding code like the following to your worker_init:

NODES_PER_TASK=2
mkdir -p /tmp/hostfiles/
split --lines=$NODES_PER_TASK -d --suffix-length=2 $PBS_NODEFILE /tmp/hostfiles/hostfile.
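
With these options, split writes one hostfile per worker (hostfile.00, hostfile.01, …), which each App can later select using its worker rank (see “Writing MPI-Compatible Apps” below).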

Example Configuration

An example configuration for an executor which runs MPI tasks on ALCF’s Polaris supercomputer (HPE Apollo, PBSPro resource manager) is below.

from parsl.addresses import address_by_hostname
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SimpleLauncher
from parsl.providers import PBSProProvider

nodes_per_task = 2
tasks_per_block = 16
config = Config(
    executors=[
        HighThroughputExecutor(
            label='mpiapps',
            address=address_by_hostname(),
            start_method="fork",  # Needed to avoid interactions between MPI and os.fork
            max_workers=tasks_per_block,
            cores_per_worker=1e-6,  # Prevents HTEx from capping the number of workers at the core count
            provider=PBSProProvider(
                account="ACCT",
                worker_init=f"""
# Prepare the computational environment
module swap PrgEnv-nvhpc PrgEnv-gnu
module load conda
module list
conda activate /lus/grand/projects/path/to/env
cd $PBS_O_WORKDIR

# Print the environment details for debugging
hostname
pwd
which python

# Prepare the host files
mkdir -p /tmp/hostfiles/
split --lines={nodes_per_task} -d --suffix-length=2 $PBS_NODEFILE /tmp/hostfiles/hostfile.""",
                walltime="6:00:00",
                launcher=SimpleLauncher(),  # Launches only a single worker pool per block
                select_options="ngpus=4",
                nodes_per_block=nodes_per_task * tasks_per_block,
                min_blocks=0,
                max_blocks=1,
                cpus_per_node=64,
            ),
        ),
    ]
)

Writing MPI-Compatible Apps

The App can be either a Python or Bash App which invokes the MPI application.

In the simplest case (i.e., a single MPI task per block), write the MPI launcher and its options into the string returned by a Bash App, or invoke the launcher as part of a subprocess call from a Python App.

@bash_app
def echo_hello(n: int, stderr='std.err', stdout='std.out'):
    return f'mpiexec -n {n} --ppn 1 hostname'
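
The same launch can be performed from a Python App via a subprocess call, as in the following sketch (error handling omitted; the mpiexec options mirror the Bash App above, and the App name is arbitrary):

from parsl import python_app

@python_app
def echo_hello_py(n: int):
    # Import inside the App body because it executes on a remote worker
    from subprocess import run
    result = run(f'mpiexec -n {n} --ppn 1 hostname',
                 shell=True, capture_output=True, text=True)
    return result.stdout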

Complications arise when running more than one MPI task per block and the MPI launcher does not automatically spread tasks across nodes. In this case, use the PARSL_WORKER_RANK environment variable set by HTEx to select the correct hostfile:

@bash_app
def echo_hello(n: int, stderr='std.err', stdout='std.out'):
    return (f'mpiexec -n {n} --ppn 1 '
            '--hostfile /tmp/hostfiles/local_hostfile.`printf %02d $PARSL_WORKER_RANK` '
            'hostname')

Note

Use these Apps for testing! Submit many tasks using one of these Apps, then ensure that the number of unique nodes in the “std.out” files matches the number of nodes per block.
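
For instance, a quick post-hoc check could count the distinct hostnames across the output files, as in this sketch (it assumes each App wrote its hostnames, one per line, to its own .out file; adjust the glob pattern to match your stdout file names):

import glob

# Count unique node names printed by the test Apps
hosts = set()
for path in glob.glob('*.out'):
    with open(path) as f:
        hosts.update(line.strip() for line in f if line.strip())

print(f'{len(hosts)} unique nodes appeared in the outputs')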

Limitations

Support for MPI tasks in HTEx is limited:

  1. All tasks must use the same number of nodes, which is fixed when creating the executor.

  2. MPI tasks may not span across nodes from more than one block.

  3. Parsl does not correctly determine the number of execution slots per block (Issue #1647).

  4. The executor uses a Python process per task, which can use a lot of memory (Issue #2264).