Collective Communications

  1. Overview
  2. Barrier Synchronization
  3. Data Movement
    1. Broadcast
    2. Gather
    3. Scatter
    4. All to All Scatter/Gather
  4. Global Reduction
    1. Reduce
    2. All Reduce


1.0 Overview

As the name implies, collective communications are those MPI functions that involve all the processes within the defined communicator group. Collective communications are typically built on top of point-to-point communications. Several features distinguish collective communications from point-to-point communications.
  • A collective operation requires that all processes within the communicator group call the same collective communication function with matching arguments.

  • The size of data sent must exactly match the size of data received. In point-to-point communications, a sender buffer may be smaller than the receiver buffer. In collective communications they must be the same.

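  • Except for the explicit synchronization routine (MPI_Barrier), the MPI standard does not require collective communication functions to synchronize the calling processes. An implementation may synchronize them, but a program must not rely on it.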

  • The collective routines described here are blocking. Blocking here means that a process will block until its role in the collective communication is complete, no matter what the completion status is of the other processes participating in the communication. (MPI-3 later added nonblocking variants such as MPI_Ibcast, which are not covered here.)

  • Collective operations do not use the tag field. They are matched according to the order in which they are executed.
Collective communications are divided into three categories according to function:
  1. synchronization
  2. data movement
  3. global reduction operations

2.0 Barrier Synchronization

MPI_Barrier synchronizes all processes in the communicator calling this function. A process calling MPI_Barrier blocks until all other processes in the communicator group have called it. Then they all proceed.
FORTRAN MPI_BARRIER(comm, ierr)
C MPI_Barrier(comm)

3.0 Data Movement

MPI provides several types of routines for handling collective data movement:
  • Broadcast -- from one process to all members of the communicator
  • Gather -- to collect data from all members in the group to one process
  • Scatter -- to scatter data from one member to all members of the group
  • Variations of gather/scatter from all members to all members

3.1 Broadcast

FORTRAN MPI_BCAST(buff,count,datatype,root,comm,ierr)
C MPI_Bcast(buff,count,datatype,root,comm)
The function MPI_Bcast broadcasts a message from the process root to all processes in the communicator group comm, including itself. All calling processes must specify the same root process. When the call returns, the contents of the buffer on the root process have been copied to the buffer of every other process in the group. The figure below shows the behavior of calling MPI_Bcast.

3.2 Gather

FORTRAN MPI_GATHER(sendbuf,sendcount,sendtype,recvbuf, recvcount,recvtype,root,comm,ierr)
C MPI_Gather(sendbuf,sendcount,sendtype,recvbuf, recvcount,recvtype,root,comm)
When MPI_Gather is called by every process in the group, each process sends the contents of its send buffer to the root process, which receives those messages and stores them in its receive buffer in the rank order of the senders. For example, process 0's message goes to segment 0 of the root's receive buffer, and process n's message goes to segment n, which starts at element n*recvcount. Every process, including the root, uses its send buffer, but only the receive buffer of the root is used.

NOTE: The argument recvcount on root indicates the number of elements it is receiving from each process, not the total number of elements the root receives.
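The layout rule is easy to misread, so here is a minimal serial sketch of it in C. It makes no MPI calls: model_gather and its array-of-pointers argument are hypothetical names used only to show where each rank's block lands in the root's receive buffer, assuming int data and the same count on every rank.

```c
#include <assert.h>
#include <string.h>

/* Serial model of MPI_Gather for int data (illustrative only, no MPI calls).
 * send[r] points to the send buffer of rank r, holding `count` ints.
 * root_recv is the root's receive buffer; it must hold nprocs*count ints. */
void model_gather(const int *const *send, int count, int nprocs, int *root_recv)
{
    assert(count >= 0 && nprocs > 0);
    for (int r = 0; r < nprocs; r++) {
        /* Rank r's block is stored at offset r*count, i.e. in rank order. */
        memcpy(root_recv + (size_t)r * count, send[r],
               (size_t)count * sizeof(int));
    }
}
```

In an actual MPI program every rank would instead call MPI_Gather(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT, root, comm), passing count (not nprocs*count) as recvcount.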

A variation of MPI_Gather called MPI_Allgather works the same way except that all the processes in the communicator receive the result, not only the root. MPI_Allgather can be thought of as an MPI_Gather operation followed by MPI_Bcast by the root to all processes.
FORTRAN MPI_ALLGATHER(sendbuf,sendcount,sendtype,recvbuf, recvcount,recvtype,comm,ierr)
C MPI_Allgather(sendbuf,sendcount,sendtype,recvbuf, recvcount,recvtype,comm)

3.3 Scatter

FORTRAN MPI_SCATTER(sendbuf,sendcount,sendtype,recvbuf, recvcount,recvtype,root,comm,ierr)
C MPI_Scatter(sendbuf,sendcount,sendtype,recvbuf, recvcount,recvtype,root,comm)
The function MPI_Scatter performs the reverse operation of the function MPI_Gather described above. (See the image above.) The root process divides its send buffer into n equal segments, where n is the number of processes in the communicator group, and sends segment i to the process of rank i, for i = 0, 1, ..., n-1. The root itself is one of the receivers. Each process receives a segment from the root and places it in its receive buffer.

The receive buffer is used on all processes, but only the send buffer of the root process is used. As in MPI_Gather, the rank of root must be the same in all calls to MPI_Scatter.
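The segment-to-rank mapping can be sketched serially in C as follows. This is a model under stated assumptions rather than MPI code: model_scatter is a hypothetical helper, the data are ints, and each rank's receive buffer is represented by one pointer in the recv array.

```c
#include <assert.h>
#include <string.h>

/* Serial model of MPI_Scatter for int data (illustrative only, no MPI calls).
 * root_send is the root's send buffer with nprocs*count ints.
 * recv[r] points to the receive buffer of rank r, with room for `count` ints. */
void model_scatter(const int *root_send, int count, int nprocs, int *const *recv)
{
    assert(count >= 0 && nprocs > 0);
    for (int r = 0; r < nprocs; r++) {
        /* Segment r of the root's buffer (ranks start at 0) goes to rank r. */
        memcpy(recv[r], root_send + (size_t)r * count,
               (size_t)count * sizeof(int));
    }
}
```

The corresponding call on every rank would be MPI_Scatter(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT, root, comm), where sendcount is the number of elements sent to each process, not the size of the whole send buffer.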

3.4 All to All Scatter/Gather

FORTRAN MPI_ALLTOALL(sendbuf,sendcount,sendtype,recvbuf, recvcount,recvtype,comm,ierr)
C MPI_Alltoall(sendbuf,sendcount,sendtype,recvbuf, recvcount,recvtype,comm)
The function MPI_Alltoall works like MPI_Allgather, except that each process sends distinct data to each of the receivers: block j of the send buffer on process i is stored as block i of the receive buffer on process j. This routine is very helpful for transposing a matrix that is distributed among several processes. The figure below shows the original array on the left and the resulting array on the right when all processes call MPI_Alltoall.
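To make the block exchange concrete, the following serial C sketch models it without any MPI calls. The helper model_alltoall and the flat row-per-rank storage are assumptions made for illustration; with count = 1 and one matrix row per rank, the exchange is exactly a transpose.

```c
#include <assert.h>
#include <string.h>

/* Serial model of MPI_Alltoall for int data (illustrative only, no MPI calls).
 * send and recv each hold nprocs rows of nprocs*count ints;
 * row r stands for the send (or receive) buffer of rank r. */
void model_alltoall(const int *send, int count, int nprocs, int *recv)
{
    assert(count >= 0 && nprocs > 0);
    size_t row = (size_t)nprocs * count;   /* ints per rank buffer */
    for (int src = 0; src < nprocs; src++) {
        for (int dst = 0; dst < nprocs; dst++) {
            /* Block dst of rank src's send buffer becomes
             * block src of rank dst's receive buffer. */
            memcpy(recv + dst * row + (size_t)src * count,
                   send + src * row + (size_t)dst * count,
                   (size_t)count * sizeof(int));
        }
    }
}
```

In MPI itself each rank would call MPI_Alltoall(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT, comm), with count being the number of elements exchanged with each individual process.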

4.0 Global Reduction Operations

The MPI global reduction functions perform a global reduce operation across all members of the communicator group. The MPI standard defines a set of predefined operations that can be used and also provides tools for programmers to define their own reduce operations.

The table below lists the predefined operations that can be used with MPI_Reduce and related reduction functions.

Name of Operation    Meaning
MPI_MAX              maximum
MPI_MIN              minimum
MPI_SUM              sum
MPI_PROD             product
MPI_LAND             logical and
MPI_BAND             bit-wise and
MPI_LOR              logical or
MPI_BOR              bit-wise or
MPI_LXOR             logical xor
MPI_BXOR             bit-wise xor
MPI_MAXLOC           max value and location
MPI_MINLOC           minimum value and location
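MPI_MAXLOC and MPI_MINLOC operate on (value, index) pairs rather than on plain values. The serial C sketch below models MPI_MAXLOC across the ranks of a group. The struct mirrors the pair layout used with the MPI datatype MPI_DOUBLE_INT; model_maxloc is a hypothetical helper, not an MPI function. On ties the smaller index wins, which is the rule the MPI standard specifies for MPI_MAXLOC.

```c
#include <assert.h>

/* (value, index) pair, laid out like the pair type used with MPI_DOUBLE_INT. */
struct val_loc {
    double val;
    int    loc;
};

/* Serial model of a MPI_MAXLOC reduction (illustrative only, no MPI calls).
 * in[r] is the pair contributed by rank r. */
struct val_loc model_maxloc(const struct val_loc *in, int nprocs)
{
    assert(nprocs > 0);
    struct val_loc best = in[0];
    for (int r = 1; r < nprocs; r++) {
        /* Keep the larger value; on equal values keep the smaller index. */
        if (in[r].val > best.val ||
            (in[r].val == best.val && in[r].loc < best.loc)) {
            best = in[r];
        }
    }
    return best;
}
```

A typical use is to put each rank's local maximum in val and the rank number in loc, so that the result tells which process owns the global maximum.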

4.1 Reduce

FORTRAN MPI_REDUCE(sendbuf,recvbuf,count,datatype, op,root,comm,ierr)
C MPI_Reduce(sendbuf,recvbuf,count,datatype, op,root,comm)
When called by each member of the communicator group, MPI_Reduce combines the data in the send buffer of each process using the MPI operation op and stores the resulting reduced data in the receive buffer of the root process. All members must call MPI_Reduce with the same root, op, and count.

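The standard does not fix the order in which the contributions of the members are combined. The operation op is therefore assumed to be associative, and floating-point results may differ slightly from one implementation or process count to another. The send and receive buffers must not overlap; since MPI-2, the root may instead pass MPI_IN_PLACE as sendbuf to reduce in place.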
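The reduction is applied element by element: element i of the result combines element i of every member's send buffer. A minimal serial sketch of this for MPI_SUM on int data follows. It makes no MPI calls, and model_reduce_sum and its array-of-pointers argument are hypothetical names chosen for illustration.

```c
#include <assert.h>

/* Serial model of MPI_Reduce with MPI_SUM on int data
 * (illustrative only, no MPI calls).
 * send[r] points to the send buffer of rank r, holding `count` ints.
 * root_recv is the root's receive buffer, with room for `count` ints. */
void model_reduce_sum(const int *const *send, int count, int nprocs,
                      int *root_recv)
{
    assert(count >= 0 && nprocs > 0);
    for (int i = 0; i < count; i++) {
        int acc = 0;
        /* Element i of the result combines element i from every rank. */
        for (int r = 0; r < nprocs; r++) {
            acc += send[r][i];
        }
        root_recv[i] = acc;
    }
}
```

The matching MPI call on every rank would be MPI_Reduce(sendbuf, recvbuf, count, MPI_INT, MPI_SUM, root, comm); only the root's recvbuf receives the result.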

Example 1 -- Calculating pi
Another version of the program pi.f can be written using the collective communication routines MPI_Bcast and MPI_Reduce. These collective routines replace the point-to-point calls MPI_Send and MPI_Recv.
Example 2 -- Matrix Multiplication
This simple example, matmult.f, uses MPI point-to-point and collective communications to multiply two small matrices.

4.2 All Reduce

FORTRAN MPI_ALLREDUCE(sendbuf,recvbuf,count,datatype, op,comm,ierr)
C MPI_Allreduce(sendbuf,recvbuf,count,datatype, op,comm)

The function MPI_Allreduce is similar to MPI_Reduce except that all members of the communicator group receive the reduced result. Calling MPI_Allreduce is similar to calling MPI_Reduce followed by MPI_Bcast by the root to all members.