Collective Communications
- Overview
- Barrier Synchronization
- Data Movement
- Broadcast
- Gather
- Scatter
- All to All Scatter/Gather
- Global Reduction
- Reduce
- All Reduce
1.0 Overview
As the name implies, collective communications refers to those MPI functions
involving all the processors within the defined communicator group.
Collective communications are mostly built around point-to-point
communications. Several features distinguish collective communications from
point-to-point communications.
- A collective operation requires that all processes within the
communicator group call the same collective communication function with
matching arguments.
- The size of data sent must exactly match the
size of data received. In
point-to-point communications, a sender buffer may be smaller than the
receiver buffer. In collective communications they must be the
same.
- Except for synchronization routines, MPI collective communication
functions are not synchronizing as set by the MPI standards.
- Collective communications exist in blocking mode only. Blocking here
means that a process will block until its role in the collective
communication is complete, no matter what the
completion status is of the other processes
participating in the communications.
- Collective operations do not use the tag field. They are matched
according to
the order they are executed.
Collective communications are divided into three categories according to
function:
- synchronization
- data movement
- global reduction operations
2.0 Barrier Synchronization
MPI_Barrier synchronizes all processes in the communicator calling this
function. A process calling MPI_Barrier blocks until all other
processes in the communicator group have called it. Then they all
proceed.
| FORTRAN |
MPI_BARRIER(comm, ierr) |
| C |
MPI_Barrier(comm) |
3.0 Data Movement
MPI provides several types of routines for handling collective data
movement:
- Broadcast -- from one processor to all members of the communicator
- Gather -- to collect data from all members in the group to one
processor
- Scatter -- to scatter data from one member to all members of the
group
- Variations of gather/scatter from all members to all members
3.1 Broadcast
| FORTRAN |
MPI_BCAST(buff,count,datatype,root,comm,ierr) |
| C |
MPI_Bcast(buff,count,datatype,root,comm) |
The function MPI_Bcast broadcasts a message from process root to
all other processes in the communicator group comm, including
itself. All calling processes must specify the same root process.
When the call returns, the contents of the buffer on the root
process is copied to the buffer of all other processes in the
group. The figure below shows the behavior of calling MPI_Bcast.

3.2 Gather
| FORTRAN |
MPI_GATHER(sendbuf,sendcount,sendtype,recvbuf,
recvcount,recvtype,root,comm,ierr) |
| C |
MPI_Gather(sendbuf,sendcount,sendtype,recvbuf,
recvcount,recvtype,root,comm) |
When called by every process, each process sends the contents of
its send buffer to the root process, which then receives those
messages and stores them in its receive buffer according to the
rank order of the sender. For example, process 0 goes to location
0 of root's receive buffer, process n's message
goes to location n
of the root receive buffer. The send buffer of all processes is
used to send the messages, but only the receive buffer of the root
is used.
NOTE: The argument recvcount on root indicates the number of
elements it is receiving from each process, not the total number
of elements the root receives.
A variation of MPI_Gather called MPI_Allgather works the same
way except that all the processes in the communicator receive the
result, not only the root. MPI_Allgather can be thought of as an
MPI_Gather operation followed by MPI_Bcast by the root to all
processes.
| FORTRAN |
MPI_ALLGATHER(sendbuf,sendcount,sendtype,recvbuf,
recvcount,recvtype,comm,ierr) |
| C |
MPI_Allgather(sendbuf,sendcount,sendtype,recvbuf,
recvcount,recvtype,comm) |

3.3 Scatter
| FORTRAN |
MPI_SCATTER(sendbuf,sendcount,sendtype,recvbuf,
recvcount,recvtype,root,comm,ierr) |
| C |
MPI_Scatter(sendbuf,sendcount,sendtype,recvbuf,
recvcount,recvtype,root,comm) |
The function MPI_Scatter performs the reverse operation of the
function MPI_Gather described above. (See the image above.)
The root process divides its
send buffer into n equal segments and sends segment 1 to process
of rank 1, segment 2 to process of rank 2 and segment n to process
of rank n in the communicator group. Each process receives a
segment from the root and places it in its receive buffer.
The receive buffer is used on all processes, but only the send
buffer of the root process is used. As in MPI_Gather, the rank of
root must be the same in all calls to MPI_Scatter.
3.4 All to All Scatter/Gather
| FORTRAN |
MPI_ALLTOALL(sendbuf,sendcount,sendtype,recvbuf,
recvcount,recvtype,comm,ierr) |
| C |
MPI_Alltoall(sendbuf,sendcount,sendtype,recvbuf,
recvcount,recvtype,comm) |
The function MPI_Alltoall works like MPI_Allgather, except
that each process sends a distinct data to each of the receivers.
This routine is very helpful for transposing a matrix that is
distributed among several processors. The figure below shows the
original array on the left and the resulting array on the right when
all processors call MPI_Alltoall. This is useful when transposing a
distributed matrix.

4.0 Global Reduction Operations
The MPI global reduction functions perform a global reduce operation
across all members of the communicator group. The MPI standards define
a set of predefined operations that can be used and also provide
tools for programmers to define their own reduce
operations.
The table below lists the predefined operations that can be be used
with MPI_Reduce and related reduction functions.
Name of
Operation |
Meaning |
| MPI_MAX |
maximum |
| MPI_MIN |
minimum |
| MPI_SUM |
sum |
| MPI_PROD |
product |
| MPI_LAND |
logical and |
| MPI_BAND |
bit-wise and |
| MPI_LOR |
logical or |
| MPI_BOR |
bit-wise or |
| MPI_LXOR |
logical xor |
| MPI_BXOR |
bit-wise xor |
| MPI_MAXLOC |
max value and location |
| MPI_MINLOC |
minimum value and location |
4.1 Reduce
| FORTRAN |
MPI_REDUCE(sendbuf,recvbuf,count,datatype,
op,root,comm,ierr) |
| C |
MPI_Reduce(sendbuf,recvbuf,count,datatype,
op,root,comm) |
When called by each member of the communicator group, MPI_Reduce
combines the data in the send buffer of each process using the MPI
operation op and stores the resulting reduced data in the receive
buffer of the root process. All members must call MPI_Reduce with
the same root, op, and count.
The reduction algorithm does not guarantee the order through
which data are reduced from each member. Therefore, the send and
receive buffers must be separate.
Example 1 -- Calculating pi
Another version of the program pi.f
can be written using
the collective communication routines MPI_Bcast and
MPI_reduce. These collective routines replace the
two functions of MPI_Send and MPI_Recv.
Example 2 -- Matrix Multplication
This simple example, matmult.f,
uses MPI point-to-point and
collective communications to multiply two small matrices.
4.2 All Reduce
| FORTRAN |
MPI_ALLREDUCE(sendbuf,recvbuf,count,datatype,
op,comm,ierr) |
| C |
MPI_Allreduce(sendbuf,recvbuf,count,datatype,
op,comm) |
The function MPI_Allreduce is similar to MPI_Reduce except that
all members of the communicator group receive the reduced result.
Calling MPI_Allreduce is similar to calling MPI_Reduce followed by
MPI_Bcast by the root to all members.
|