MPI: blocking vs non-blocking
I am having trouble understanding the concept of blocking communication and non-blocking communication in MPI. What are the differences between the two? What are the advantages and disadvantages?
Solution 1:
Blocking communication is done using MPI_Send()
and MPI_Recv()
. These functions do not return (i.e., they block) until the communication is finished. Simplifying somewhat, this means that the buffer passed to MPI_Send()
can be reused, either because MPI saved it somewhere, or because it has been received by the destination. Similarly, MPI_Recv()
returns when the receive buffer has been filled with valid data.
In contrast, non-blocking communication is done using MPI_Isend()
and MPI_Irecv()
. These function return immediately (i.e., they do not block) even if the communication is not finished yet. You must call MPI_Wait()
or MPI_Test()
to see whether the communication has finished.
Blocking communication is used when it is sufficient, since it is somewhat easier to use. Non-blocking communication is used when necessary, for example, you may call MPI_Isend()
, do some computations, then do MPI_Wait()
. This allows computations and communication to overlap, which generally leads to improved performance.
Note that collective communication (e.g., all-reduce) is only available in its blocking version up to MPIv2. IIRC, MPIv3 introduces non-blocking collective communication.
A quick overview of MPI's send modes can be seen here.
Solution 2:
This post, although is a bit old, but I contend the accepted answer. the statement " These functions don't return until the communication is finished" is a little misguiding because blocking communications doesn't guarantee any handshake b/w the send and receive operations.
First one needs to know, send has four modes of communication : Standard, Buffered, Synchronous and Ready and each of these can be blocking and non-blocking
Unlike in send, receive has only one mode and can be blocking or non-blocking .
Before proceeding further, one must also be clear that I explicitly mention which one is MPI_Send\Recv buffer and which one is system buffer( which is a local buffer in each processor owned by the MPI Library used to move data around among ranks of a communication group)
BLOCKING COMMUNICATION : Blocking doesn't mean that the message was delivered to the receiver/destination. It simply means that the (send or receive) buffer is available for reuse. To reuse the buffer, it's sufficient to copy the information to another memory area, i.e the library can copy the buffer data to own memory location in the library and then, say for e.g, MPI_Send can return.
MPI standard makes it very clear to decouple the message buffering from send and receive operations. A blocking send can complete as soon as the message was buffered, even though no matching receive has been posted. But in some cases message buffering can be expensive and hence direct copying from send buffer to receive buffer might be efficient. Hence MPI Standard provides four different send modes to give the user some freedom in selecting the appropriate send mode for her application. Lets take a look at what happens in each mode of communication :
1. Standard Mode
In the standard mode, it is up to the MPI Library, whether or not to buffer the outgoing message. In the case where the library decides to buffer the outgoing message, the send can complete even before the matching receive has been invoked. In the case where the library decides not to buffer (for performance reasons, or due to unavailability of buffer space), the send will not return until a matching receive has been posted and the data in send buffer has been moved to the receive buffer.
Thus MPI_Send in standard mode is non-local in the sense that send in standard mode can be started whether or not a matching receive has been posted and its successful completion may depend on the occurrence of a matching receive ( due to the fact it is implementation dependent if the message will be buffered or not) .
The syntax for standard send is below :
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm)
2. Buffered Mode
Like in the standard mode, the send in buffered mode can be started irrespective of the fact that a matching receive has been posted and the send may complete before a matching receive has been posted. However the main difference arises out of the fact that if the send is stared and no matching receive is posted the outgoing message must be buffered. Note if the matching receive is posted the buffered send can happily rendezvous with the processor that started the receive, but in case there is no receive, the send in buffered mode has to buffer the outgoing message to allow the send to complete. In its entirety, a buffered send is local. Buffer allocation in this case is user defined and in the event of insufficient buffer space, an error occurs.
Syntax for buffer send :
int MPI_Bsend(const void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm)
3. Synchronous Mode
In synchronous send mode, send can be started whether or not a matching receive was posted. However the send will complete successfully only if a matching receive was posted and the receiver has started to receive the message sent by synchronous send. The completion of synchronous send not only indicates that the buffer in the send can be reused, but also the fact that receiving process has started to receive the data. If both send and receive are blocking then the communication does not complete at either end before the communicating processor rendezvous.
Syntax for synchronous send :
int MPI_Ssend(const void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm)
4. Ready Mode
Unlike the previous three mode, a send in ready mode can be started only if the matching receive has already been posted. Completion of the send doesn't indicate anything about the matching receive and merely tells that the send buffer can be reused. A send that uses ready mode has the same semantics as standard mode or a synchronous mode with the additional information about a matching receive. A correct program with a ready mode of communication can be replaced with synchronous send or a standard send with no effect to the outcome apart from performance difference.
Syntax for ready send :
int MPI_Rsend(const void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm)
Having gone through all the 4 blocking-send, they might seem in principal different but depending on implementation the semantics of one mode may be similar to another.
For example MPI_Send in general is a blocking mode but depending on implementation, if the message size is not too big, MPI_Send will copy the outgoing message from send buffer to system buffer ('which mostly is the case in modern system) and return immediately. Lets look at an example below :
//assume there are 4 processors numbered from 0 to 3
if(rank==0){
tag=2;
MPI_Send(&send_buff1, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
MPI_Send(&send_buff2, 1, MPI_DOUBLE, 2, tag, MPI_COMM_WORLD);
MPI_Recv(&recv_buff1, MPI_FLOAT, 3, 5, MPI_COMM_WORLD);
MPI_Recv(&recv_buff2, MPI_INT, 1, 10, MPI_COMM_WORLD);
}
else if(rank==1){
tag = 10;
//receive statement missing, nothing received from proc 0
MPI_Send(&send_buff3, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
MPI_Send(&send_buff3, 1, MPI_INT, 3, tag, MPI_COMM_WORLD);
}
else if(rank==2){
MPI_Recv(&recv_buff, 1, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
//do something with receive buffer
}
else{ //if rank == 3
MPI_Send(send_buff, 1, MPI_FLOAT, 0, 5, MPI_COMM_WORLD);
MPI_Recv(recv_buff, 1, MPI_INT, 1, 10, MPI_COMM_WORLD);
}
Lets look at what is happening at each rank in the above example
Rank 0 is trying to send to rank 1 and rank 2, and receive from rank 1 andd 3.
Rank 1 is trying to send to rank 0 and rank 3 and not receive anything from any other ranks
Rank 2 is trying to receive from rank 0 and later do some operation with the data received in the recv_buff.
Rank 3 is trying to send to rank 0 and receive from rank 1
Where beginners get confused is that rank 0 is sending to rank 1 but rank 1 hasn't started any receive operation hence the communication should block or stall and the second send statement in rank 0 should not be executed at all (and this is what MPI documentation stress that it is implementation defined whether or not the outgoing message will be buffered or not). In most of the modern system, such messages of small sizes (here size is 1) will easily be buffered and MPI_Send will return and execute its next MPI_Send statement. Hence in above example, even if the receive in rank 1 is not started, 1st MPI_Send in rank 0 will return and it will execute its next statement.
In a hypothetical situation where rank 3 starts execution before rank 0, it will copy the outgoing message in the first send statement from the send buffer to a system buffer (in a modern system ;) ) and then start executing its receive statement. As soon as rank 0 finishes its two send statements and begins executing its receive statement, the data buffered in system by rank 3 is copied in the receive buffer in rank 0.
In case there's a receive operation started in a processor and no matching send is posted, the process will block until the receive buffer is filled with the data it is expecting. In this situation an computation or other MPI communication will be blocked/halted unless MPI_Recv has returned.
Having understood the buffering phenomena, one should return and think more about MPI_Ssend which has the true semantics of a blocking communication. Even if MPI_Ssend copies the outgoing message from send buffer to a system buffer (which again is implementation defined), one must note MPI_Ssend will not return unless some acknowledge (in low level format) from the receiving process has been received by the sending processor.
Fortunately MPI decided to keep things easer for the users in terms of receive and there is only one receive in Blocking communication : MPI_Recv, and can be used with any of the four send modes described above. For MPI_Recv, blocking means that receive returns only after it contains the data in its buffer. This implies that receive can complete only after a matching send has started but doesn't imply whether or not it can complete before the matching send completes.
What happens during such blocking calls is that the computations are halted until the blocked buffer is freed. This usually leads to wastage of computational resources as Send/Recv is usually copying data from one memory location to another memory location, while the registers in cpu remain idle.
NON-BLOCKING COMMUNICATION : For Non-Blocking Communication, the application creates a request for communication for send and / or receive and gets back a handle and then terminates. That's all that is needed to guarantee that the process is executed. I.e the MPI library is notified that the operation has to be executed.
For the sender side, this allows overlapping computation with communication.
For the receiver side, this allows overlapping a part of the communication overhead , i.e copying the message directly into the address space of the receiving side in the application.
Solution 3:
In using blocking communication you must be care about send and receive calls for example look at this code
if(rank==0)
{
MPI_Send(x to process 1)
MPI_Recv(y from process 1)
}
if(rank==1)
{
MPI_Send(y to process 0);
MPI_Recv(x from process 0);
}
What happens in this case?
- Process 0 sends x to process 1 and blocks until process 1 receives x.
- Process 1 sends y to process 0 and blocks until process 0 receives y, but
- process 0 is blocked such that process 1 blocks for infinity until the two processes are killed.