Parallel MPI_File_open fails on NFSv4 but works on NFSv3

When using NFSv4, my client reported that their MPI programs sometimes fail with "cannot open file" or "file not found" errors.

I compiled a sample MPI-IO program and confirmed that if MPI processes on the compute nodes try to access the same file shared over NFS, the program fails. After some inspection, it turned out that changing the NFS mount from v4.1 to v3 eliminated the problem.

I'd still like to use NFSv4 because of its safety and potential speed benefits, so I'd like to know which mount options I should add to make it work.

OS: CentOS 7.6 updated to latest, nfs-utils 1.3.0, kernel 3.10.0-957.12.2

Server export:

/home 10.0.214.0/24(rw,no_subtree_check,no_root_squash)

Client fstab:

ib-orion-io1:/home /home nfs defaults,rdma,port=20049,nodev,nosuid 0 2

NFSv4 client mount:

ib-orion-io1:/home on /home type nfs4 (rw,nosuid,nodev,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,clientaddr=10.0.214.11,local_lock=none,addr=10.0.214.5)

NFSv3 client mount:

ib-orion-io1:/home on /home type nfs (rw,nosuid,nodev,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=10.0.214.5,mountvers=3,mountproto=tcp,local_lock=none,addr=10.0.214.5)

Error shown on the NFSv4 client:

Testing simple MPIO program with 112 processes accessing file tttestfile
    (Filename can be specified via program argument)
Proc 0: hostname=node001
Proc 0: MPI_File_open failed (Other I/O error , error stack:
ADIO_OPEN(219): open failed on a remote node)
Proc 66: MPI_File_open failed (File does not exist, error stack:
ADIOI_UFS_OPEN(39): File tttestfile does not exist)
Proc 1: MPI_File_open failed (Other I/O error , error stack:
ADIO_OPEN(219): open failed on a remote node)
Proc 84: MPI_File_open failed (File does not exist, error stack:
ADIOI_UFS_OPEN(39): File tttestfile does not exist)

The sample parallel MPI file I/O program is taken from HDF5.

See "==> Sample_mpio.c <==" paragraph in https://support.hdfgroup.org/ftp/HDF5/current/src/unpacked/release_docs/INSTALL_parallel


I found out that this is because NFSv4 defaults to "ac" (attribute caching). When MPI rank 0 created the file, the other processes tried to open it a few milliseconds later; the NFS client answered them from its cached attributes, and that is where "file not found" came from.

After adding the "noac" mount option, things went smoothly again.
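
For reference, the only thing that changed on the client was the mount option; with the fstab entry above, that means something like this (noac disables attribute caching, at the cost of extra attribute revalidation traffic):

ib-orion-io1:/home /home nfs defaults,rdma,port=20049,nodev,nosuid,noac 0 2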

Edit: writing still turned out to produce errors. I will try NFSv3 in the coming days. Example code:

#include <mpi.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char hostname[16];
    char readhost[16];
    int  mpi_size, mpi_rank;
    MPI_File fh;
    char *filename = "./mpirw.data";

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    gethostname(hostname, 16);

    /* Rank 0 creates the file first; the other ranks open it afterwards. */
    if (mpi_rank == 0)
    {
        MPI_File_open(MPI_COMM_SELF, filename,
                MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
        printf("%d@%s created file\n", mpi_rank, hostname);
        MPI_File_close(&fh);
    }

    MPI_Barrier(MPI_COMM_WORLD);

    /* Note: the default error handler for files is MPI_ERRORS_RETURN,
     * so a failed open here is silent unless the return code is checked. */
    MPI_File_open(MPI_COMM_WORLD, filename,
            MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    printf("%d@%s opened file\n", mpi_rank, hostname);

    /* Each rank writes its hostname into its own 16-byte slot. */
    MPI_Status status;
    int count = strlen(hostname);
    MPI_File_write_at(fh, mpi_rank * 16,
            hostname, count + 1, MPI_CHAR, &status);
    printf("%d@%s wrote OK\n", mpi_rank, hostname);
    MPI_Barrier(MPI_COMM_WORLD);

    if (mpi_rank == 0)
        MPI_File_write_at(fh, mpi_size * 16, "\n", 1, MPI_CHAR, &status);

    /* Read back the slot this rank just wrote and compare. */
    MPI_File_read_at(fh, mpi_rank * 16,
            readhost, count + 1, MPI_CHAR, &status);
    if (strcmp(hostname, readhost) != 0)
        printf("%d@%s read ERROR, got %s\n", mpi_rank, hostname, readhost);
    else
        printf("%d@%s read OK, got %s\n", mpi_rank, hostname, readhost);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

Although the program may report "read OK", hexdumping the output file shows that the data is truncated.
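
One thing I have not verified yet is whether the sync/barrier/sync sequence that the MPI standard prescribes for making one rank's writes visible to other ranks helps here. The fragment below (an assumption, not a tested fix) shows where it would slot into the example above, between the write and the read:

    /* Sketch only: flush and resynchronize the file between the write and
     * the read, so data written by one rank is visible to the others. */
    MPI_File_write_at(fh, mpi_rank * 16,
            hostname, count + 1, MPI_CHAR, &status);
    MPI_File_sync(fh);              /* flush this rank's writes */
    MPI_Barrier(MPI_COMM_WORLD);    /* wait until every rank has flushed */
    MPI_File_sync(fh);              /* refresh this rank's view of the file */
    MPI_File_read_at(fh, mpi_rank * 16,
            readhost, count + 1, MPI_CHAR, &status);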