Hi, I am currently debugging an Open MPI (v4.0) application on an HPC cluster. During MPI_Finalize I get the following error message on each rank:
[fh2n0328:883488:0:883488] rc_mlx5_common.c:1097 Assertion iface->cq[UCT_IB_DIR_TX].cq_ci == uct_ib_mlx5_get_cq_ci(ib_iface->cq[UCT_IB_DIR_TX]) failed
What does this assertion mean? It fails in MPI_Finalize. The following code reproduces the failure:
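#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Barrier(MPI_COMM_WORLD);  /* any call that communicates triggers the failure */
    MPI_Finalize();               /* the assertion fails here */
    return 0;
}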
Note that the failure only occurs if I use more than 2 nodes; I run 20 threads per node. The failure does not occur if I use calls such as MPI_Comm_size and MPI_Comm_rank instead of MPI_Barrier. If I replace MPI_Barrier with another MPI call that requires communication (MPI_Alltoall, MPI_Scatter, …), the assertion fails too.