Crashing when using --traced-rank with non-local-master ranks that are not on the same node as rank 0 #468

@colleeneb

Description

When I collect a timeline while restricting tracing to a subset of ranks, iprof crashes if one of the traced ranks is not a local master and is on a different node from rank 0. For example, on 2 nodes with 12 ranks each, running mpirun -n 24 -ppn 12 iprof -l --traced-rank 0,13 -- ./a.out crashes with:

/home/applenco/thapi_devel_clean/build/ici/bin/iprof:93:in `exec': /home/applenco/thapi_devel_clean/build/ici/bin/babeltrace_thapi to_interval --output /home/bertoni/thapi-traces/thapi_interval--c1efe59e-f544-4ccf-b4dd-65e54d945a90/x4310c1s1b0n0 --backends  -- /tmp/thapi--c1efe59e-f544-4ccf-b4dd-65e54d945a90/x4310c1s1b0n0 failed (RuntimeError)
        from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:698:in `bt_analysis'
        from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:967:in `block in all_trace_and_processing'
	from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:457:in `open'
	from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:953:in `all_trace_and_processing'
	from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:1108:in `<main>'
x4310c1s1b0n0.hsn.cm.aurora.alcf.anl.gov: rank 4 exited with code 1
x4310c1s1b0n0.hsn.cm.aurora.alcf.anl.gov: rank 6 died from signal 15

This is reproducible for me on sunspot and aurora with 2 nodes.
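
To make the rank layouts in this report easier to follow, here is a minimal sketch that maps a global rank to its node and node-local rank. It assumes block placement (ranks 0..ppn-1 on the first node, the next ppn ranks on the second, and so on), which matches rank 12 being the local master of the second node with -ppn 12. The helper rank_placement is hypothetical and not part of iprof.

```ruby
# Hypothetical helper, not part of iprof: maps a global MPI rank to its node
# index and node-local rank, ASSUMING block placement of ranks across nodes.
def rank_placement(rank, ppn)
  node, local_rank = rank.divmod(ppn)
  { node: node, local_rank: local_rank, local_master: local_rank.zero? }
end

# Ranks used in this report: 12 and 13 with -ppn 12, and 5 with -ppn 4.
[[12, 12], [13, 12], [5, 4]].each do |rank, ppn|
  puts "rank #{rank} (ppn #{ppn}): #{rank_placement(rank, ppn)}"
end
```

Under that assumption, rank 13 (ppn 12) and rank 5 (ppn 4) are both on the second node with local rank 1, so neither is a local master.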

A few things to note:

  • Using other ranks on the same node as rank 0 works
  • If I use the local master of the node, it works. That is, mpirun -n 24 -ppn 12 iprof -l --traced-rank 0,12 ... works fine, but --traced-rank 0,13 does not. So maybe something is defined for local masters but not for all ranks?
  • The error message shows that the argument to --backends is missing in the babeltrace_thapi invocation, so potentially backends is not set to anything for the non-local-master ranks. I tried just setting it here:
    opts << "--backends #{backends.join(',')}"
    This change did avoid the crash, but the timeline was missing the second rank, so the fix is more complicated.
  • I also confirmed that the timeline works fine when --traced-rank is not used.
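
The empty --backends argument in the error message is consistent with the quoted opts line being evaluated with an empty list, since join on an empty array returns an empty string. The sketch below only illustrates that behavior; backends_opt and backends_opt_checked are hypothetical stand-ins, and the real iprof code may build its options differently.

```ruby
# Hypothetical stand-in for the option string built by the quoted line;
# the actual iprof implementation may differ.
def backends_opt(backends)
  "--backends #{backends.join(',')}"
end

# Hypothetical guard: fail early with a clear message instead of passing an
# empty argument through to babeltrace_thapi.
def backends_opt_checked(backends)
  raise ArgumentError, 'no backends known for this rank' if backends.empty?

  backends_opt(backends)
end

puts backends_opt(%w[mpi]).inspect # "--backends mpi"
puts backends_opt([]).inspect      # "--backends " (empty argument)
```

A guard like this would only surface the empty list earlier. It would not explain why the list is empty on non-local-master ranks, nor restore the missing rank in the timeline.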

The following can reproduce it:

> cat flops.cpp
#include "mpi.h"
int main(int argc, char **argv) {
  MPI_Init(NULL, NULL);
  MPI_Finalize();
}
> mpicxx flops.cpp -o flops

> mpiexec --np 8 -ppn 4  /home/applenco/thapi_devel_clean/build/ici/bin/iprof -l -b mpi:3 --traced-ranks 0,5 -- gpu_tile_compact.sh ./flops
/home/applenco/thapi_devel_clean/build/ici/bin/iprof:93:in `exec': /home/applenco/thapi_devel_clean/build/ici/bin/babeltrace_thapi to_interval --output /home/bertoni/thapi-traces/thapi_interval--c1efe59e-f544-4ccf-b4dd-65e54d945a90/x4310c1s1b0n0 --backends  -- /tmp/thapi--c1efe59e-f544-4ccf-b4dd-65e54d945a90/x4310c1s1b0n0 failed (RuntimeError)
        from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:698:in `bt_analysis'
        from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:967:in `block in all_trace_and_processing'
	from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:457:in `open'
	from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:953:in `all_trace_and_processing'
	from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:1108:in `<main>'
x4310c1s1b0n0.hsn.cm.aurora.alcf.anl.gov: rank 4 exited with code 1
x4310c1s1b0n0.hsn.cm.aurora.alcf.anl.gov: rank 6 died from signal 15
