When I try to get a timeline while restricting tracing to a subset of ranks, I see a crash if one of the traced ranks is a non-local-master rank on a node other than the one hosting rank 0. For example, with 2 nodes of 12 ranks each:
mpirun -n 24 -ppn 12 iprof -l --traced-rank 0,13 -- ./a.out
I see this crash:
/home/applenco/thapi_devel_clean/build/ici/bin/iprof:93:in `exec': /home/applenco/thapi_devel_clean/build/ici/bin/babeltrace_thapi to_interval --output /home/bertoni/thapi-traces/thapi_interval--c1efe59e-f544-4ccf-b4dd-65e54d945a90/x4310c1s1b0n0 --backends -- /tmp/thapi--c1efe59e-f544-4ccf-b4dd-65e54d945a90/x4310c1s1b0n0 failed (RuntimeError)
from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:698:in `bt_analysis'
from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:967:in `block in all_trace_and_processing'
from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:457:in `open'
from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:953:in `all_trace_and_processing'
from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:1108:in `<main>'
x4310c1s1b0n0.hsn.cm.aurora.alcf.anl.gov: rank 4 exited with code 1
x4310c1s1b0n0.hsn.cm.aurora.alcf.anl.gov: rank 6 died from signal 15
This is reproducible for me on Sunspot and Aurora with 2 nodes.
A few things to note:
- Using other ranks on the same node as rank 0 works.
- Using the local master of the other node also works. That is,
mpirun -n 24 -ppn 12 iprof -l --traced-rank 0,12 ...
works fine, but --traced-rank 0,13 does not. So maybe something is defined for local masters but not for all ranks?
- From the error message we can see that the argument to --backends is missing in the babeltrace_thapi invocation, so potentially backends is not set to anything for the non-local-master ranks. I tried just setting it at this line (THAPI/xprof/xprof.rb.in, line 693 in 0940d81):
opts << "--backends #{backends.join(',')}"
This change did avoid the crash, but the timeline was missing the second rank, so the fix is more complicated.
- I confirm that the timeline also works fine when --traced-rank is not used.
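To illustrate the symptom in the third note, here is a minimal Ruby sketch of how an empty list produces a bare --backends flag. It is not THAPI code: naive_backend_opt and guarded_backend_opts are hypothetical helpers, and "other" is a placeholder backend name. The guard only shows how the malformed command line could be avoided. As noted above, that alone does not restore the missing rank in the timeline.

```ruby
# Hypothetical helpers for illustration only; not taken from xprof.rb.in.

# Mirrors the shape of the line quoted above: with an empty list the
# interpolation yields "--backends " with no argument after the flag.
def naive_backend_opt(backends)
  "--backends #{backends.join(',')}"
end

# Assumed guard: emit the flag only when there is something to pass.
def guarded_backend_opts(backends)
  backends.empty? ? [] : ["--backends #{backends.join(',')}"]
end

puts naive_backend_opt([]).inspect
puts guarded_backend_opts([]).inspect
puts guarded_backend_opts(%w[mpi other]).inspect
```

Whether dropping the flag is acceptable for babeltrace_thapi is an open question; the real fix presumably has to make the backend list available on every traced rank.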
The following can reproduce it:
> cat flops.cpp
#include "mpi.h"
int main(int argc, char **argv) {
  MPI_Init(NULL, NULL);
  MPI_Finalize();
}
> mpicxx flops.cpp -o flops
> mpiexec --np 8 -ppn 4 /home/applenco/thapi_devel_clean/build/ici/bin/iprof -l -b mpi:3 --traced-ranks 0,5 -- gpu_tile_compact.sh ./flops
/home/applenco/thapi_devel_clean/build/ici/bin/iprof:93:in `exec': /home/applenco/thapi_devel_clean/build/ici/bin/babeltrace_thapi to_interval --output /home/bertoni/thapi-traces/thapi_interval--c1efe59e-f544-4ccf-b4dd-65e54d945a90/x4310c1s1b0n0 --backends -- /tmp/thapi--c1efe59e-f544-4ccf-b4dd-65e54d945a90/x4310c1s1b0n0 failed (RuntimeError)
from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:698:in `bt_analysis'
from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:967:in `block in all_trace_and_processing'
from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:457:in `open'
from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:953:in `all_trace_and_processing'
from /home/applenco/thapi_devel_clean/build/ici/bin/iprof:1108:in `<main>'
x4310c1s1b0n0.hsn.cm.aurora.alcf.anl.gov: rank 4 exited with code 1
x4310c1s1b0n0.hsn.cm.aurora.alcf.anl.gov: rank 6 died from signal 15
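To make the rank layout in the commands above concrete, this small Ruby sketch maps a global rank to its node and local rank. It assumes block placement (ranks 0..ppn-1 on the first node, the next ppn ranks on the second, and so on), which I believe matches these -ppn runs but have not verified. placement and local_master? are hypothetical helpers, not part of iprof.

```ruby
# Assumption: block placement, i.e. consecutive ranks fill one node
# before the next node is used.
def placement(rank, ppn)
  { node: rank / ppn, local_rank: rank % ppn }
end

# A rank is treated as the local master when it is the first rank on its node.
def local_master?(rank, ppn)
  (rank % ppn).zero?
end

# [ppn, traced ranks] for the working case, the failing case, and the reproducer.
[[12, [0, 12]], [12, [0, 13]], [4, [0, 5]]].each do |ppn, ranks|
  ranks.each do |rank|
    loc = placement(rank, ppn)
    puts "ppn=#{ppn} rank=#{rank} node=#{loc[:node]} " \
         "local_rank=#{loc[:local_rank]} local_master=#{local_master?(rank, ppn)}"
  end
end
```

Under that assumption, rank 12 (ppn 12) is the local master of the second node, while rank 13 (ppn 12) and rank 5 (ppn 4) are each local rank 1 on the second node, which is the failing configuration.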