@@ -1232,26 +1232,6 @@ to your output directory. The following example is run on the host `amd-ryzen`:
    ├── roofline.csv
    └── sysinfo.csv
 
-.. note::
-   For profiling multi-rank workloads with MPI communication, use ``--iteration-multiplexing`` and do not turn on PC sampling with ``-b 21``.
-
-.. warning::
-   MPI launchers (``mpirun``, ``mpiexec``, ``srun``, ``orterun``) must wrap the
-   ``rocprof-compute`` command, not appear after ``--``. The following is **incorrect**:
-
-   .. code-block:: shell-session
-
-      $ rocprof-compute profile --name my_app -- mpirun -n 4 ./my_application # WRONG
-
-   Instead, use the correct form where the MPI launcher wraps ``rocprof-compute``:
-
-   .. code-block:: shell-session
-
-      $ mpirun -n 4 rocprof-compute profile --name my_app -- ./my_application # CORRECT
-
-   If you use an MPI launcher after ``--``, an error will be raised with guidance
-   on the correct usage.
-
 ROCm Compute Profiler supports the following libraries, APIs and job schedulers:
 
 * OpenMPI
@@ -1270,4 +1250,78 @@ specify the output directory as follows:
 
 .. code-block:: shell-session
 
-   $ mpirun -n 4 rocprof-compute profile --output-directory /tmp/mpi_profile/%env{MY_MPI_RANK}% -- ./my_mpi_application
+   $ mpirun -n 4 rocprof-compute profile --output-directory /tmp/mpi_profile/%env{MY_MPI_RANK}% -- ./my_mpi_application
+
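+The rank variable itself comes from the MPI launcher rather than from ROCm
+Compute Profiler. As a sketch, assuming Open MPI (which exports
+``OMPI_COMM_WORLD_RANK`` for each process; Slurm exports ``SLURM_PROCID``
+instead), the placeholder can reference the launcher's variable directly:
+
+.. code-block:: shell-session
+
+   $ # Each rank expands its own rank variable, so output lands in
+   $ # per-rank subdirectories such as /tmp/mpi_profile/0, /tmp/mpi_profile/1, ...
+   $ mpirun -n 4 rocprof-compute profile --output-directory /tmp/mpi_profile/%env{OMPI_COMM_WORLD_RANK}% -- ./my_mpi_application
+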
+Limitations and Recommendations
+-------------------------------
+
+When profiling multi-rank applications, be aware of the following limitations:
+
+**MPI Launcher Placement**
+
+MPI launchers (``mpirun``, ``mpiexec``, ``srun``, ``orterun``) must wrap the
+``rocprof-compute`` command, not appear after ``--``. The following is **incorrect**:
+
+.. code-block:: shell-session
+
+   $ rocprof-compute profile --name my_app -- mpirun -n 4 ./my_application # WRONG
+
+Instead, use the correct form where the MPI launcher wraps ``rocprof-compute``:
+
+.. code-block:: shell-session
+
+   $ mpirun -n 4 rocprof-compute profile --name my_app -- ./my_application # CORRECT
+
+If an MPI launcher appears after ``--``, ``rocprof-compute`` raises an error with
+guidance on the correct usage.
+
+**Application Replay Mode (Default)**
+
+By default, ROCm Compute Profiler uses application replay mode, which runs the
+workload multiple times to collect all requested performance counters. This mode
+fails for MPI applications because each additional run calls ``MPI_Init`` and
+``MPI_Finalize`` again, which the MPI specification does not permit.
+
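+The failure occurs even when the launcher placement is correct. For illustration
+(a sketch of the same command form shown above; the exact error depends on the
+MPI implementation), the default invocation below triggers replay and is
+therefore expected to fail for an MPI workload unless one of the single-pass
+options described next is added:
+
+.. code-block:: shell-session
+
+   $ # Correct launcher placement, but default replay mode: extra counter
+   $ # passes re-run ./my_mpi_app, repeating MPI_Init and MPI_Finalize.
+   $ mpirun -n 4 rocprof-compute profile --name my_mpi_app -- ./my_mpi_app
+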
+**PC Sampling**
+
+PC sampling (block 21) may fail to collect data for multi-rank applications with
+MPI communication due to synchronization requirements.
+
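+For example, a run that explicitly requests the PC sampling block via
+``--block 21`` may fail to collect samples when ranks communicate through MPI,
+so it is best left out of multi-rank runs:
+
+.. code-block:: shell-session
+
+   $ # May fail to collect PC samples for MPI workloads
+   $ mpirun -n 4 rocprof-compute profile --name my_mpi_app --block 21 -- ./my_mpi_app
+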
+**Recommended Single-Pass Modes**
+
+For multi-rank applications with MPI communication, use one of the following
+single-pass profiling modes:
+
+* ``--iteration-multiplexing``: Collects all counters in a single application run
+  by distributing counter collection across kernel dispatches. Recommended for
+  applications that issue a sufficient number of kernel dispatches.
+
+  .. code-block:: shell-session
+
+     $ mpirun -n 4 rocprof-compute profile --name my_mpi_app --iteration-multiplexing -- ./my_mpi_app
+
+* ``--block <N>``: Profiles only the specified metric block(s), reducing the
+  number of counters so that collection fits in a single pass.
+
+  .. code-block:: shell-session
+
+     $ mpirun -n 4 rocprof-compute profile --name my_mpi_app --block 0 -- ./my_mpi_app
+
+* ``--set <name>``: Profiles a predefined counter set that fits in a single pass.
+
+  .. code-block:: shell-session
+
+     $ mpirun -n 4 rocprof-compute profile --name my_mpi_app --set compute_thruput_util -- ./my_mpi_app
+
+**Multi-Node Profiling**
+
+When profiling across multiple nodes, ensure one of the following:
+
+* The output directory is accessible from all nodes (shared filesystem), or
+* Node-specific output directories are used with the ``%hostname%`` placeholder,
+  as in the example below.
+
+.. code-block:: shell-session
+
+   $ mpirun -n 8 --hostfile hosts.txt rocprof-compute profile \
+       --output-directory /shared/profiles/%hostname%/%rank% -- ./my_mpi_app