Energy profiling tools: NVML-based measurement tool #301

ethan-puyaubreau · 2025-08-15T16:00:04Z

This PR adds NVML-based GPU energy monitoring to the Kokkos Energy Profiler:

Features:

- Integrated Energy Profiling: Combines timing measurements with GPU power monitoring in a single profiler library
- NVML Provider: NVMLProvider class for GPU power monitoring with device discovery and power usage APIs
- Background Power Sampling: PowerSampler class using daemon-based background monitoring for continuous power data collection
- Multi-GPU Support: Individual device queries for systems with multiple GPUs

Implementation:

- Main profiler library: kp_energy_profiler (integrates timing + power monitoring)
- NVML provider: nvml_provider.cpp/hpp for GPU power monitoring
- Power sampler: power_sampler.cpp/hpp for background sampling with daemon
- Timing utilities: Enhanced timing infrastructure with power data export
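
To make the background-sampling idea concrete, here is a minimal sketch that uses only the C++ standard library. It is not the PR's PowerSampler or Daemon code: `BackgroundSampler`, `PowerSample`, and the `read_power` callback are hypothetical names chosen for illustration. In a real build the callback would presumably wrap an NVML query such as `nvmlDeviceGetPowerUsage`, which reports milliwatts.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// One timestamped power reading (hypothetical layout, not the PR's type).
struct PowerSample {
  std::chrono::steady_clock::time_point time;
  double watts;
};

// Hypothetical stand-in for a background power sampler: polls `read_power`
// on a worker thread every `period` until stop() is called.
class BackgroundSampler {
 public:
  BackgroundSampler(std::function<double()> read_power,
                    std::chrono::milliseconds period)
      : read_power_(std::move(read_power)), period_(period) {
    assert(period_.count() > 0);
  }

  ~BackgroundSampler() { stop(); }

  void start() {
    if (running_.exchange(true)) return;  // already running
    worker_ = std::thread([this] {
      // Take the first sample immediately, then one per period.
      do {
        const PowerSample s{std::chrono::steady_clock::now(), read_power_()};
        {
          std::lock_guard<std::mutex> lock(mutex_);
          samples_.push_back(s);
        }
        std::this_thread::sleep_for(period_);
      } while (running_.load());
    });
  }

  // Signals the worker to finish and waits for it (at most about one period).
  void stop() {
    running_.store(false);
    if (worker_.joinable()) worker_.join();
  }

  // Returns a copy so callers never touch the buffer the worker writes to.
  std::vector<PowerSample> samples() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return samples_;
  }

 private:
  std::function<double()> read_power_;
  std::chrono::milliseconds period_;
  std::atomic<bool> running_{false};
  std::thread worker_;
  mutable std::mutex mutex_;
  std::vector<PowerSample> samples_;
};
```

Passing the reader as a callback keeps the sampling loop independent of NVML, so the same loop could be exercised with a fake reader on a machine without a GPU.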

Dependencies:

- Requires CUDA 12.6+ with NVML support
- Depends on #300 (Energy profiling tools: Add Daemon class for periodic task execution) and #299 (Energy profiling tools: Core infrastructure with timing tool and export capabilities)

Usage:
The profiler automatically detects available GPUs and begins power monitoring during initialization. Power data is collected in the background and exported alongside timing data in CSV format.

profiling/energy-profiler/nvml/kp_nvml_power.cpp

- Implement NVMLProvider for GPU power monitoring.
- Introduce PowerSampler for managing power sampling.
- Update CMakeLists.txt to include new source files.
- Enhance error logging and handling in timing utilities.
- Add power data export functionality to CSV.
- Integrate power sampling with existing energy profiling features.

romintomasetti · 2025-09-01T07:42:30Z

CMakeLists.txt

+# Check for NVML (required for energy profiler)
+set(MIN_CUDA_VERSION 12.6)
+find_package(CUDAToolkit ${MIN_CUDA_VERSION} QUIET)
+if (CUDAToolkit_FOUND)


Have you seen https://docs.nvidia.com/nsight-systems/UserGuide/index.html#nvml-power-and-temperature-metrics-preview?

It would be great to document this tool better and compare it to what Nsight Systems may provide. Does your tool do anything special to help correlate Kokkos regions with consumption, or trigger anything special? Or is it "just" launching a thread that samples the energy consumption?

Also, what about asynchronicity? I mean:

    {
      Kokkos::Profiling::ScopedRegion region("my region");
      Kokkos::parallel_for(Kokkos::RangePolicy(exec, 0, N), ...);  // async
    }

If the tool reports the consumption of the "my region" region and you are not calling Kokkos::fence, what do the measurements reported by Kokkos Tools mean, especially when the kernel actually runs after the scoped region ends?
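
The concern can be illustrated without Kokkos. The sketch below is only an analogy built on the C++ standard library: `std::async` stands in for an asynchronous kernel launch and `future::wait()` stands in for a fence. The names `launch_async_work` and `time_region` are hypothetical and do not come from Kokkos or from this PR.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;

// Stand-in for an asynchronous kernel: keeps another thread busy for `work`.
inline std::future<void> launch_async_work(std::chrono::milliseconds work) {
  return std::async(std::launch::async,
                    [work] { std::this_thread::sleep_for(work); });
}

// Times a "region" that launches asynchronous work.
// With wait_inside == true the region waits for the work before it ends
// (the analogue of fencing); otherwise the region ends right after launch.
inline std::chrono::duration<double> time_region(
    std::chrono::milliseconds work, bool wait_inside) {
  assert(work.count() >= 0);
  std::future<void> pending;  // outlives the region, so its destructor
                              // does not block at the closing brace
  const auto begin = Clock::now();
  {
    pending = launch_async_work(work);
    if (wait_inside) pending.wait();
  }
  const std::chrono::duration<double> elapsed = Clock::now() - begin;
  pending.wait();  // always let the work finish before returning
  return elapsed;
}
```

Under these assumptions, `time_region(work, false)` mostly measures launch overhead while `time_region(work, true)` covers the work itself. Any power samples attributed to the unfenced region would have the same mismatch.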

Hello! I have seen this page once, but since I was focusing on Variorum support at the time, I haven't had the chance to read it in depth. As of now, the tool only launches a thread that samples the energy consumption.

> Also, what about asynchronicity? I mean:
>
>     {
>       Kokkos::Profiling::ScopedRegion region("my region");
>       Kokkos::parallel_for(Kokkos::RangePolicy(exec, 0, N), ...);  // async
>     }
>
> If the tool reports the consumption of the "my region" region and you are not calling Kokkos::fence, what do the measurements reported by Kokkos Tools mean, especially when the kernel actually runs after the scoped region ends?

For now, the system does nothing special to correlate power consumption with a specific region or kernel. The timing system is meant to help visualize the current situation, so there is room for improvement (for example, adding more correlation using multiple metrics), especially since the daemon system would still allow for method/data propagation.

@romintomasetti we will check whether Nsight actually does something useful. It at least sounds like exactly what we are doing.
About the fencing: NVIDIA currently allows power measurements every 100 ms, but each reading seems to be only the power average over the last 25 ms of that window; see http://arxiv.org/abs/2312.02741. With NVIDIA's tools in their current state, the tool is therefore only useful for measuring entire regions, and even then it needs repetitions with shifts and post-processing to get anything that can be related to an algorithm. Because of these limitations, the tool currently does not require fencing; fencing should be done by the user.
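
As a rough sketch of the simplest post-processing step, the function below estimates the energy of a whole region by integrating sampled power over the region's time window with the trapezoidal rule. The `Sample` struct and the function names are hypothetical and not part of this PR. Given the sampling caveats above, the result is only an estimate, and it is only meaningful for regions that span many sampling periods.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sample layout: time in seconds, power in watts.
struct Sample {
  double t_s;
  double watts;
};

// Power at time t, linearly interpolated between neighbouring samples a and b.
inline double interpolate_power(const Sample& a, const Sample& b, double t) {
  if (b.t_s == a.t_s) return a.watts;
  return a.watts + (b.watts - a.watts) * (t - a.t_s) / (b.t_s - a.t_s);
}

// Trapezoidal estimate of the energy (joules) consumed in [t0, t1].
// `samples` must be sorted by time; parts of the window that lie outside
// the sampled range contribute nothing.
inline double energy_joules(const std::vector<Sample>& samples, double t0,
                            double t1) {
  assert(std::is_sorted(
      samples.begin(), samples.end(),
      [](const Sample& a, const Sample& b) { return a.t_s < b.t_s; }));
  double energy = 0.0;
  for (std::size_t i = 1; i < samples.size(); ++i) {
    // Clip the segment between two samples to the region window.
    const double lo = std::max(samples[i - 1].t_s, t0);
    const double hi = std::min(samples[i].t_s, t1);
    if (hi <= lo) continue;
    const double p_lo = interpolate_power(samples[i - 1], samples[i], lo);
    const double p_hi = interpolate_power(samples[i - 1], samples[i], hi);
    energy += 0.5 * (p_lo + p_hi) * (hi - lo);
  }
  return energy;
}
```

For samples of 100 W at 0 s, 100 W at 0.1 s, and 200 W at 0.2 s, the window from 0 s to 0.2 s gives 100 × 0.1 + 0.5 × (100 + 200) × 0.1 = 25 J.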

Could be good to add the AMD tool to the mix:

https://github.com/ROCm/rocprofiler-systems?tab=readme-ov-file#gpu-metrics

on our ToDo list :-)

@JBludau Do you still have plans to look at the AMD Profiling metrics, as mentioned above? Do you have updates you can share - particularly those that are pertinent to this PR - from your end?

Thanks!

This PR has been split into smaller ones to make it easier to review. Once those are merged, we can think about adding something for AMD.

romintomasetti

I would not merge this PR without an example.

And I would be strongly against merging this one before we get proper CUDA testing, e.g. #271.

ethan-puyaubreau force-pushed the feature/energy-profiler-nvml branch 2 times, most recently from a642c0c to eb99f1f Compare August 18, 2025 13:30

ethan-puyaubreau mentioned this pull request Aug 18, 2025

NVML and Variorum based energy measurement tool for Kokkos #295

Draft

ethan-puyaubreau marked this pull request as ready for review August 19, 2025 18:17

ethan-puyaubreau force-pushed the feature/energy-profiler-nvml branch 2 times, most recently from 6332e5c to 2b6248c Compare August 28, 2025 19:58

JBludau reviewed Aug 28, 2025

ethan-puyaubreau force-pushed the feature/energy-profiler-nvml branch from 2b6248c to 4f8fd2b Compare August 29, 2025 00:34

ethan-puyaubreau mentioned this pull request Aug 29, 2025

Energy profiling tools: Add Daemon class for periodic task execution #300

Open

ethan-puyaubreau force-pushed the feature/energy-profiler-nvml branch 3 times, most recently from 6517006 to 0ed10fc Compare August 29, 2025 15:30

ethan-puyaubreau added 15 commits August 29, 2025 14:27

Add energy profiler module with timing utilities and export functiona…

7ec6d88

…lity

Remove unused chrono include from timing_export.hpp

22429bf

Refactor energy profiler timing functions

f9b227b

clang-format

e30b3fd

Add Daemon class for managing periodic task execution

e0de668

Refactor Daemon::tick

74f435a

Rename variable Daemon::run method

9b52c58

Move files to upper folder

49073bc

Update CMakeLists.txt to enforce minimum CUDA version for NVML support

cac7e07

Enhance energy profiler with NVML support checks in CMake configuration

9fd72b0

Refactor error handling in energy profiler to use std::cerr instead o…

058d9e3

…f log_error function

Refactor logging functions and reduce includes

8c29325

Remove warning for ending region with no active regions in energy pro…

d9a6342

…filer

Format error message for power sampling initialization in energy prof…

9b9e6f5

…iler

ethan-puyaubreau force-pushed the feature/energy-profiler-nvml branch from 0ed10fc to 9b9e6f5 Compare August 29, 2025 18:27

romintomasetti reviewed Sep 1, 2025

romintomasetti suggested changes Sep 1, 2025

maartenarnst mentioned this pull request Sep 1, 2025

Add gpu runners #304

Open
