NVML and Variorum based energy measurement tool for Kokkos #295

ethan-puyaubreau · 2025-06-17T19:16:59Z

Hello,

I am submitting this draft pull request to share my progress on two separate energy profiling tools for Kokkos-based applications. Both tools leverage Kokkos profiling hooks (kokkosp_begin/end_parallel_for, kokkosp_begin/end_parallel_reduce, kokkosp_begin/end_parallel_scan) to accurately record kernel start and end times, and at application finalization they produce structured outputs combining power readings with kernel durations.
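As an illustration of the hook mechanism described above, here is a minimal sketch of a tool that timestamps `parallel_for` kernels. The two callback signatures are the standard Kokkos Tools ones; the `KernelRecord` type and the `g_open`/`g_done` containers are hypothetical names chosen for this sketch and are not taken from the PR. It is not thread-safe and only covers `parallel_for`.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical record type; the PR's actual data model may differ.
struct KernelRecord {
  std::string name;
  std::chrono::steady_clock::time_point start;
  std::chrono::steady_clock::time_point end;
};

namespace {
uint64_t g_next_id = 0;                              // next kernel id to hand out
std::unordered_map<uint64_t, KernelRecord> g_open;   // kernels still running
std::vector<KernelRecord> g_done;                    // completed kernels
}  // namespace

// Called by Kokkos right before a parallel_for is dispatched.
extern "C" void kokkosp_begin_parallel_for(const char* name,
                                           const uint32_t /*devID*/,
                                           uint64_t* kID) {
  assert(kID != nullptr);
  *kID = g_next_id++;
  g_open[*kID] = KernelRecord{name, std::chrono::steady_clock::now(), {}};
}

// Called by Kokkos once the matching parallel_for has finished.
extern "C" void kokkosp_end_parallel_for(const uint64_t kID) {
  auto it = g_open.find(kID);
  if (it == g_open.end()) return;  // unknown id, nothing to close
  it->second.end = std::chrono::steady_clock::now();
  g_done.push_back(std::move(it->second));
  g_open.erase(it);
}
```

The same pattern would apply to the `parallel_reduce` and `parallel_scan` hook pairs.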

The first tool uses the Variorum-Kokkos connector (profiling/variorum-connector), continuously sampling GPU power via Variorum at a 20 ms interval. This interval was chosen empirically: it is the shortest update period I observed from the NVIDIA driver (the exact AMD refresh rate is still to be verified), and polling faster than that only returns stale power data. A 20 ms cadence therefore stays consistent with hardware and software constraints.
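A fixed-cadence sampling loop of this kind could be sketched as follows. This is an assumption-laden illustration rather than the PR's daemon: `PowerSampler` and `PowerSample` are hypothetical names, and `read_power` is a placeholder callback standing in for the actual Variorum or NVML query.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

struct PowerSample {
  std::chrono::steady_clock::time_point when;
  double watts;
};

// Hypothetical sampler: `read_power` stands in for a Variorum or NVML query.
class PowerSampler {
 public:
  PowerSampler(std::function<double()> read_power,
               std::chrono::milliseconds interval)
      : read_power_(std::move(read_power)), interval_(interval) {
    assert(interval_.count() > 0);
  }
  ~PowerSampler() { stop(); }

  void start() {
    if (running_.exchange(true)) return;  // already running
    worker_ = std::thread([this] { run(); });
  }

  void stop() {
    running_ = false;
    if (worker_.joinable()) worker_.join();
  }

  // Returns a copy of the samples collected so far.
  std::vector<PowerSample> samples() {
    std::lock_guard<std::mutex> lock(mutex_);
    return samples_;
  }

 private:
  void run() {
    auto next = std::chrono::steady_clock::now();
    while (running_) {
      const double watts = read_power_();
      {
        std::lock_guard<std::mutex> lock(mutex_);
        samples_.push_back({std::chrono::steady_clock::now(), watts});
      }
      next += interval_;  // absolute deadlines, so sleep jitter does not accumulate
      std::this_thread::sleep_until(next);
    }
  }

  std::function<double()> read_power_;
  std::chrono::milliseconds interval_;
  std::atomic<bool> running_{false};
  std::thread worker_;
  std::mutex mutex_;
  std::vector<PowerSample> samples_;
};
```

Sleeping until an absolute deadline, rather than for a relative duration, keeps the cadence close to the requested interval even when a query takes a variable amount of time.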

The second tool relies on NVIDIA’s Management Library (NVML) API to query power draw directly. By avoiding the JSON parsing step required by Variorum, it achieves a simpler integration in NVIDIA-only environments, albeit at the cost of portability.

To date, both tools focus exclusively on GPU energy consumption. CPU measurement (e.g., via RAPL) is not yet supported due to permission and compatibility challenges. The 20 ms sampling granularity is another limitation: kernels shorter than 20 ms can be missed, especially when they occur in rapid succession. I am investigating estimation techniques based on FLOPS/Watt models, which correlate computational work to energy use via a GPU consumption matrix, to (try to) fill this gap.

GPUs also exhibit power-state transition behaviors depending on compute-bound versus memory-bound phases, introducing latency between power levels. I am still characterizing workloads with benchmarks to estimate these transition latencies. Comparing measured power profiles with these estimates will highlight any transition overhead as an "energy delta", i.e. energy that is "wasted" in some way.

Here are some early graphs generated from data collected by the Variorum connector:

I also use Grafana dashboards for interactive visualization (example below), along with Python/Matplotlib scripts and Perfetto for post-processing. All tools consume a unified output format to ensure portability:

In a bytes-flops benchmark designed to analyze GPU power transition latency, this plot illustrates how the GPU moves between power levels:

All benchmarks I currently use, along with some post-processing scripts, can be found at: https://github.com/ethan-puyaubreau/kokkos-energy-benchmarks

This project remains a preliminary experiment. I welcome feedback on extending CPU support, improving sub-interval energy estimation, handling multi-GPU configurations, and refining power-transition models.

ethan-puyaubreau · 2025-06-19T15:37:18Z

This work would be combined with #296, as this tool can exist as a daemon (measuring the GPU or any other component at a specified interval) alongside a refactored version of the kernel timer that would expose a unified time measurement interface.

ethan-puyaubreau · 2025-06-23T21:04:21Z

Note: Some of the mechanisms used here, such as PowerProfiler::Daemon, are generic enough that they should live outside of PowerProfiler. PowerProfiler would also be renamed to EnergyProfiler, or another name that emphasizes energy rather than power; ideas are welcome.

ethan-puyaubreau · 2025-06-24T17:27:42Z

After several changes to the data model, the Python script that generated plots from the data is no longer relevant. However, I made a self-contained, Docker-based tool using Grafana and PostgreSQL that allows for easy visualization of profiled data, available here: https://github.com/ethan-puyaubreau/kokkos-energy-dashboard

Here are some screenshots of the current interface:

Please note that Grafana offers a very wide range of visualizations (e.g. https://grafana.com/grafana/dashboards/), so any ideas on what users want to see on the tool's dashboard are welcome. Currently, the Docker tool only needs the Kokkos Tools CSV output files, and no hands-on experience with Docker/Grafana/PostgreSQL is required to access these graphs, which makes it an almost turnkey tool.

Profiling of more advanced programs is still in progress, to find the best ways to visualize profiled data in specific situations (a high number of kernels, etc.). In the meantime, the profiler's architecture is now modular enough to allow new data providers (such as PAPI) to be implemented, which would enable CPU and multi-GPU measurement.

ethan-puyaubreau · 2025-06-25T14:01:50Z

Here are some more WIP screenshots of the interface, using output from the tool to calculate metrics for the user to see (e.g. power wasted outside of kernels or while waiting for a CPU kernel to end):

vlkale · 2025-06-25T16:20:35Z

@ethan-puyaubreau Thanks for putting this together, and these example result screenshots look good. I agree on your note about higher fidelity performance data via PAPI. I would also suggest looking into profiling/logging data from GPU vendor tooling libraries, e.g., nvtx or CUPTI from NVIDIA.

I assume you have discussed this with developers of Variorum, e.g., @tpatki

Some of this may be related to LDMS for HPC Systems Monitoring. @vsurjadidjaja

tpatki · 2025-06-25T18:33:08Z

We hadn't heard of this, but this is cool to visualize on a dashboard! Thanks @vlkale for tagging me.

We'd like to document and link this through Variorum as well once it is ready, that way other users can benefit from it. Which architectures has this been tested on and are there docs for users for it yet? Sorry, I haven't had a chance to look through the PR in detail.

Tagging @slabasan, @kshoga1 and @rountree on this as well.

We also have LDMS and Variorum integration @vsurjadidjaja @vlkale. That can be found here for those who want to use this: https://github.com/ovis-hpc/ldms/tree/main/ldms/src/contrib/sampler/variorum_sampler.

ethan-puyaubreau · 2025-06-25T18:43:46Z

Hi @tpatki, thanks for the feedback! For the first steps, I've been using my own computer with an NVIDIA Ampere GPU, but I'm currently testing several new architectures to implement CPU profiling too. I will write comprehensive user documentation right after making the modifications needed for CPU profiling, though I would be interested to know what you consider sufficient documentation for users to be able to use this (I haven't written much complete documentation before, hence my question).

tpatki · 2025-06-25T18:52:09Z

@ethan-puyaubreau

Great to hear, we can also potentially help test on some of those architectures at our end.
I'll take a detailed look at the PR and make any suggestions if needed as well in the next week or two.

In terms of documentation, it'll be good to document (1) how to build/install the viz component with Grafana/PostgreSQL along with any dependencies needed for installation, (2) the different viewgraphs that are currently supported for the user, and (3) which architectures it has been tested on and is expected to be supported on.

I (and the Variorum team) can help with some of this as well, it'll be good to include the links and documentation in the main Variorum repo here. Maybe we can create another page (rst file) for the Kokkos connector along with your tool under Integrations when it is ready.

One question I had was on interactive visualizations: does Grafana support that? I haven't used it much, hence the question. It will be cool to be able to zoom-in/zoom-out in timeline graphs, and maybe select per-component viz (I think you're only doing GPU energy at the moment, but we can easily extend to show CPU and Mem as well), and generate some other summary stats (you already have some in your viz, we can extend these). Happy to help brainstorm and also work on this as things progress.

ethan-puyaubreau · 2025-06-25T19:01:55Z

I would definitely appreciate the help with testing on other architectures. Looking forward to your detailed review of the PR and any suggestions you might have.

To make installation relatively straightforward, the entire visualization stack is currently encapsulated in a pre-configured docker-compose stack. It essentially just takes the tool's output files as input. I'll detail this process in the documentation, including any necessary dependencies.

To answer your question about interactive visualizations in Grafana: Yes, absolutely! Afaik Grafana is one of the most robust open source tools for this. You can definitely zoom in/out on timeline graphs, pan across the data, select specific time ranges for detailed inspection and filter data based on various parameters. Here's more information on the platform itself: https://grafana.com/ (don't mind their cloud solution, the whole system can be self-hosted and that's what I'm doing in this case)

For instance, we can easily extend the current GPU energy visualization to allow users to select and view per-component data (CPU, memory, etc.) interactively. Grafana's dashboarding features also make it simple to add and display additional summary statistics right alongside the graphs.

I'm really eager to brainstorm and collaborate on extending these capabilities as things progress. Thanks for offering your help!

ethan-puyaubreau · 2025-07-01T16:42:42Z

Hi! @tpatki I added one tool based on the NVML library (kp-nvml-energy) that leverages the nvmlDeviceGetTotalEnergyConsumption() API call, which gets the value in millijoules directly from the driver. However, the values seem to be completely off (more than 1 kJ for a 2-second calculation, far more than what my 35 W GPU is capable of). Did you run into the same situation when adding metrics to Variorum (especially this specific metric, because I see you didn't use it in Variorum)?

tpatki · 2025-07-01T22:27:41Z

Hi @ethan-puyaubreau
Interesting. In Variorum, we are reporting instantaneous power, so we didn't use that NVML energy API. FWIW, we do have a PR open for the energy API for GPUs that uses the nvmlDeviceGetTotalEnergyConsumption() API. I don't recall seeing the issue that you're seeing with early testing of that PR, but I may not have done a very thorough test comparing it against instantaneous power values.

What are you comparing your result from nvmlDeviceGetTotalEnergyConsumption() API with? Are you looking at data dumped from the Variorum instantaneous power function or with a direct call to nvmlDeviceGetPowerUsage API? I'm curious what test you're running and what the baseline value for your energy readings is.

ethan-puyaubreau · 2025-07-02T12:53:42Z

Hi @tpatki, I compared the energy estimated by integrating nvmlDeviceGetPowerUsage() readings over time with the results from nvmlDeviceGetTotalEnergyConsumption(). The results seem to indicate that integrating the power gives more realistic results for now.
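For reference, integrating sampled power into energy can be sketched with the trapezoidal rule. This is an illustrative assumption about the method, not the PR's code: `TimedPower` and `integrate_energy_joules` are hypothetical names, and timestamps are taken to be in seconds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct TimedPower {
  double t_seconds;  // sample timestamp, in seconds
  double watts;      // instantaneous power at that timestamp
};

// Trapezoidal rule: energy in joules from time-ordered power samples.
inline double integrate_energy_joules(const std::vector<TimedPower>& s) {
  double joules = 0.0;
  for (std::size_t i = 1; i < s.size(); ++i) {
    const double dt = s[i].t_seconds - s[i - 1].t_seconds;
    assert(dt >= 0.0);  // samples must be sorted by time
    joules += 0.5 * (s[i].watts + s[i - 1].watts) * dt;
  }
  return joules;
}
```

As a sanity check, a constant 35 W held for 2 s integrates to 70 J, which is why a reading above 1 kJ for such a run looks suspicious.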

Here is my new test, using the benchmark Kokkos code of this current PR, profiling/energy-profiler/energy-benchmark/energy-benchmark.cpp:

Accessing Variorum Data:

Accessing NVML directly via nvmlDeviceGetPowerUsage():

Both tools estimate the energy at around 6.2 or 6.3 kJ.

As for sampling with nvmlDeviceGetTotalEnergyConsumption():

This estimates the final energy consumed at around 24.4 kJ, nowhere near the other measurements.

For more in depth analysis of the results, I can give you the raw data used for this dashboard (attached to this message)
benchmark_laptop_1.zip

The code ran for only one iteration (so no mean values and no multiple batches). To use it with the current dashboard, the only dependencies are Python and Docker Compose. Extract the .zip file into the input folder of the dashboard tool so that it merges with the placeholder folders, then launch the whole system with setup.sh and shut it down with remove.sh. More information on how this works is given in the dashboard repo:
https://github.com/ethan-puyaubreau/kokkos-energy-dashboard

More data outputs are on the way, as I am currently running this new benchmark on other platforms to check for more measurements/results.

ethan-puyaubreau · 2025-07-02T13:00:30Z

The benchmark's length of around 4 minutes is intentional: it allows each tool to be run in isolation, so that the tools do not degrade each other's performance during testing. This strategy also helps account for the overhead of each tool (e.g. the execution time was 262.261 s with the NVML Energy Profiler and 262.078 s with Variorum). The input files were generated with input/generic_script.sh from kokkos-energy-dashboard, which repeatedly executes the same program with each tool for a set number of iterations.

tpatki · 2025-07-02T22:55:14Z

@ethan-puyaubreau
You're probably doing this correctly, but just as a sanity check:
Can you point me to where in the code you are calculating energy -- are you taking a delta between the two reported values?

Also, unlike power, "sampling" energy won't make sense, rather you'd need the equivalent of start_measurement and end_measurement around the region of interest (e.g. a function) and then take the delta across those two values from the API. The API returns the value in mJ since the driver was last loaded, another thing to check would be any error in the conversion there from mJ --> kJ.

https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g732ab899b5bd18ac4bfb93c02de4900a
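The start/end delta described above could be sketched like this. `EnergyRegion` is a hypothetical helper, and `read_total_mj` is a placeholder callback standing in for a thin wrapper around nvmlDeviceGetTotalEnergyConsumption(); it is injected so the sketch runs without NVML.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical helper: captures the cumulative energy counter (millijoules)
// at construction and reports the delta on request.
class EnergyRegion {
 public:
  explicit EnergyRegion(std::function<uint64_t()> read_total_mj)
      : read_(std::move(read_total_mj)), start_mj_(read_()) {}

  // Energy consumed since construction, converted from mJ to J.
  double elapsed_joules() const {
    const uint64_t now_mj = read_();
    assert(now_mj >= start_mj_);  // the counter is cumulative
    return static_cast<double>(now_mj - start_mj_) / 1000.0;
  }

 private:
  std::function<uint64_t()> read_;  // declared first: start_mj_ depends on it
  uint64_t start_mj_;
};
```

Recording the raw counter value as if it were the energy of a region would instead report everything accumulated since the driver was loaded, which is one plausible source of an inflated total.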

ethan-puyaubreau · 2025-07-02T23:06:36Z

@tpatki Sure!

The relevant code is located here:
https://github.com/ethan-puyaubreau/kokkos-tools/blob/853bb2d7715a9e4039891ae2753d376dba170d40/profiling/energy-profiler/variorum/variorum_energy_profiler.cpp#L312-L350

I am indeed calculating a delta between two reported values for the Variorum based tool, but not for the nvmlDeviceGetTotalEnergyConsumption based tool as I'm extracting the raw value and adding it to the measurements.

Regarding the unit conversion, the Grafana dashboard is indeed set up to interpret the values as millijoules. So as you mentioned, it’s likely that nvmlDeviceGetTotalEnergyConsumption needs to be called at the start and end of each code region. I’ll definitely try that out and see how it behaves.

profiling/energy-profiler/energy-benchmark/CMakeLists.txt

profiling/energy-profiler/energy-benchmark/src/energy_benchmark.cpp

profiling/energy-profiler/nvml/kp_power_nvml.cpp

profiling/energy-profiler/nvml/CMakeLists.txt

JBludau · 2025-08-14T17:42:56Z

profiling/energy-profiler/kokkos/kp_nvml_direct_power.cpp

+void power_monitoring_tick() {
+  if (!g_nvml_provider || !g_nvml_provider->is_initialized()) {
+    return;
+  }


They also used this line of code when it comes to directly getting the NVML_FI_DEV_POWER_INSTANT afaik

JBludau

first pass

JBludau · 2025-08-14T17:43:06Z

profiling/energy-profiler/kokkos/kp_nvml_direct_power.cpp

+
+// --- Configuration ---
+// The interval in milliseconds for power sampling.
+constexpr int SAMPLING_INTERVAL_MS = 20;


we should set this based on the "Part time power measurements ..." paper

JBludau · 2025-08-14T17:43:13Z

profiling/energy-profiler/kokkos/kp_energy_kernel_timer.cpp

+// --- Core Initialization ---
+KernelTimerTool timer;
+
+bool VERBOSE = false;


hmmm ... I think this would rather be a CMake option than something that gets manually changed in the source code. And you could make it a compile time decision

Fixed, that would also allow for verbose to expand to energy measurement tools and not only the timer.
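One way the compile-time variant could look, as a sketch only: the macro name `KOKKOSTOOLS_ENERGY_VERBOSE` and the `log_verbose` helper are hypothetical, and the macro is assumed to be set by the build system (for example from a CMake option through `target_compile_definitions`).

```cpp
#include <cassert>
#include <iostream>
#include <string>

// Hypothetical macro name; assumed to be defined by the build system when
// verbose output is requested.
#ifdef KOKKOSTOOLS_ENERGY_VERBOSE
constexpr bool kVerbose = true;
#else
constexpr bool kVerbose = false;
#endif

// Prints `msg` only in verbose builds; returns whether it was printed.
inline bool log_verbose(const std::string& msg) {
  assert(!msg.empty());  // an empty message points to a caller bug
  if constexpr (kVerbose) {
    std::cerr << "[energy-profiler] " << msg << '\n';
  }
  return kVerbose;
}
```

Because `kVerbose` is a constant, the same flag can be shared by the timer and by the energy measurement tools without any runtime state.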

JBludau · 2025-08-14T17:43:17Z

profiling/energy-profiler/common/timer.hpp

+struct EnergyTimer {
+ public:
+  void start_timing(uint64_t timing_id, RegionType type, std::string name);
+  void end_timing(uint64_t timing_id);
+  std::unordered_map<uint64_t, EnergyTiming>& get_timings();
+
+ private:
+  std::unordered_map<uint64_t, EnergyTiming> timings_;
+};


hmm ... not sure if it is worth having the class and not just two free funcs and the unordered map

Indeed, I can change that to something simpler.
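A sketch of the simpler shape suggested here, with two free functions operating on a plain map. The `Timing` struct and `TimingMap` alias are hypothetical stand-ins for the PR's `EnergyTiming` and `RegionType` types.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the PR's timing record.
struct Timing {
  std::string name;
  std::chrono::steady_clock::time_point start{};
  std::chrono::steady_clock::time_point end{};
  bool finished = false;
};

using TimingMap = std::unordered_map<uint64_t, Timing>;

// Opens a timing entry for `id`.
inline void start_timing(TimingMap& timings, uint64_t id, std::string name) {
  assert(timings.find(id) == timings.end());  // an id must not be started twice
  Timing t;
  t.name = std::move(name);
  t.start = std::chrono::steady_clock::now();
  timings[id] = std::move(t);
}

// Closes the entry for `id`; returns false when `id` was never started.
inline bool end_timing(TimingMap& timings, uint64_t id) {
  auto it = timings.find(id);
  if (it == timings.end()) return false;
  it->second.end = std::chrono::steady_clock::now();
  it->second.finished = true;
  return true;
}
```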

JBludau · 2025-08-14T17:43:20Z

profiling/energy-profiler/common/timer.hpp

+  uint64_t id = 0;
+};
+
+struct EnergyTiming {


Why did you call it EnergyTiming?
I think I would rather have dedicated start and end or start at construction and something like now and reset.

It was a remnant of the hard coded version of the variorum provider system, that can definitely be changed.
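The "start at construction, then now and reset" alternative mentioned above might look like the following sketch. `RegionTimer` is a hypothetical name, not one used in the PR.

```cpp
#include <chrono>

// Hypothetical timer: starts at construction, can be queried and reset.
class RegionTimer {
 public:
  using clock = std::chrono::steady_clock;

  RegionTimer() : start_(clock::now()) {}

  // Restarts the measurement from the current instant.
  void reset() { start_ = clock::now(); }

  // Seconds elapsed since construction or the last reset().
  double elapsed_seconds() const {
    return std::chrono::duration<double>(clock::now() - start_).count();
  }

 private:
  clock::time_point start_;
};
```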

JBludau · 2025-08-14T17:43:25Z

profiling/energy-profiler/common/timer.cpp

+namespace KokkosTools {
+namespace Timer {
+
+void export_kernels_csv(const std::deque<TimingInfo>& timings,


hmm this is a lot of repetition ...

Currently working on refactoring this because indeed there is a lot of unnecessary repetition.

JBludau · 2025-08-14T17:45:24Z

profiling/energy-profiler/kokkos/kp_nvml_direct_power.cpp

+#include <iostream>
+#include <vector>
+#include <string>
+#include <chrono>
+#include <mutex>
+#include <iomanip>
+#include <cmath>
+#include <fstream>
+#include <memory>
+
+#include "kp_core.hpp"
+#include "../common/daemon.hpp"
+#include "../provider/provider_nvml.hpp"
+#include "../common/filename_prefix.hpp"
+#include "../common/timer.hpp"
+#include "../tools/kernel_timer_tool.hpp"


JBludau · 2025-08-14T17:46:06Z

profiling/energy-profiler/kokkos/kp_nvml_energy_consumption.cpp

+#include "../tools/kernel_timer_tool.hpp"
+
+namespace KokkosTools {
+namespace EnergyConsumption {


There seems to be a lot of repetition ... maybe we should focus on the one we ended up using at the end

JBludau · 2025-08-14T17:48:23Z

profiling/energy-profiler/provider/provider_nvml.cpp

+
+  for (size_t i = 0; i < devices_.size(); ++i) {
+    double device_power = get_device_power_usage(i);
+    if (device_power >= 0.0) {


why do we have to check that here? are negative values expected?

Also a remnant of NVML direct power measurement.

JBludau · 2025-08-14T17:49:02Z

profiling/energy-profiler/provider/provider_nvml.cpp

+
+double NVMLProvider::get_device_power_usage(size_t device_index) {
+  if (!initialized_ || device_index >= devices_.size()) {
+    return -1.0;


hmm ... I see ... maybe you could use an optional for this

For the function definition or for the if condition? The if condition can definitely be simplified, the current version is the one sanity check I see pretty much everywhere when NVML is involved.
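Applied to the function's return type, the `std::optional` idea could be sketched as follows. This is an NVML-free illustration under stated assumptions: `PowerProvider` is a hypothetical stand-in for the PR's `NVMLProvider`, and the stored readings replace the real driver query.

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the provider; fixed readings replace the
// driver query so the sketch runs anywhere.
class PowerProvider {
 public:
  explicit PowerProvider(std::vector<double> watts_per_device)
      : watts_(std::move(watts_per_device)), working_(true) {}

  // std::nullopt replaces the -1.0 sentinel for "no reading available".
  std::optional<double> get_device_power_usage(std::size_t device_index) const {
    if (!working_ || device_index >= watts_.size()) return std::nullopt;
    return watts_[device_index];
  }

  // Sums the readings that are available; no sign check is needed.
  double get_total_power_usage() const {
    double total = 0.0;
    for (std::size_t i = 0; i < watts_.size(); ++i) {
      if (auto w = get_device_power_usage(i)) total += *w;
    }
    return total;
  }

 private:
  std::vector<double> watts_;
  bool working_;
};
```

With this shape the caller tests for presence instead of comparing against a negative sentinel, which also addresses the earlier question about whether negative values are expected.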

JBludau · 2025-08-14T17:51:35Z

profiling/energy-profiler/provider/provider_nvml.hpp

+  bool is_initialized() const { return initialized_; }
+
+ private:
+  bool initialized_;


maybe call it "in_working_state" since you assume that when it is false it is either finalized or was never initialized

Updated the name as of now.

ethan-puyaubreau · 2025-08-18T16:10:21Z

Hello, this PR has been subdivided into multiple blocks: #299, #300, #301 and #302.

ethan-puyaubreau changed the title ~~GPU Energy consumption profiler based on variorum connector~~ Energy Consumption profiling tool Jun 20, 2025

dalg24 reviewed Jul 8, 2025

ethan-puyaubreau added 4 commits July 12, 2025 00:34

energy-profiler: add basic structure and documentation

97c884b

energy-profiler: add NVML support for GPU energy monitoring

5931cab

energy-profiler: add variorum support for gpu energy monitoring

7d9565a

energy-profiler: integrate into main build system

48780b0

ethan-puyaubreau changed the title ~~Energy Consumption profiling tool~~ NVML and Variorum based energy measurement tool for Kokkos Jul 12, 2025

ethan-puyaubreau force-pushed the energy-profiler branch from 2fd3ee4 to 48780b0 Compare July 14, 2025 11:43

ethan-puyaubreau added 7 commits July 15, 2025 09:43

Merge branch 'kokkos:develop' into energy-profiler

573ea3b

energy-profiler: add NVML power tool

fbf4f79

energy-profiler: refactor NVML power profiler and update output formats

51e900b

energy-profiler: fix filename generation for output files in finalize…

c55fb56

…_library

energy-profiler: fix kp_energy and improve cmake

032538b

energy-profiler: refactor energy tool

1c31665

energy-profiler: fix warnings

eadef98

energy-profiler: suppress unused variable warnings in kernel_timer_tool

7d46730

JBludau reviewed Aug 14, 2025
