Conversation

@ethan-puyaubreau
Contributor

@ethan-puyaubreau ethan-puyaubreau commented Jun 17, 2025

Hello,

I am submitting this draft pull request to share my progress on two separate energy profiling tools for Kokkos-based applications. Both tools leverage Kokkos profiling hooks (kokkosp_begin/end_parallel_for, kokkosp_begin/end_parallel_reduce, kokkosp_begin/end_parallel_scan) to accurately record kernel start and end times, and at application finalization they produce structured outputs combining power readings with kernel durations.
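For readers unfamiliar with the hook mechanism, here is a minimal sketch of how a pair of Kokkos Tools callbacks can record kernel start and end times. The record type, the global map, and the helper `kernel_duration_ns` are hypothetical names for illustration and are not taken from this PR; only the two `kokkosp_*` callback names and signatures follow the Kokkos Tools interface.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical record type for this sketch (not the PR's data model).
struct KernelRecord {
  std::string name;
  std::chrono::steady_clock::time_point start;
  std::chrono::steady_clock::time_point end;
  bool finished = false;
};

// Hypothetical global state: one record per kernel id.
static std::unordered_map<uint64_t, KernelRecord> g_kernels;
static uint64_t g_next_id = 0;

// Called by Kokkos before a parallel_for launches; the tool assigns the id.
extern "C" void kokkosp_begin_parallel_for(const char* name,
                                           const uint32_t /*devID*/,
                                           uint64_t* kID) {
  assert(name != nullptr && kID != nullptr);
  *kID = g_next_id++;
  KernelRecord rec;
  rec.name = name;
  rec.start = std::chrono::steady_clock::now();
  g_kernels[*kID] = rec;
}

// Called by Kokkos after the matching parallel_for has completed.
extern "C" void kokkosp_end_parallel_for(const uint64_t kID) {
  auto it = g_kernels.find(kID);
  if (it == g_kernels.end()) return;  // unknown id: ignore
  it->second.end = std::chrono::steady_clock::now();
  it->second.finished = true;
}

// Duration of a finished kernel in nanoseconds, or -1 if it is unknown
// or still running.
int64_t kernel_duration_ns(uint64_t kID) {
  auto it = g_kernels.find(kID);
  if (it == g_kernels.end() || !it->second.finished) return -1;
  return std::chrono::duration_cast<std::chrono::nanoseconds>(
             it->second.end - it->second.start)
      .count();
}
```

The parallel_reduce and parallel_scan hooks would follow the same pattern. A real tool would also need to guard the shared map against concurrent access.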

The first tool uses the Variorum-Kokkos connector (profiling/variorum-connector) and continuously samples GPU power via Variorum at a 20 ms interval. This interval was chosen empirically: it is the shortest period at which the NVIDIA driver refreshes its power readings (the exact AMD refresh period remains to be verified), so sampling any faster only returns repeated values. A 20 ms cadence therefore stays consistent with hardware and software constraints.
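A fixed-interval sampling loop of this kind can be sketched as follows. This is an illustration under assumptions, not the PR's daemon: `PeriodicSampler` and `PowerSample` are hypothetical names, and `read_power` is a placeholder callback standing in for whatever Variorum or NVML query the tool performs.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sample type: a timestamp and a power reading in watts.
struct PowerSample {
  std::chrono::steady_clock::time_point timestamp;
  double watts;
};

// Hypothetical sampler; read_power stands in for the real power query.
class PeriodicSampler {
 public:
  PeriodicSampler(std::function<double()> read_power,
                  std::chrono::milliseconds interval)
      : read_power_(std::move(read_power)), interval_(interval) {
    assert(interval_.count() > 0);
  }
  ~PeriodicSampler() { stop(); }

  void start() {
    if (running_.exchange(true)) return;  // already running
    worker_ = std::thread([this] {
      auto next = std::chrono::steady_clock::now();
      while (running_.load()) {
        {
          std::lock_guard<std::mutex> lock(mutex_);
          samples_.push_back(
              {std::chrono::steady_clock::now(), read_power_()});
        }
        // Sleep until an absolute deadline so the cost of the read
        // does not accumulate as drift.
        next += interval_;
        std::this_thread::sleep_until(next);
      }
    });
  }

  void stop() {
    running_.store(false);
    if (worker_.joinable()) worker_.join();
  }

  // Returns a copy of the samples collected so far.
  std::vector<PowerSample> samples() {
    std::lock_guard<std::mutex> lock(mutex_);
    return samples_;
  }

 private:
  std::function<double()> read_power_;
  std::chrono::milliseconds interval_;
  std::atomic<bool> running_{false};
  std::thread worker_;
  std::mutex mutex_;
  std::vector<PowerSample> samples_;
};
```

Constructing it with `std::chrono::milliseconds(20)` would reproduce the cadence described above.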

The second tool relies on NVIDIA’s Management Library (NVML) API to query power draw directly. By avoiding the JSON parsing step required by Variorum, it achieves a simpler integration in NVIDIA-only environments, albeit at the cost of portability.

To date, both tools focus exclusively on GPU energy consumption. CPU measurement (e.g., via RAPL) is not yet supported due to permission and compatibility challenges. In addition, the 20 ms sampling granularity may miss short-lived kernels (under 20 ms) when they occur in rapid succession. To try to fill this gap, I am investigating estimation techniques based on FLOPS/Watt models, which relate computational work to energy use via a GPU consumption matrix.

GPUs also exhibit power-state transitions that depend on whether a phase is compute-bound or memory-bound, which introduces latency between power levels. I am still characterizing workloads with benchmarks to estimate these transition latencies. Comparing measured power profiles with these estimates will expose any transition overhead as an "energy delta", that is, energy that is in some sense wasted.

Here are some early graphs generated from data collected by the Variorum connector:

Variorum Power Profile 1

Variorum Power Profile 2

I also use Grafana dashboards for interactive visualization (example below), along with Python/Matplotlib scripts and Perfetto for post-processing. All tools consume a unified output format to ensure portability:

Grafana Dashboard

In a bytes-flops benchmark designed to analyze GPU power transition latency, this plot illustrates how the GPU moves between power levels:

GPU Transition Latency Benchmark

All benchmarks I currently use, along with some post-processing scripts, can be found at: https://github.com/ethan-puyaubreau/kokkos-energy-benchmarks

This project remains a preliminary experiment. I welcome feedback on extending CPU support, improving sub-interval energy estimation, handling multi-GPU configurations, and refining power-transition models.

@ethan-puyaubreau
Contributor Author

This work would be combined with #296, since this tool can run as a daemon (measuring the GPU or any other component at a specified interval) alongside a refactored version of the kernel timer that would expose a unified time measurement interface.

@ethan-puyaubreau changed the title from "GPU Energy consumption profiler based on variorum connector" to "Energy Consumption profiling tool" Jun 20, 2025
@ethan-puyaubreau
Contributor Author

Note: Some of the mechanisms used here, such as PowerProfiler::Daemon, are general enough that they should live outside of PowerProfiler. PowerProfiler itself would be renamed EnergyProfiler, or another name that conveys energy rather than power; ideas are welcome.

@ethan-puyaubreau
Contributor Author

ethan-puyaubreau commented Jun 24, 2025

After several changes to the data model, the Python script that generated plots from the data is no longer relevant. However, I made a self-contained Docker-based tool using Grafana and PostgreSQL that allows for easy visualization of profiled data, available here: https://github.com/ethan-puyaubreau/kokkos-energy-dashboard

Here are some screenshots of the current interface:
image

Please note that Grafana offers a very wide range of visualization options (e.g. https://grafana.com/grafana/dashboards/), so any ideas on what users want to see on the tool's dashboard are welcome. Currently, the Docker tool only needs the Kokkos Tools CSV output files and requires no hands-on experience with Docker, Grafana, or PostgreSQL to access these graphs, which makes it an almost turnkey tool.

Profiling of more advanced programs is still in progress to find the best ways to visualize profiled data in specific situations (large numbers of kernels, etc.). In the meantime, the profiler's architecture is now modular enough to allow new data providers (such as PAPI) to be implemented, which would enable CPU and multi-GPU measurement.

@ethan-puyaubreau
Contributor Author

Here are some more WIP screenshots of the interface, which uses the tool's output to calculate metrics for the user (e.g. power wasted outside of kernels or while waiting for a CPU kernel to end):

image
image

@vlkale
Contributor

vlkale commented Jun 25, 2025

@ethan-puyaubreau Thanks for putting this together, and these example result screenshots look good. I agree on your note about higher fidelity performance data via PAPI. I would also suggest looking into profiling/logging data from GPU vendor tooling libraries, e.g., nvtx or CUPTI from NVIDIA.

I assume you have discussed this with developers of Variorum, e.g., @tpatki

Some of this may be related to LDMS for HPC Systems Monitoring. @vsurjadidjaja

@tpatki

tpatki commented Jun 25, 2025

We hadn't heard of this, but this is cool to visualize on a dashboard! Thanks @vlkale for tagging me.

We'd like to document and link this through Variorum as well once it is ready, that way other users can benefit from it. Which architectures has this been tested on and are there docs for users for it yet? Sorry, I haven't had a chance to look through the PR in detail.

Tagging @slabasan, @kshoga1 and @rountree on this as well.

We also have LDMS and Variorum integration @vsurjadidjaja @vlkale. That can be found here for those who want to use this: https://github.com/ovis-hpc/ldms/tree/main/ldms/src/contrib/sampler/variorum_sampler.

@ethan-puyaubreau
Contributor Author

Hi @tpatki, thanks for the feedback! For the first steps, I've been using my own computer with an NVIDIA Ampere GPU, but I'm currently testing several new architectures to implement CPU profiling too. I will write comprehensive user documentation right after making the necessary modifications for CPU profiling, though I would be interested to know what you consider sufficient documentation for users to be able to use this (I haven't written much complete documentation before, hence my question).

@tpatki

tpatki commented Jun 25, 2025

@ethan-puyaubreau

Great to hear, we can also potentially help test on some of those architectures at our end.
I'll take a detailed look at the PR and make any suggestions if needed as well in the next week or two.

In terms of documentation, it'll be good to document (1) how to build/install the viz component with Grafana/PostgreSQL along with any dependencies needed for installation, (2) the different viewgraphs that are currently supported for the user, and (3) which architectures it has been tested on and is expected to be supported on.

I (and the Variorum team) can help with some of this as well, it'll be good to include the links and documentation in the main Variorum repo here. Maybe we can create another page (rst file) for the Kokkos connector along with your tool under Integrations when it is ready.

One question I had was on interactive visualizations: does Grafana support that? I haven't used it much, hence the question. It will be cool to be able to zoom-in/zoom-out in timeline graphs, and maybe select per-component viz (I think you're only doing GPU energy at the moment, but we can easily extend to show CPU and Mem as well), and generate some other summary stats (you already have some in your viz, we can extend these). Happy to help brainstorm and also work on this as things progress.

@ethan-puyaubreau
Contributor Author

ethan-puyaubreau commented Jun 25, 2025

I would definitely appreciate the potential help with testing on other architectures. I look forward to your detailed review of the PR and any suggestions you might have.

To make installation relatively straightforward, the entire visualization stack is currently encapsulated in a pre-configured docker-compose stack. It essentially just takes the tool's output files as input. I'll detail this process in the documentation, including any necessary dependencies.

To answer your question about interactive visualizations in Grafana: Yes, absolutely! Afaik Grafana is one of the most robust open source tools for this. You can definitely zoom in/out on timeline graphs, pan across the data, select specific time ranges for detailed inspection and filter data based on various parameters. Here's more information on the platform itself: https://grafana.com/ (don't mind their cloud solution, the whole system can be self-hosted and that's what I'm doing in this case)

For instance, we can easily extend the current GPU energy visualization to allow users to select and view per-component data (CPU, memory, etc.) interactively. Grafana's dashboarding features also make it simple to add and display additional summary statistics right alongside the graphs.

I'm really eager to brainstorm and collaborate on extending these capabilities as things progress. Thanks for offering your help!

@ethan-puyaubreau
Contributor Author

Hi @tpatki! I added a tool based on the NVML library (kp-nvml-energy) that leverages the nvmlDeviceGetTotalEnergyConsumption() API call, which reads the energy in millijoules directly from the driver. However, the values seem to be completely off (more than 1 kJ for a 2-second computation, far more than my 35 W GPU is capable of). Did you run into the same situation when adding metrics to Variorum (especially this specific metric, since I see you didn't use it in Variorum)?
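As a rough sanity check on why such a reading looks implausible, and assuming the 35 W figure is the GPU's sustained power limit, the energy consumed over a 2 s run is bounded by:

```latex
E_{\max} = P_{\max}\,\Delta t = 35\,\mathrm{W} \times 2\,\mathrm{s} = 70\,\mathrm{J} \ll 1\,\mathrm{kJ}
```

A per-region value above 1 kJ is therefore more than an order of magnitude beyond what the device could draw in that window.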

@tpatki

tpatki commented Jul 1, 2025

Hi @ethan-puyaubreau
Interesting. In Variorum, we are reporting instantaneous power, so we didn't use that NVML energy API. FWIW, we do have a PR open for the energy API for GPUs that uses the nvmlDeviceGetTotalEnergyConsumption() API. I don't recall seeing the issue that you're seeing with early testing of that PR, but I may not have done a very thorough test comparing it against instantaneous power values.

What are you comparing your result from nvmlDeviceGetTotalEnergyConsumption() API with? Are you looking at data dumped from the Variorum instantaneous power function or with a direct call to nvmlDeviceGetPowerUsage API? I'm curious what test you're running and what the baseline value for your energy readings is.

@ethan-puyaubreau
Contributor Author

ethan-puyaubreau commented Jul 2, 2025

Hi @tpatki, I compared the energy estimated by integrating nvmlDeviceGetPowerUsage() readings with the results from nvmlDeviceGetTotalEnergyConsumption(). The results seem to indicate that integrating the power gives more realistic results for now.
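For reference, integrating sampled power into energy can be sketched with the trapezoidal rule. This is a minimal illustration and not necessarily the integration scheme used in the PR; `PowerSample` and `integrate_energy_joules` are hypothetical names, and the readings are assumed to have been converted to watts already (NVML reports power usage in milliwatts).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sample type: time in seconds, power in watts.
struct PowerSample {
  double time_s;
  double power_w;
};

// Trapezoidal integration of power over time; returns energy in joules.
// Fewer than two samples yield zero energy.
double integrate_energy_joules(const std::vector<PowerSample>& samples) {
  double energy_j = 0.0;
  for (std::size_t i = 1; i < samples.size(); ++i) {
    const double dt = samples[i].time_s - samples[i - 1].time_s;
    assert(dt >= 0.0);  // samples must be ordered by time
    energy_j += 0.5 * (samples[i].power_w + samples[i - 1].power_w) * dt;
  }
  return energy_j;
}
```

With a constant 30 W reading sampled every 20 ms over 40 ms, this yields 1.2 J, which matches power multiplied by duration.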

Here is my new test, using the benchmark Kokkos code of this current PR, profiling/energy-profiler/energy-benchmark/energy-benchmark.cpp:

Accessing Variorum Data:

image
image

Accessing NVML directly via nvmlDeviceGetPowerUsage():

image

Both tools estimate the energy at around 6.2 to 6.3 kJ.

As for sampling with nvmlDeviceGetTotalEnergyConsumption():

image

This puts the final energy consumed at around 24.4 kJ, nowhere near the other measurements.

For a more in-depth analysis of the results, I can give you the raw data used for this dashboard (attached to this message):
benchmark_laptop_1.zip

The code ran for only one iteration (so no mean values and no multiple batches). The only dependencies for the current dashboard are Python and Docker Compose. Extract the .zip file into the input folder of the dashboard tool so that it merges with the placeholder folders, then launch the whole system with setup.sh and shut it down with remove.sh. More information on how this works is given in the dashboard repo:
https://github.com/ethan-puyaubreau/kokkos-energy-dashboard

More data outputs are on the way, as I am currently running this new benchmark on other platforms to check for more measurements/results.

@ethan-puyaubreau
Contributor Author

The benchmark's length of around 4 minutes is intentional: it allows each tool to be run in isolation, which mitigates performance degradation during testing. This strategy helps account for the overhead of the tools (e.g. an execution time of 262.261 with the NVML Energy Profiler and 262.078 with Variorum). The input files were generated with input/generic_script.sh from kokkos-energy-dashboard, which repeatedly executes the same program with each tool for a set number of iterations.

@tpatki

tpatki commented Jul 2, 2025

@ethan-puyaubreau
You're probably doing this correctly, but just as a sanity check:
Can you point me to where in the code you are calculating energy -- are you taking a delta between the two reported values?

Also, unlike power, "sampling" energy won't make sense. Rather, you'd need the equivalent of start_measurement and end_measurement around the region of interest (e.g. a function) and then take the delta across those two values from the API. The API returns the value in mJ since the driver was last loaded. Another thing to check would be any error in the conversion there from mJ --> kJ.

https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g732ab899b5bd18ac4bfb93c02de4900a
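The start/end delta described above can be sketched as follows. This is an illustration under assumptions: `EnergyRegion` is a hypothetical helper, and `read_counter_mj` is a placeholder callback standing in for a call such as nvmlDeviceGetTotalEnergyConsumption(), which reports cumulative millijoules.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical helper: captures a cumulative energy counter (in mJ) at
// construction and reports the delta on request.
class EnergyRegion {
 public:
  explicit EnergyRegion(std::function<uint64_t()> read_counter_mj)
      : read_(std::move(read_counter_mj)), start_mj_(read_()) {}

  // Energy consumed since construction, in joules (mJ / 1000).
  double elapsed_joules() const {
    const uint64_t now_mj = read_();
    assert(now_mj >= start_mj_);  // the counter is expected to be cumulative
    return static_cast<double>(now_mj - start_mj_) / 1000.0;
  }

 private:
  std::function<uint64_t()> read_;
  uint64_t start_mj_;
};
```

Only the difference between the two readings is meaningful. Recording or summing the raw counter values would instead reflect everything consumed since the driver was loaded.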

@ethan-puyaubreau
Contributor Author

ethan-puyaubreau commented Jul 2, 2025

@tpatki Sure!

The relevant code is located here:
https://github.com/ethan-puyaubreau/kokkos-tools/blob/853bb2d7715a9e4039891ae2753d376dba170d40/profiling/energy-profiler/variorum/variorum_energy_profiler.cpp#L312-L350

I am indeed calculating a delta between two reported values for the Variorum-based tool, but not for the nvmlDeviceGetTotalEnergyConsumption-based tool, where I extract the raw value and add it to the measurements.

Regarding the unit conversion, the Grafana dashboard is indeed set up to interpret the values as millijoules. So as you mentioned, it’s likely that nvmlDeviceGetTotalEnergyConsumption needs to be called at the start and end of each code region. I’ll definitely try that out and see how it behaves.

@ethan-puyaubreau changed the title from "Energy Consumption profiling tool" to "NVML and Variorum based energy measurement tool for Kokkos" Jul 12, 2025
Comment on lines +75 to +78
void power_monitoring_tick() {
  if (!g_nvml_provider || !g_nvml_provider->is_initialized()) {
    return;
  }
Contributor

what?

Contributor Author

They also used this line of code when it comes to directly getting the NVML_FI_DEV_POWER_INSTANT afaik

Contributor

@JBludau JBludau left a comment

first pass


// --- Configuration ---
// The interval in milliseconds for power sampling.
constexpr int SAMPLING_INTERVAL_MS = 20;
Contributor

we should set this based on the "Part time power measurements ..." paper

// --- Core Initialization ---
KernelTimerTool timer;

bool VERBOSE = false;
Contributor

hmmm ... I think this would rather be a CMake option than something that gets manually changed in the source code. And you could make it a compile time decision

Contributor Author

Fixed, that would also allow for verbose to expand to energy measurement tools and not only the timer.
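One way to make verbosity a compile-time decision is sketched below. The macro name `KP_ENERGY_VERBOSE` and the helper `log_verbose` are hypothetical and not taken from this PR; the assumption is that a CMake option would set the definition, for example through `target_compile_definitions`.

```cpp
#include <cassert>
#include <iostream>
#include <string>

// Hypothetical macro name. A CMake option could define it, e.g. with
// target_compile_definitions(<target> PRIVATE KP_ENERGY_VERBOSE=1).
#ifndef KP_ENERGY_VERBOSE
#define KP_ENERGY_VERBOSE 0
#endif

constexpr bool kVerbose = (KP_ENERGY_VERBOSE != 0);

// Prints the message only when verbosity was enabled at build time.
// Returns true if the message was printed. Requires C++17.
inline bool log_verbose(const std::string& msg) {
  if constexpr (kVerbose) {
    std::cerr << "[energy-profiler] " << msg << '\n';
    return true;
  } else {
    return false;
  }
}
```

Because the branch is resolved at compile time, the same helper can be shared by the timer and the energy measurement tools without a runtime flag.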

Comment on lines +46 to +54
struct EnergyTimer {
 public:
  void start_timing(uint64_t timing_id, RegionType type, std::string name);
  void end_timing(uint64_t timing_id);
  std::unordered_map<uint64_t, EnergyTiming>& get_timings();

 private:
  std::unordered_map<uint64_t, EnergyTiming> timings_;
};
Contributor

hmm ... not sure if it is worth having the class and not just two free funcs and the unordered map

Contributor Author

Indeed, I can change that to something simpler.

uint64_t id = 0;
};

struct EnergyTiming {
Contributor

Why did you call it EnergyTiming?
I think I would rather have dedicated start and end or start at construction and something like now and reset.

Contributor Author

It was a remnant of the hard coded version of the variorum provider system, that can definitely be changed.
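A timer that starts at construction and exposes now and reset, as suggested in the review, could look like the sketch below. `RegionTimer` is a hypothetical name and this is only one possible reading of the suggestion.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical timer: starts at construction, reports elapsed time with
// now(), and restarts with reset().
class RegionTimer {
 public:
  using clock = std::chrono::steady_clock;

  RegionTimer() : start_(clock::now()) {}

  // Seconds elapsed since construction or the last reset().
  double now() const {
    const std::chrono::duration<double> elapsed = clock::now() - start_;
    assert(elapsed.count() >= 0.0);  // steady_clock never goes backwards
    return elapsed.count();
  }

  void reset() { start_ = clock::now(); }

 private:
  clock::time_point start_;
};
```

Using std::chrono::steady_clock keeps the measurement immune to wall-clock adjustments during a run.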

namespace KokkosTools {
namespace Timer {

void export_kernels_csv(const std::deque<TimingInfo>& timings,
Contributor

hmm this is a lot of repetition ...

Contributor Author

Currently working on refactoring this because indeed there is a lot of unnecessary repetition.

Comment on lines +25 to +40
#include <iostream>
#include <vector>
#include <string>
#include <chrono>
#include <mutex>
#include <iomanip>
#include <cmath>
#include <fstream>
#include <memory>

#include "kp_core.hpp"
#include "../common/daemon.hpp"
#include "../provider/provider_nvml.hpp"
#include "../common/filename_prefix.hpp"
#include "../common/timer.hpp"
#include "../tools/kernel_timer_tool.hpp"
Contributor

same here

#include "../tools/kernel_timer_tool.hpp"

namespace KokkosTools {
namespace EnergyConsumption {
Contributor

There seems to be a lot of repetition ... maybe we should focus on the one we ended up using at the end


for (size_t i = 0; i < devices_.size(); ++i) {
  double device_power = get_device_power_usage(i);
  if (device_power >= 0.0) {
Contributor

why do we have to check that here? are negative values expected?

Contributor Author

Also a remnant of NVML direct power measurement.


double NVMLProvider::get_device_power_usage(size_t device_index) {
  if (!initialized_ || device_index >= devices_.size()) {
    return -1.0;
Contributor

hmm ... I see ... maybe you could use an optional for this

Contributor Author

For the function definition or for the if condition? The if condition can definitely be simplified, the current version is the one sanity check I see pretty much everywhere when NVML is involved.
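Applied to the function's return type, the std::optional suggestion could look like the sketch below. `FakePowerProvider` is a hypothetical stand-in with preloaded readings, not the PR's NVMLProvider; it only illustrates replacing the -1.0 sentinel and the `>= 0.0` check at the call site.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a power provider; readings are preloaded
// instead of being queried from a driver.
class FakePowerProvider {
 public:
  explicit FakePowerProvider(std::vector<double> watts)
      : watts_(std::move(watts)) {}

  // Returns the reading for a device, or std::nullopt when the index is
  // invalid, instead of a -1.0 sentinel.
  std::optional<double> get_device_power_usage(std::size_t device_index) const {
    if (device_index >= watts_.size()) return std::nullopt;
    return watts_[device_index];
  }

  // Sums only the readings that are available; no sign check is needed.
  double total_power() const {
    double total = 0.0;
    for (std::size_t i = 0; i < watts_.size(); ++i) {
      if (const auto reading = get_device_power_usage(i)) total += *reading;
    }
    return total;
  }

 private:
  std::vector<double> watts_;
};
```

The caller then tests for the presence of a value rather than comparing against a magic number.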

bool is_initialized() const { return initialized_; }

private:
bool initialized_;
Contributor

maybe call it "in_working_state" since you assume that when it is false it is either finalized or was never initialized

Contributor Author

Updated the name as of now.

@ethan-puyaubreau
Contributor Author

Hello, this PR has been subdivided into multiple blocks: #299, #300, #301 and #302.
