NVML and Variorum based energy measurement tool for Kokkos #295
|
This work would be joined with #296, as this tool can run as a daemon (measuring the GPU or any other component at a specified interval) combined with a refactored version of the kernel timer that would provide access to a unified time measurement interface. |
|
Note: Some of the used mechanics such as |
|
After several changes to the data model, the Python script that generated plots of the data is no longer relevant. However, I made a self-sufficient Docker-based tool using Grafana and PostgreSQL that allows for easy visualization of profiled data, available here: https://github.com/ethan-puyaubreau/kokkos-energy-dashboard Here are some screenshots of the current interface: Please note that Grafana offers a very wide range of visualization options (e.g. https://grafana.com/grafana/dashboards/), so any ideas on what users want to see on the tool's dashboard are welcome. Currently, the Docker tool only needs the Kokkos Tools CSV output files and requires no hands-on experience with Docker/Grafana/PostgreSQL to access these graphs, making it an almost turnkey tool. Profiling with more advanced programs is still in progress to find the best ways to visualize profiled data in specific situations (high kernel counts, etc.). In the meantime, the profiler's architecture is now modular enough to allow new data providers (such as PAPI) to be implemented, which would enable measurement of CPUs and multiple GPUs. |
|
@ethan-puyaubreau Thanks for putting this together, and these example result screenshots look good. I agree on your note about higher fidelity performance data via PAPI. I would also suggest looking into profiling/logging data from GPU vendor tooling libraries, e.g., NVTX or CUPTI from NVIDIA. I assume you have discussed this with the developers of Variorum, e.g., @tpatki. Some of this may be related to LDMS for HPC systems monitoring. @vsurjadidjaja |
|
We hadn't heard of this, but this is cool to visualize on a dashboard! Thanks @vlkale for tagging me. We'd like to document and link this through Variorum as well once it is ready, that way other users can benefit from it. Which architectures has this been tested on and are there docs for users for it yet? Sorry, I haven't had a chance to look through the PR in detail. Tagging @slabasan, @kshoga1 and @rountree on this as well. We also have LDMS and Variorum integration @vsurjadidjaja @vlkale. That can be found here for those who want to use this: https://github.com/ovis-hpc/ldms/tree/main/ldms/src/contrib/sampler/variorum_sampler. |
|
Hi @tpatki, thanks for the feedback! For the first steps, I've been using my own computer with an NVIDIA Ampere GPU, but I'm currently testing several new architectures to implement CPU profiling too. I will write comprehensive user documentation once I have made the modifications needed for CPU profiling, though I would be interested to know what you would consider sufficient documentation for users to be able to use this (I haven't written much complete documentation before, hence my question). |
|
Great to hear, we can also potentially help test on some of those architectures at our end. In terms of documentation, it'll be good to document (1) how to build/install the viz component with Grafana/PostgreSQL along with any dependencies needed for installation, (2) the different viewgraphs that are currently supported for the user, and (3) which architectures it has been tested on and is expected to be supported on. I (and the Variorum team) can help with some of this as well, it'll be good to include the links and documentation in the main Variorum repo here. Maybe we can create another page ( One question I had was on interactive visualizations: does Grafana support that? I haven't used it much, hence the question. It will be cool to be able to zoom-in/zoom-out in timeline graphs, and maybe select per-component viz (I think you're only doing GPU energy at the moment, but we can easily extend to show CPU and Mem as well), and generate some other summary stats (you already have some in your viz, we can extend these). Happy to help brainstorm and also work on this as things progress. |
|
I would definitely appreciate the potential help with testing on other architectures. Looking forward to your detailed review of the PR and any suggestions you might have indeed. To make installation relatively straightforward, the entire visualization stack is currently encapsulated in a pre-configured To answer your question about interactive visualizations in Grafana: Yes, absolutely! Afaik Grafana is one of the most robust open source tools for this. You can definitely zoom in/out on timeline graphs, pan across the data, select specific time ranges for detailed inspection and filter data based on various parameters. Here's more information on the platform itself: https://grafana.com/ (don't mind their cloud solution, the whole system can be self-hosted and that's what I'm doing in this case) For instance, we can easily extend the current GPU energy visualization to allow users to select and view per-component data (CPU, memory, etc.) interactively. Grafana's dashboarding features also make it simple to add and display additional summary statistics right alongside the graphs. I'm really eager to brainstorm and collaborate on extending these capabilities as things progress. Thanks for offering your help! |
|
Hi! @tpatki I added one tool from NVML library (kp-nvml-energy) that leverages the |
|
Hi @ethan-puyaubreau What are you comparing your result from |
|
Hi @tpatki, I compared the data from the estimated energy integration from Here is my new test, using the benchmark Kokkos code of this current PR, Accessing Variorum Data: Accessing NVML directly via Both tools estimating energy around 6.2 or 6.3 kJ. As for sampling with Estimating the final energy consumed to around 24.4 kJ, nowhere near the other measurements. For more in depth analysis of the results, I can give you the raw data used for this dashboard (attached to this message) The code has run for only one iteration (so no mean values and no multiple batches). To use with the current dashboard, the only dependencies would be Python and Docker Compose, you would need to extract the More data outputs are on the way, as I am currently running this new benchmark on other platforms to check for more measurements/results. |
|
The benchmark's length of around 4mins is intentional, allowing for the isolated execution of various tools to mitigate performance degradation during testing. This strategy would help account for the overhead of tools (e.g. execution time being 262.261 on NVML Energy Profiler and 262.078 on Variorum). Input file generation utilized |
|
@ethan-puyaubreau Also, unlike power, "sampling" energy won't make sense, rather you'd need the equivalent of |
|
@tpatki Sure! The relevant code is located here: I am indeed calculating a delta between two reported values for the Variorum based tool, but not for the Regarding the unit conversion, the Grafana dashboard is indeed set up to interpret the values as millijoules. So as you mentioned, it’s likely that |
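The distinction raised in this exchange (differencing a cumulative energy counter rather than sampling it like power, plus the millijoule unit question) can be sketched as follows. This is a minimal illustration, assuming a monotonically increasing counter that reports cumulative millijoules; the helper names `energy_delta_joules` and `average_power_watts` are hypothetical and are not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the PR): energy consumed between two
// readings of a counter that reports cumulative energy in millijoules.
// A cumulative counter is differenced, not summed like power samples.
inline double energy_delta_joules(std::uint64_t start_mj,
                                  std::uint64_t end_mj) {
  assert(end_mj >= start_mj && "cumulative counter must not decrease");
  return static_cast<double>(end_mj - start_mj) / 1000.0;  // mJ -> J
}

// Hypothetical helper: average power over the same window, derived from
// the energy delta and the elapsed wall-clock time in seconds.
inline double average_power_watts(std::uint64_t start_mj,
                                  std::uint64_t end_mj,
                                  double elapsed_seconds) {
  assert(elapsed_seconds > 0.0);
  return energy_delta_joules(start_mj, end_mj) / elapsed_seconds;
}
```

Summing raw counter readings at every tick would count the same energy many times over, so a total that is several times too large is the kind of symptom this mistake would produce.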
profiling/energy-profiler/energy-benchmark/src/energy_benchmark.cpp
void power_monitoring_tick() {
  if (!g_nvml_provider || !g_nvml_provider->is_initialized()) {
    return;
  }
what?
They also used this line of code when it comes to directly getting the NVML_FI_DEV_POWER_INSTANT afaik
JBludau
left a comment
first pass
// --- Configuration ---
// The interval in milliseconds for power sampling.
constexpr int SAMPLING_INTERVAL_MS = 20;
we should set this based on the "Part time power measurements ..." paper
// --- Core Initialization ---
KernelTimerTool timer;

bool VERBOSE = false;
hmmm ... I think this would rather be a CMake option than something that gets manually changed in the source code. And you could make it a compile time decision
Fixed, that would also allow for verbose to expand to energy measurement tools and not only the timer.
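One way to make verbosity a compile-time decision, as suggested above, is to have the build system define a macro and branch on it with `if constexpr`. The sketch below assumes C++17; the macro name `KOKKOSTOOLS_ENERGY_VERBOSE` and the function `log_verbose` are hypothetical and do not reflect what the PR actually adopted. A CMake `option()` combined with `target_compile_definitions()` could set such a macro.

```cpp
#include <iostream>
#include <string>

// Hypothetical macro name: the build system would define it to 1 when
// verbose output is requested. It defaults to 0 (quiet) here.
#ifndef KOKKOSTOOLS_ENERGY_VERBOSE
#define KOKKOSTOOLS_ENERGY_VERBOSE 0
#endif

namespace sketch {

// Compile-time flag shared by the timer and the energy tools.
constexpr bool verbose = (KOKKOSTOOLS_ENERGY_VERBOSE != 0);

// Prints the message only in verbose builds. Returns true if the
// message was printed, false otherwise.
inline bool log_verbose(const std::string& msg) {
  if constexpr (verbose) {
    std::cerr << "[energy-profiler] " << msg << '\n';
    return true;
  } else {
    (void)msg;  // unused in quiet builds
    return false;
  }
}

}  // namespace sketch
```

Because the flag is a `constexpr` value, the printing branch is discarded in quiet builds instead of being tested at run time.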
struct EnergyTimer {
 public:
  void start_timing(uint64_t timing_id, RegionType type, std::string name);
  void end_timing(uint64_t timing_id);
  std::unordered_map<uint64_t, EnergyTiming>& get_timings();

 private:
  std::unordered_map<uint64_t, EnergyTiming> timings_;
};
hmm ... not sure if it is worth having the class and not just two free funcs and the unordered map
Indeed, I can change that to something simpler.
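The simpler shape proposed here, two free functions operating on a plain `std::unordered_map`, might look like the sketch below. The `Timing` record and the exact signatures are assumptions made for illustration; the `RegionType` parameter from the original interface is omitted to keep the example self-contained.

```cpp
#include <chrono>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

namespace sketch {

using Clock = std::chrono::steady_clock;

// Hypothetical record; the PR's EnergyTiming has its own fields.
struct Timing {
  std::string name;
  Clock::time_point start;
  Clock::time_point end;
  bool finished = false;
};

using TimingMap = std::unordered_map<std::uint64_t, Timing>;

// Records the start of a region under the given id.
inline void start_timing(TimingMap& timings, std::uint64_t id,
                         std::string name) {
  Timing t;
  t.name = std::move(name);
  t.start = Clock::now();
  timings[id] = std::move(t);
}

// Records the end of a region. Returns false if the id is unknown.
inline bool end_timing(TimingMap& timings, std::uint64_t id) {
  auto it = timings.find(id);
  if (it == timings.end()) return false;
  it->second.end = Clock::now();
  it->second.finished = true;
  return true;
}

}  // namespace sketch
```

The caller owns the map directly, so no accessor such as `get_timings()` is needed.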
  uint64_t id = 0;
};

struct EnergyTiming {
Why did you call it EnergyTiming?
I think I would rather have dedicated start and end or start at construction and something like now and reset.
It was a remnant of the hard coded version of the variorum provider system, that can definitely be changed.
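The second alternative mentioned in the review (start at construction, then something like `now` and `reset`) can be sketched as a small class. The name `IntervalTimer` is hypothetical, and `std::chrono::steady_clock` is an assumed choice of clock.

```cpp
#include <chrono>

namespace sketch {

// Hypothetical timer: it starts when constructed, now() reports the
// elapsed time since construction or the last reset(), and reset()
// restarts the measurement.
class IntervalTimer {
 public:
  using Clock = std::chrono::steady_clock;

  IntervalTimer() : start_(Clock::now()) {}

  // Elapsed time since construction or the most recent reset().
  std::chrono::nanoseconds now() const {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
        Clock::now() - start_);
  }

  // Restarts the measurement from the current instant.
  void reset() { start_ = Clock::now(); }

 private:
  Clock::time_point start_;
};

}  // namespace sketch
```

With this shape there is no "started but never stopped" state to validate, which is one reason to prefer it over separate start and end calls.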
namespace KokkosTools {
namespace Timer {

void export_kernels_csv(const std::deque<TimingInfo>& timings,
hmm this is a lot of repetition ...
Currently working on refactoring this because indeed there is a lot of unnecessary repetition.
#include <iostream>
#include <vector>
#include <string>
#include <chrono>
#include <mutex>
#include <iomanip>
#include <cmath>
#include <fstream>
#include <memory>

#include "kp_core.hpp"
#include "../common/daemon.hpp"
#include "../provider/provider_nvml.hpp"
#include "../common/filename_prefix.hpp"
#include "../common/timer.hpp"
#include "../tools/kernel_timer_tool.hpp"
same here
#include "../tools/kernel_timer_tool.hpp"

namespace KokkosTools {
namespace EnergyConsumption {
There seems to be a lot of repetition ... maybe we should focus on the one we ended up using at the end
for (size_t i = 0; i < devices_.size(); ++i) {
  double device_power = get_device_power_usage(i);
  if (device_power >= 0.0) {
why do we have to check that here? are negative values expected?
Also a remnant of NVML direct power measurement.
double NVMLProvider::get_device_power_usage(size_t device_index) {
  if (!initialized_ || device_index >= devices_.size()) {
    return -1.0;
hmm ... I see ... maybe you could use an optional for this
For the function definition or for the if condition? The if condition can definitely be simplified, the current version is the one sanity check I see pretty much everywhere when NVML is involved.
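To illustrate the `std::optional` suggestion as it would apply to the function's return type: instead of returning `-1.0` as a sentinel, the query returns an empty optional when no reading is available, and callers no longer need a `>= 0.0` check. The sketch below assumes C++17 and uses a hypothetical stand-in class (`FakePowerProvider`) that holds fixed readings rather than calling NVML.

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical stand-in for the provider: it holds fixed readings in
// watts instead of querying NVML, so the example stays self-contained.
class FakePowerProvider {
 public:
  explicit FakePowerProvider(std::vector<double> watts)
      : watts_(std::move(watts)) {}

  // Returns the reading for a device, or std::nullopt if the index is
  // out of range (this replaces the -1.0 sentinel).
  std::optional<double> device_power_watts(std::size_t index) const {
    if (index >= watts_.size()) return std::nullopt;
    return watts_[index];
  }

  // Sums only the readings that are actually available.
  double total_power_watts() const {
    double total = 0.0;
    for (std::size_t i = 0; i < watts_.size(); ++i) {
      if (auto p = device_power_watts(i)) total += *p;
    }
    return total;
  }

 private:
  std::vector<double> watts_;
};

}  // namespace sketch
```

The "no value" case is then part of the type, so a legitimate reading can never be confused with an error code.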
bool is_initialized() const { return initialized_; }

private:
bool initialized_;
maybe call it "in_working_state" since you assume that when it is false it is either finalized or was never initialized
Updated the name as of now.







Hello,
I am submitting this draft pull request to share my progress on two separate energy profiling tools for Kokkos-based applications. Both tools leverage Kokkos profiling hooks (kokkosp_begin/end_parallel_for, kokkosp_begin/end_parallel_reduce, kokkosp_begin/end_parallel_scan) to accurately record kernel start and end times, and at application finalization they produce structured outputs combining power readings with kernel durations.
The first tool uses the Variorum-Kokkos connector (profiling/variorum-connector), continuously sampling GPU power via Variorum at a 20 ms interval. This interval was chosen empirically as the shortest refresh period supported by NVIDIA drivers (the exact AMD refresh rate is still to be verified); when sampling faster than this, the driver does not update the power data between reads. A 20 ms cadence ensures consistency with hardware and software constraints.
The second tool relies on the NVIDIA Management Library (NVML) API to query power draw directly. By avoiding the JSON parsing step required by Variorum, it achieves a simpler integration in NVIDIA-only environments, albeit at the cost of portability.
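As a rough illustration of how the Kokkos profiling hooks named in this description record kernel start and end times, the sketch below implements the `parallel_for` pair only. The hook signatures follow the Kokkos Tools callback interface; the storage (`KernelRecord`, `records()`, `next_id()`) is hypothetical, and thread safety and the reduce/scan hooks are left out for brevity.

```cpp
#include <chrono>
#include <cstdint>
#include <string>
#include <unordered_map>

namespace sketch {

using Clock = std::chrono::steady_clock;

// Hypothetical per-kernel record; not the PR's data model.
struct KernelRecord {
  std::string name;
  Clock::time_point start;
  Clock::time_point end;
};

// Hypothetical storage, kept in function-local statics.
inline std::unordered_map<std::uint64_t, KernelRecord>& records() {
  static std::unordered_map<std::uint64_t, KernelRecord> r;
  return r;
}

inline std::uint64_t& next_id() {
  static std::uint64_t id = 0;
  return id;
}

}  // namespace sketch

// Called by Kokkos before a parallel_for launches. The tool assigns
// the kernel id through kID and stores the start time.
extern "C" void kokkosp_begin_parallel_for(const char* name,
                                           const std::uint32_t /*devID*/,
                                           std::uint64_t* kID) {
  *kID = sketch::next_id()++;
  sketch::records()[*kID] =
      sketch::KernelRecord{name, sketch::Clock::now(), {}};
}

// Called by Kokkos when the same parallel_for completes.
extern "C" void kokkosp_end_parallel_for(const std::uint64_t kID) {
  auto it = sketch::records().find(kID);
  if (it != sketch::records().end()) it->second.end = sketch::Clock::now();
}
```

At finalization, such records can be joined with the power samples by timestamp to produce the combined output described above.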
To date, both tools focus exclusively on GPU energy consumption. CPU measurement (e.g., via RAPL) is not yet supported due to permission and compatibility challenges. Similarly, the 20 ms sampling granularity may miss short-lived kernels (under 20 ms) when they occur in rapid succession. I am investigating estimation techniques based on FLOPS/Watt models, which correlate computational work to energy use via a GPU consumption matrix, to (try to) fill this gap.
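For kernels that fall between two 20 ms samples, one simple baseline is to interpolate power linearly between the neighbouring samples and integrate it over the kernel window with the trapezoidal rule. This is not the FLOPS/Watt model mentioned above, only a sketch of a possible starting point; `PowerSample`, `power_at`, and `energy_joules` are hypothetical names, and the samples are assumed to be sorted by strictly increasing time.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

namespace sketch {

// Hypothetical sample type: time in seconds, power in watts.
struct PowerSample {
  double t_s;
  double watts;
};

// Linearly interpolated power at time t. Outside the sampled range the
// nearest sample is used (an assumption, not a measured value).
inline double power_at(const std::vector<PowerSample>& s, double t) {
  assert(!s.empty());
  if (t <= s.front().t_s) return s.front().watts;
  if (t >= s.back().t_s) return s.back().watts;
  auto hi = std::upper_bound(
      s.begin(), s.end(), t,
      [](double v, const PowerSample& p) { return v < p.t_s; });
  auto lo = hi - 1;
  const double frac = (t - lo->t_s) / (hi->t_s - lo->t_s);
  return lo->watts + frac * (hi->watts - lo->watts);
}

// Energy in joules over [t0, t1], using the trapezoidal rule on the
// piecewise-linear power curve.
inline double energy_joules(const std::vector<PowerSample>& s, double t0,
                            double t1) {
  assert(t1 >= t0);
  double energy = 0.0;
  double prev_t = t0;
  double prev_p = power_at(s, t0);
  for (const auto& p : s) {
    if (p.t_s <= t0 || p.t_s >= t1) continue;  // keep interior samples
    energy += 0.5 * (prev_p + p.watts) * (p.t_s - prev_t);
    prev_t = p.t_s;
    prev_p = p.watts;
  }
  energy += 0.5 * (prev_p + power_at(s, t1)) * (t1 - prev_t);
  return energy;
}

}  // namespace sketch
```

For a kernel much shorter than the sampling interval, this estimate reflects the surrounding power level rather than the kernel's own draw, which is the gap a model-based approach would try to close.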
GPUs also exhibit power-state transition behaviors depending on compute-bound versus memory-bound phases, introducing latency between power levels. I am still measuring workload characteristics with benchmark testing to estimate transition latencies. Comparing measured power profiles with these estimates will highlight any transition overhead as an "energy delta", i.e. energy that is effectively wasted during transitions.
Here are some early graphs generated from data collected by the Variorum connector:
I also use Grafana dashboards for interactive visualization (example below), along with Python/Matplotlib scripts and Perfetto for post-processing. All tools consume a unified output format to ensure portability:
In a bytes-flops benchmark designed to analyze GPU power transition latency, this plot illustrates how the GPU moves between power levels:
All benchmarks I currently use, along with some post-processing scripts, can be found at: https://github.com/ethan-puyaubreau/kokkos-energy-benchmarks
This project remains a preliminary experiment. I welcome feedback on extending CPU support, improving sub-interval energy estimation, handling multi-GPU configurations, and refining power-transition models.