Skip to content
Open
Show file tree
Hide file tree
Changes from 14 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
44 changes: 44 additions & 0 deletions profiling/energy-profiler/daemon.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
//@HEADER
// ************************************************************************
//
// Kokkos v. 4.0
// Copyright (2022) National Technology & Engineering
// Solutions of Sandia, LLC (NTESS).
//
// Under the terms of Contract DE-NA0003525 with NTESS,
// the U.S. Government retains certain rights in this software.
//
// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions.
// See https://kokkos.org/LICENSE for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//@HEADER

#include "daemon.hpp"
#include <stdexcept>
#include <thread>

namespace KokkosTools::EnergyProfiler {
void Daemon::start() {
if (!running_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the semantics of start() if it's already running? Should it break? Should it be restarted?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should just do nothing if it is already running. Multiple start calls should be fine since in our current use case the global system time is recorded. If someone wants to use it differently I would opt we add a restart() functionality.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but as described below ... this is not threadsafe anyway

running_ = true;
thread_ = std::thread(&Daemon::run, this);
}
}

void Daemon::stop() {
if (running_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the semantics here? Should we require that it's been running?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, it is not threadsafe, so we are limited with the current impl. But if we care we can either go to jthread or at least use atomics for the status

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I think C++20 std::jthread is a great idea here. I have a slight inclination to use C++ atomics for now since it seems more straightforward - I am wondering how portable/available jthreads would be on various computing systems currently?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the tools don't have a minimum requirement for 20, right? Since most of the other tools are not threadsafe, I would say that is a larger discussion.
But if we want it to be std::jthread now, we can update. I am happy for any contribution

running_ = false;
thread_.join();
}
}

void Daemon::run() {
while (running_) {
auto next_run = std::chrono::high_resolution_clock::now() + interval_;
func_();
std::this_thread::sleep_until(next_run);
Comment on lines +38 to +40
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering about use cases where e.g. the time that the function takes is longer than the sampling interval.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never ment this to be able to run a function at an interval lower than its own execution time. That would be a much more complex daemon.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Just thinking that in terms of strategy for the PR, it may make sense to clarify whether the PR is going for a general deamon, or for a deamon that works for the energy use case. Perhaps focusing on the energy use case would be helpful because it would give a good framework to think about issues like the choice of the tick for the duration, or this issue of the sampling frequency vs the duration of the API call.

If you narrow the focus to a daemon specific to the energy use case, the daemon could be placed into the KokkosTools::EnergyProfiler namespace for now.

That being said, I do think it would make sense to think about what happens when the API call takes longer than the sampling interval. Perhaps not systematically. But e.g. in the case where for whatever random reason, the API call happens to take a bit longer than expected.

Copy link
Contributor

@vlkale vlkale Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, let me preface by saying this work by @ethan-puyaubreau has been valuable in terms of capabilities for energy profiling and monitoring as well as ideas for software engineering that apply to other tools in Kokkos Tools.

With that, to hone in on your bigger picture question on a generic daemon class:

Assuming I have understood your use of the term Daemon as a real-time independent monitoring service, a key reason for my Kokkos Tools PR #291 for the Kokkos Tools LDMS connector is enabling this daemon-like functionality via a production-level performance monitoring software tool, called LDMS (Lightweight Distributed Monitoring Service). The motivation for it is that the Kokkos Tools LDMS connector can eventually integrated with - or more specifically be a parent tools library to - any other Kokkos Tools child/base library, e.g., nvtx-connector, variorium-connector. A variant of this LDMS connector had been used for the last couple years in production - it has been closed-source until this PR #291. (By the way, PR #291 is ready for review - feel free to provide suggestions for improvements and/or let me know if you have questions on it.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it into the KokkosTools::EnergyProfiler namespace in f01b82e

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

}
}
} // namespace EnergyProfiler
} // namespace KokkosTools
44 changes: 44 additions & 0 deletions profiling/energy-profiler/daemon.hpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
//@HEADER
// ************************************************************************
//
// Kokkos v. 4.0
// Copyright (2022) National Technology & Engineering
// Solutions of Sandia, LLC (NTESS).
//
// Under the terms of Contract DE-NA0003525 with NTESS,
// the U.S. Government retains certain rights in this software.
//
// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions.
// See https://kokkos.org/LICENSE for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//@HEADER

#ifndef KOKKOSP_ENERGY_PROFILER_DAEMON_HPP
#define KOKKOSP_ENERGY_PROFILER_DAEMON_HPP

#include <chrono>
#include <functional>
#include <thread>

namespace KokkosTools::EnergyProfiler {
class Daemon {
public:
Daemon(std::function<void()> func, const std::chrono::duration& interval)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One quick note is I don't see a default value for the time interval of the Daemon but maybe you could have one. If you do, I think a reasonable one would be 1 second. This would at least makes sense from the point of view of #291. Maybe there are other use cases and situations where you want it to be shorter (or longer).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think the user explicitly specifying a duration is a good practice here. It is a core setting of the daemon that should be thought about explicitly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, got it - yes, I agree. We want this Daemon class to be generic beyond the energy profiling use case and/or LDMS use case, and it should indeed be thought out by users explicitly based on the scenario.

: interval_(interval), func_(std::move(func)){};

void start();
void stop();
bool is_running() const { return running_; }
std::thread& get_thread() { return thread_; }

private:
void run();
std::chrono::duration interval_;
bool running_{false};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an issue of thread safety?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. As stated in a previous comment, we could just switch to jthread and https://en.cppreference.com/w/cpp/thread/jthread/request_stop.html once we have c++20

std::function<void()> func_;
std::thread thread_;
};
} // namespace EnergyProfiler
} // namespace KokkosTools
#endif
Loading