
@cfreeamd
Contributor

Includes several tests (rocrtst) for this capability.

Motivation

Technical Details

JIRA ID

Test Plan

Test Result

Submission Checklist

Contributor

Copilot AI left a comment


Pull request overview

Updates ROCr GPU core dump generation (especially for regular files) and adds rocrtst coverage for configurable GPU core dump patterns and content validation.

Changes:

  • Update core dump writer to use pwrite for regular files and adjust size-limit handling/truncation behavior.
  • Add new rocrtst functional tests for GPU core dump patterns, disable flag, pipe patterns, and basic ELF/content integrity checks.
  • Add a faulting-kernel test case (disabled by default) and wire new tests into the rocrtst test runner.
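The first change above (positional writes to a regular file, bounded by a size limit) can be sketched roughly as follows. This is a minimal illustration, not the code in amd_core_dump.cpp: the helper name `WriteAtWithLimit` and its parameters are hypothetical, and the assumed behavior is that a chunk crossing the limit is truncated rather than rejected.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>
#include <unistd.h>

// Hypothetical helper (not the ROCr implementation): writes up to `len` bytes
// of `data` at absolute file offset `offset`, never writing at or beyond
// `size_limit`. Returns the number of bytes written, or -1 on an I/O error.
ssize_t WriteAtWithLimit(int fd, const void* data, size_t len, off_t offset,
                         off_t size_limit) {
  if (offset >= size_limit) return 0;  // The whole chunk lies past the limit.

  // Truncate the chunk so that it ends exactly at the limit if it crosses it.
  const uint64_t room = static_cast<uint64_t>(size_limit - offset);
  const size_t allowed =
      static_cast<size_t>(std::min<uint64_t>(static_cast<uint64_t>(len), room));

  const char* p = static_cast<const char*>(data);
  size_t done = 0;
  while (done < allowed) {
    // pwrite() takes an explicit offset, so the file position is never moved
    // and no separate lseek() is needed between segments.
    ssize_t n = pwrite(fd, p + done, allowed - done,
                       offset + static_cast<off_t>(done));
    if (n < 0) {
      if (errno == EINTR) continue;  // Interrupted: retry the same range.
      return -1;
    }
    if (n == 0) break;  // No progress; give up rather than spin.
    done += static_cast<size_t>(n);
  }
  return static_cast<ssize_t>(done);
}
```

Because `pwrite` fails with `ESPIPE` on pipes and sockets, a helper like this only suits regular files, which matches the "for regular files" qualifier in the summary; pipe patterns would still need sequential `write` calls.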

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| projects/rocr-runtime/runtime/hsa-runtime/libamdhsacode/lnx/amd_core_dump.cpp | Adds pwrite-based emission for regular files and changes size-limit truncation logic for core dump writing. |
| projects/rocr-runtime/runtime/hsa-runtime/core/runtime/runtime.cpp | Adds a VM fault handler stderr print. |
| projects/rocr-runtime/rocrtst/suites/test_common/main.cc | Registers new GPU core dump tests and replaces the prior interrupt-disabled example test with a disabled faulting test. |
| projects/rocr-runtime/rocrtst/suites/functional/test_fault_example.h | Declares a new fault-inducing test case (disabled by default). |
| projects/rocr-runtime/rocrtst/suites/functional/test_fault_example.cc | Implements a kernel dispatch that intentionally passes null pointers to trigger a GPU fault. |
| projects/rocr-runtime/rocrtst/suites/functional/gpu_coredump.h | Declares a new test fixture for core dump pattern/content validation. |
| projects/rocr-runtime/rocrtst/suites/functional/gpu_coredump.cc | Implements multiple GPU core dump tests, including pattern matching, ELF validation, and segment checks. |
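As a rough idea of what a basic ELF validation step like the one listed for gpu_coredump.cc might involve, here is a minimal sketch using the Linux `<elf.h>` definitions. It is not taken from the PR: the function name `LooksLikeElf64Core` is hypothetical, and it assumes the GPU core dump is a 64-bit ELF file of type `ET_CORE` in host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <elf.h>

// Hypothetical check (not the PR's test code): returns true if `data` begins
// with a plausible ELF64 core-file header whose program header table fits
// inside the `size` bytes available.
// Assumptions: 64-bit ELF, ET_CORE type, host byte order.
bool LooksLikeElf64Core(const unsigned char* data, size_t size) {
  if (size < sizeof(Elf64_Ehdr)) return false;  // Too short for a header.

  Elf64_Ehdr eh;
  std::memcpy(&eh, data, sizeof(eh));  // Copy to avoid unaligned access.

  if (std::memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0) return false;  // "\177ELF"
  if (eh.e_ident[EI_CLASS] != ELFCLASS64) return false;
  if (eh.e_type != ET_CORE) return false;

  if (eh.e_phnum > 0) {
    // Each entry must have the expected size, and the whole table must lie
    // within the file, otherwise later segment checks would read out of bounds.
    if (eh.e_phentsize != sizeof(Elf64_Phdr)) return false;
    const uint64_t table_bytes =
        static_cast<uint64_t>(eh.e_phnum) * sizeof(Elf64_Phdr);
    if (eh.e_phoff > size || table_bytes > size - eh.e_phoff) return false;
  }
  return true;
}
```

A test would typically read the generated core file into a buffer and pass it to a check like this before inspecting individual segments.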


Comment on lines 207 to 251
// RAII guard will cleanup all resources on exit
HSAResourceGuard resources;

// Initialize HSA
err = hsa_init();
if (err != HSA_STATUS_SUCCESS) {
  _exit(1);
}

Copilot AI Jan 25, 2026


The HSAResourceGuard RAII cleanup won’t run because the child uses _exit(...) on most paths, which bypasses destructors. Either avoid _exit when you want RAII cleanup (return from the function / call exit), or explicitly perform the required cleanup before _exit so resources/HSA shutdown behavior is deterministic.
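The point about `_exit` can be reproduced without HSA at all. The sketch below is an illustration under stated assumptions: `DemoGuard` is a hypothetical stand-in for `HSAResourceGuard` whose destructor only writes a byte to a pipe, and the helper names are invented for this example. It forks a child, runs the guard in it, and reports whether the destructor ran.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for the PR's HSAResourceGuard: the destructor only
// reports that it ran by writing one byte to `fd`.
struct DemoGuard {
  int fd;
  ~DemoGuard() {
    const char c = 'D';
    ssize_t ignored = write(fd, &c, 1);
    (void)ignored;
  }
};

// Returns an exit code instead of calling _exit() itself, so the guard is
// destroyed when the function scope ends, even on the failure path.
static int ChildBody(int fd) {
  DemoGuard guard{fd};
  return 1;  // Simulated failure; ~DemoGuard still runs.
}

// Calls _exit() while the guard is alive: the process terminates immediately
// and ~DemoGuard never runs.
static void ChildBodyEarlyExit(int fd) {
  DemoGuard guard{fd};
  _exit(1);
}

// Forks a child and returns true if the guard's destructor ran in it.
bool DestructorRanInChild(bool call_exit_inside_scope) {
  int fds[2];
  if (pipe(fds) != 0) return false;
  const pid_t pid = fork();
  if (pid < 0) {
    close(fds[0]);
    close(fds[1]);
    return false;
  }
  if (pid == 0) {
    close(fds[0]);
    if (call_exit_inside_scope) ChildBodyEarlyExit(fds[1]);
    const int rc = ChildBody(fds[1]);
    _exit(rc);  // Fine here: the guard has already been destroyed.
  }
  close(fds[1]);
  char c = 0;
  const ssize_t n = read(fds[0], &c, 1);  // EOF (0) means no destructor ran.
  close(fds[0]);
  int status = 0;
  waitpid(pid, &status, 0);
  return n == 1 && c == 'D';
}
```

Returning an exit code from a helper and calling `_exit` only in the outermost child code is one way to keep RAII cleanup while still avoiding the parent's `atexit` handlers and stdio flushes in the forked child; explicit cleanup before `_exit`, as the comment suggests, is the other.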

@cfreeamd force-pushed the cfreehil-gpucore-correction2 branch 3 times, most recently from fc8138f to cf1aa9e, January 26, 2026 14:59
Contributor

@kentrussell kentrussell left a comment


No issues from me aside from style things. Will let David chime in, though.

@cfreeamd cfreeamd requested a review from lancesix January 26, 2026 17:23
@cfreeamd
Contributor Author

cfreeamd commented Jan 26, 2026

FYI, the tests pass when I run them locally. In CI, however, the forked processes fail to generate a fault and a core, so the tests fail because no core is found.
So I added more prints to see what is failing in the child processes, and adjusted the tests so that they do not fail if no fault occurred (that case points to an error in the test or a test machine config issue rather than a problem in ROCr). My hope is that the new prints will explain why no fault occurs, so that I can adjust.
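The "do not fail if no fault occurred" policy described above could be expressed along these lines. This is a sketch, not the PR's test code: the `ChildOutcome` enum and both helpers are hypothetical, and it assumes that a GPU fault ends the child through a signal while a clean exit with status 0 means the kernel ran without faulting.

```cpp
#include <cassert>
#include <cerrno>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical classification of how a forked test child ended.
enum class ChildOutcome {
  kFaulted,     // Killed by a signal: a core dump is expected, so look for it.
  kNoFault,     // Exited with 0: nothing faulted, so skip rather than fail.
  kSetupError,  // Exited non-zero: the child could not set up the test.
  kWaitFailed   // fork()/waitpid() failed or the status was unexpected.
};

// Maps a waitpid() status to an outcome.
// Assumption: a GPU fault terminates the process via a signal.
ChildOutcome ClassifyChild(int status) {
  if (WIFSIGNALED(status)) return ChildOutcome::kFaulted;
  if (WIFEXITED(status)) {
    return WEXITSTATUS(status) == 0 ? ChildOutcome::kNoFault
                                    : ChildOutcome::kSetupError;
  }
  return ChildOutcome::kWaitFailed;
}

// Runs `body` in a forked child and classifies how the child ended.
ChildOutcome RunInChild(void (*body)()) {
  const pid_t pid = fork();
  if (pid < 0) return ChildOutcome::kWaitFailed;
  if (pid == 0) {
    body();
    _exit(0);  // Reached only if the body neither faulted nor exited.
  }
  int status = 0;
  pid_t r;
  do {
    r = waitpid(pid, &status, 0);
  } while (r < 0 && errno == EINTR);  // Retry if interrupted by a signal.
  if (r < 0) return ChildOutcome::kWaitFailed;
  return ClassifyChild(status);
}
```

With a split like this, the parent would search for a core file only on `kFaulted`, report a skip on `kNoFault`, and print the child's diagnostics on `kSetupError`, which keeps CI configuration problems distinguishable from ROCr regressions.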

@cfreeamd force-pushed the cfreehil-gpucore-correction2 branch from de66a29 to 9c789b7, January 27, 2026 15:12
@cfreeamd
Contributor Author

The new rocrtstFunc.GpuCoreDump_* tests pass in the PSDB tests, but other, unrelated tests are failing:
http://rocm-ci.amd.com//job/rocm-tests/38626/testReport/junit/tests/TestSuite/test_rocrtst/
