rocr: Correct gpu dumped core contents #2851
base: develop
Pull request overview
Updates ROCr GPU core dump generation (especially for regular files) and adds rocrtst coverage for configurable GPU core dump patterns and content validation.
Changes:
- Update core dump writer to use `pwrite` for regular files and adjust size-limit handling/truncation behavior.
- Add new rocrtst functional tests for GPU core dump patterns, disable flag, pipe patterns, and basic ELF/content integrity checks.
- Add a faulting-kernel test case (disabled by default) and wire new tests into the rocrtst test runner.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| projects/rocr-runtime/runtime/hsa-runtime/libamdhsacode/lnx/amd_core_dump.cpp | Adds pwrite-based emission for regular files and changes size-limit truncation logic for core dump writing. |
| projects/rocr-runtime/runtime/hsa-runtime/core/runtime/runtime.cpp | Adds a VM fault handler stderr print. |
| projects/rocr-runtime/rocrtst/suites/test_common/main.cc | Registers new GPU core dump tests and replaces the prior interrupt-disabled example test with a disabled faulting test. |
| projects/rocr-runtime/rocrtst/suites/functional/test_fault_example.h | Declares a new fault-inducing test case (disabled by default). |
| projects/rocr-runtime/rocrtst/suites/functional/test_fault_example.cc | Implements a kernel dispatch that intentionally passes null pointers to trigger a GPU fault. |
| projects/rocr-runtime/rocrtst/suites/functional/gpu_coredump.h | Declares a new test fixture for core dump pattern/content validation. |
| projects/rocr-runtime/rocrtst/suites/functional/gpu_coredump.cc | Implements multiple GPU core dump tests, including pattern matching, ELF validation, and segment checks. |
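For readers unfamiliar with what a basic ELF validation step might look like, the following is a minimal sketch of an identification-header check using the constants from `<elf.h>`. It is not taken from `gpu_coredump.cc`; the function name `HasElf64Ident` is hypothetical, and the expectation that the dump is a 64-bit ELF file is an assumption made for the example.

```cpp
#include <cstddef>
#include <cstring>
#include <elf.h>

// Hypothetical check (not from the PR): returns true if `bytes` begins
// with a valid ELF identification block for a 64-bit object.
// Assumption: GPU core dumps are expected to be ELFCLASS64.
bool HasElf64Ident(const unsigned char* bytes, size_t len) {
  if (len < EI_NIDENT) return false;                            // too short for e_ident
  if (std::memcmp(bytes, ELFMAG, SELFMAG) != 0) return false;   // "\177ELF" magic
  return bytes[EI_CLASS] == ELFCLASS64;
}
```

A test could read the first `EI_NIDENT` bytes of the generated core file and pass them to a check like this before parsing program headers for segment validation.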
projects/rocr-runtime/runtime/hsa-runtime/libamdhsacode/lnx/amd_core_dump.cpp
projects/rocr-runtime/runtime/hsa-runtime/core/runtime/runtime.cpp
Outdated
    // RAII guard will cleanup all resources on exit
    HSAResourceGuard resources;

    // Initialize HSA
    err = hsa_init();
    if (err != HSA_STATUS_SUCCESS) {
      _exit(1);
    }
Copilot AI, Jan 25, 2026
The HSAResourceGuard RAII cleanup won’t run because the child uses _exit(...) on most paths, which bypasses destructors. Either avoid _exit when you want RAII cleanup (return from the function / call exit), or explicitly perform the required cleanup before _exit so resources/HSA shutdown behavior is deterministic.
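The behavior described in this comment can be reproduced with a small sketch. `PipeGuard` below is a hypothetical stand-in for `HSAResourceGuard` (whose definition is not shown here): its destructor writes one byte to a pipe so the parent can observe whether it ran in the forked child. Note that `std::exit` also skips destructors of automatic-storage objects, so leaving the guard's scope normally is the dependable way to get RAII cleanup.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for HSAResourceGuard: signals destruction by
// writing a single byte to `fd`.
struct PipeGuard {
  int fd;
  ~PipeGuard() {
    char c = 'D';
    ssize_t r = write(fd, &c, 1);
    (void)r;
  }
};

// Forks a child that holds a PipeGuard. The child either calls _exit()
// while the guard is still in scope, or leaves the scope first.
// Returns true if the parent observed the destructor running.
bool DestructorRanInChild(bool exit_inside_scope) {
  int fds[2];
  if (pipe(fds) != 0) return false;
  pid_t pid = fork();
  if (pid < 0) {
    close(fds[0]);
    close(fds[1]);
    return false;
  }
  if (pid == 0) {
    close(fds[0]);
    {
      PipeGuard guard{fds[1]};
      if (exit_inside_scope) _exit(0);  // destructor is skipped
    }                                   // destructor runs here
    _exit(0);
  }
  close(fds[1]);
  char c = 0;
  ssize_t n = read(fds[0], &c, 1);  // 0 means EOF: nothing was written
  close(fds[0]);
  int status = 0;
  waitpid(pid, &status, 0);
  return n == 1 && c == 'D';
}
```

Calling `DestructorRanInChild(true)` models the path flagged in the review, while `DestructorRanInChild(false)` models scoping the guard so it is destroyed before the child terminates.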
fc8138f to cf1aa9e
kentrussell left a comment
No issues from me aside from style things. Will let David chime in though
FYI, the tests pass when I run locally. But in CI, the forked processes fail to generate a fault and core, so the tests end up failing because a core isn't found.
Includes several tests (rocrtst) for this capability.
de66a29 to 9c789b7
The new rocrtstFunc.GpuCoreDump_* tests pass in the PSDB tests, but other non-related tests are failing.