Conversation

@perezjosibm perezjosibm commented Nov 24, 2025

It would be useful to run gdb on coredump files on their originating machine, extract their backtrace and locals, and pack these as part of the tarball.
scrape.py would then use this information as a quick summary of the issue, which should save time when examining failures whose cores have been dropped. This requires running gdb on the remote machine, so it also needs to ensure that gdb has been installed, which might be restricted to -debug builds only.
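
For concreteness, a minimal sketch of the kind of gdb invocation meant here (the function name and output-file handling are illustrative, not part of this PR):

    import subprocess

    def backtrace_from_core(binary: str, core: str, output: str) -> None:
        """Run gdb in batch mode and save the backtrace plus locals to a text file."""
        with open(output, 'w') as gdb_out:
            subprocess.run(
                ['gdb', '--batch',
                 '-ex', 'set pagination 0',
                 '-ex', 'thread apply all bt full',  # backtrace with locals for every thread
                 binary, core],
                stdout=gdb_out, stderr=subprocess.STDOUT, check=False)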

@athanatos
Contributor

This seems like a good idea.

@rzarzynski rzarzynski previously approved these changes Nov 25, 2025
Contributor

I very much like the idea! It addresses a common headache: deploying a binary environment fully compatible with Teuthology (ceph-debug-docker.sh doesn't always work, sorry) just to run gdb for a backtrace.

I think there is also demand for injecting other gdb commands, specific to a particular investigation. I wonder whether a facility to inject further commands through a task's yaml would be helpful; customizing the teuthology code is also doable, but harder.
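
For instance, extra commands from the task's yaml could simply be appended as further -ex arguments (a hypothetical sketch; the helper below is invented for illustration):

    def build_gdb_args(binary, core, extra_cmds=None):
        """Compose a gdb batch command line, appending any user-supplied commands."""
        args = ['gdb', '--batch', '-ex', 'set pagination 0']
        for cmd in extra_cmds or []:  # e.g. read from the task's yaml
            args += ['-ex', cmd]
        return args + [binary, core]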

Signed-off-by: Jose J Palacios-Perez <perezjos@uk.ibm.com>
Signed-off-by: Jose J Palacios-Perez <perezjos@uk.ibm.com>
@perezjosibm
Author

I've tested the new code separately, as a standalone script, to verify that it correctly recognises the types of compressed dumps:
ut_coredump
Next I'm going to verify it within teuthology, together with the extensions to the unit tests.
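
For reference, the compression type can be recognised from the file's magic bytes; a sketch of such a check (not necessarily the PR's exact implementation):

    # Leading magic bytes of the compressed-core formats most likely to appear.
    MAGIC_BYTES = {
        b'\x1f\x8b': 'gzip',
        b'\x28\xb5\x2f\xfd': 'zstd',
        b'\xfd7zXZ\x00': 'xz',
    }

    def compression_type(path: str) -> str | None:
        """Return the compression format of a file, or None if unrecognised."""
        with open(path, 'rb') as f:
            head = f.read(8)
        for magic, name in MAGIC_BYTES.items():
            if head.startswith(magic):
                return name
        return None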

@perezjosibm perezjosibm marked this pull request as ready for review November 27, 2025 21:34
@perezjosibm perezjosibm requested a review from a team as a code owner November 27, 2025 21:34
@perezjosibm perezjosibm requested review from VallariAg and kamoltat and removed request for a team November 27, 2025 21:34
log.info(f'Getting backtrace from core {dump} ...')
with open(gdb_output_path, 'w') as gdb_out:
gdb_proc = subprocess.Popen(
['gdb', '--batch', '-ex', 'set pagination 0',
Contributor

This looks like it introduces another dependency for the teuthology worker. Also, what is going to happen when teuthology is run in a non-Linux environment where gdb is not installed, for example macOS?

Author

@kshtsk Precisely, I wanted to ask about the dependencies: where is the correct place to ensure that the distro's gdb package gets installed? And how can we make this functionality optional, e.g. only when using a debug build?

raise RuntimeError('Stale jobs detected, aborting.')


def get_backtraces_from_coredumps(coredump_path, dump_path, dump_program, dump):
Contributor

Here and in the other method, could you please use type annotations for the arguments and return values?
A description of the arguments in the docstring would also be great to have.
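
For example, the annotated signature could look like this (the argument semantics below are my reading of the names, so adjust to the actual meaning):

    def get_backtraces_from_coredumps(coredump_path: str, dump_path: str,
                                      dump_program: str, dump: str) -> None:
        """Extract backtrace information from a coredump using gdb.

        :param coredump_path: directory holding the collected core dumps
        :param dump_path: full path of the (possibly compressed) core file
        :param dump_program: executable that produced the core
        :param dump: name of the core file being processed
        """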

@perezjosibm
Author

I think the comment in ./teuthology/task/install/__init__.py:448: extra_packages: ['samba'] suggests the way to ensure gdb is installed on the nodes, which to my understanding is part of the QA suite, not of the teuthology infrastructure, e.g.
./ceph/qa/workunits/rgw/test_rgw_datacache.py 🤔
The following suggests that gdb is already installed:
/workunits/rados/test_crash.sh:15: gdb_output=$(echo "quit" | sudo gdb /usr/bin/ceph-osd $f)
Will continue searching.
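(By analogy with the samba example, a hypothetical task override would be extra_packages: ['gdb'] in the install task's yaml.)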

@kshtsk
Contributor

kshtsk commented Dec 1, 2025

If you're looking for the place where gdb is supposed to be installed on the teuthology worker node, then it should be added to the bootstrap script, and probably to the corresponding role in the ceph-cm-ansible repo.
It may also make sense to make this functionality optional, if we really don't want to use gdb in some configurations.

@perezjosibm
Author

Mmh, I'm getting the following failure when attempting to test:

(virtualenv) jjperez@teuthology:~/teuthology$ teuthology-suite -m smithi -s crimson-rados -p 101 --force-priority -c wip-perezjos-crimson-only-26-11-2025-PR66406 -f crimson-debug --distro centos --distro-version 9 --suite-branch main --suite-repo https://github.com/ceph/ceph.git -l 3 -t wip-perezjos-gdb-core --dry-run
2025-12-01 19:43:40,031.031 INFO:teuthology.suite:Using random seed=8401
2025-12-01 19:43:40,032.032 INFO:teuthology.suite.run:Checking for expiration (None)
2025-12-01 19:43:40,032.032 INFO:teuthology.suite.run:kernel sha1: distro
2025-12-01 19:43:40,631.631 INFO:teuthology.suite.run:ceph sha1: 96738ca7c8ea02a1e8bc1ef0617ca6d109469ac3
2025-12-01 19:43:40,800.800 INFO:teuthology.suite.run:ceph version: 20.3.0-4281.g96738ca7
2025-12-01 19:43:41,110.110 INFO:teuthology.suite.run:ceph branch: main 54f2c9b819ca94bdc5a064f9962de8f97bfcf4cc
2025-12-01 19:43:41,129.129 INFO:teuthology.repo_utils:Cloning https://github.com/ceph/ceph.git main from upstream
Traceback (most recent call last):
  File "/home/jjperez/teuthology/virtualenv/bin/teuthology-suite", line 7, in <module>
    sys.exit(main())
  File "/home/jjperez/teuthology/scripts/suite.py", line 232, in main
    return teuthology.suite.main(args)
  File "/home/jjperez/teuthology/teuthology/suite/__init__.py", line 138, in main
    run = Run(conf)
  File "/home/jjperez/teuthology/teuthology/suite/run.py", line 56, in __init__
    self.base_config = self.create_initial_config()
  File "/home/jjperez/teuthology/teuthology/suite/run.py", line 111, in create_initial_config
    teuthology_branch, teuthology_sha1 = self.choose_teuthology_branch()
  File "/home/jjperez/teuthology/teuthology/suite/run.py", line 322, in choose_teuthology_branch
    util.schedule_fail(message=str(exc), name=self.name, dry_run=self.args.dry_run)
  File "/home/jjperez/teuthology/teuthology/suite/util.py", line 77, in schedule_fail
    raise ScheduleFailError(message, name)
teuthology.exceptions.ScheduleFailError: Scheduling jjperez-2025-12-01_19:43:40-crimson-rados-wip-perezjos-crimson-only-26-11-2025-PR66406-distro-crimson-debug-smithi failed: Branch 'wip-perezjos-gdb-core' not found in repo: https://git.ceph.com/teuthology!

These are my remotes in my local checkout:

# git remote -v show
origin  git@github.com:perezjosibm/teuthology (fetch)
origin  git@github.com:perezjosibm/teuthology (push)
teutho  git@github.com:ceph/teuthology.git (fetch)
teutho  git@github.com:ceph/teuthology.git (push)

I pushed like this:
git push --no-verify -vvv -f --progress teutho wip-perezjos-gdb-core
Am I missing something?

@perezjosibm
Author

Progressing (after empty commit to retrigger):
[Screenshot: test run progressing, 2025-12-02 13:15:29]


@kshtsk
Contributor

kshtsk commented Dec 2, 2025

The unit tests are failing.

@perezjosibm
Author

perezjosibm commented Dec 3, 2025

> If you're looking for the place where gdb is supposed to be installed on the teuthology worker node,

Precisely, yes (assuming a worker node runs on the remote machine -- e.g. smithi -- that originates the core when a Linux process aborts).

> then it should be added to the bootstrap script, and probably to the corresponding role in the ceph-cm-ansible repo.

OK, will look at that.

> It may also make sense to make this functionality optional, if we really don't want to use gdb in some configurations.

Yes, we would like to use it on the worker (remote) nodes that originate the coredump (e.g. smithi machines), ideally when running debug builds. Does this mean that the functionality of this PR is better suited somewhere else (e.g. ceph/qa) instead?

@kshtsk
Contributor

kshtsk commented Dec 4, 2025

The teuthology worker machine is the local node where the dispatcher and supervisor run; nodes like smithi are the remote test targets. The code in this PR currently runs gdb locally (via Popen). In order to run gdb on the remote (i.e. test target) nodes, you need to call the Remote.sh() or Remote.run() methods. To make sure the remote nodes have gdb installed, the corresponding role in the ceph-cm-ansible repo should be updated, unless it already carries the gdb dependency: https://github.com/ceph/ceph-cm-ansible/tree/main/roles/testnode
To make this functionality optional, one might want to call gdb -v to check whether gdb is present, and which version, before proceeding with the core dumps.
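
A rough sketch of that probe, using the Remote API mentioned above (the exact keyword arguments may differ from the real signature):

    def remote_has_gdb(remote) -> bool:
        """Check whether gdb is available on a test node before processing cores."""
        proc = remote.run(args=['gdb', '--version'], check_status=False)
        return proc.exitstatus == 0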

@perezjosibm
Author

perezjosibm commented Dec 8, 2025

Ok, rewriting the approach:

  • I'll wrap the gdb invocation in a bash script that is executed on the remote; it will perform the relevant checks, such as which type of compression is used. On success the script produces a text file with the backtrace and locals from gdb, placed in the same path as the cores (see the sketch after this list).
  • The unit tests will simply mock the invocation of the script.
  • The Python code added to decompress the core is no longer required and will be removed.
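
A rough sketch of how the teuthology side could drive that script (the helper script's name and location are placeholders, not final):

    def collect_remote_backtraces(remote, core_dir: str) -> None:
        """Run the gdb helper script on the remote; it writes a text file with
        backtrace and locals next to each core found under core_dir."""
        remote.run(args=['bash', 'gdb_core_helper.sh', core_dir],  # hypothetical helper name
                   check_status=False)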

Signed-off-by: Jose J Palacios-Perez <perezjos@uk.ibm.com>