Conversation

@perezjosibm perezjosibm commented Nov 24, 2025

It would be useful to run gdb on coredump files on their originating machine, extract their backtrace and locals, and pack these as part of the tarball.
scrape.py would then use this information as a quick summary of the issue, which should save time when examining failures whose cores have been dropped. This requires running gdb on the remote machine, so it also needs to ensure that gdb has been installed, which might be restricted to -debug builds only.
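
For concreteness, a minimal sketch of the kind of gdb invocation meant here (the function name and output-file handling are illustrative, not part of this PR):

    import subprocess

    def backtrace_from_core(binary: str, core: str, output: str) -> None:
        """Run gdb in batch mode and save the backtrace plus locals to a text file."""
        with open(output, 'w') as gdb_out:
            subprocess.run(
                ['gdb', '--batch',
                 '-ex', 'set pagination 0',
                 '-ex', 'thread apply all bt full',  # backtrace with locals for every thread
                 binary, core],
                stdout=gdb_out, stderr=subprocess.STDOUT, check=False)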

@athanatos
Contributor

This seems like a good idea.

@rzarzynski rzarzynski previously approved these changes Nov 25, 2025
Contributor

I very much like the idea! It addresses a common headache: deploying a binary environment fully compatible with Teuthology (ceph-debug-docker.sh doesn't always work, sorry) just to run gdb for a backtrace.

I think there is also demand for injecting other gdb commands, specific to a particular investigation. I wonder whether a facility to inject further commands through a task's yaml would be helpful; customizing the teuthology code is also doable, but harder.
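
For instance, extra commands from the task's yaml could simply be appended as further -ex arguments (a hypothetical sketch; the helper below is invented for illustration):

    def build_gdb_args(binary, core, extra_cmds=None):
        """Compose a gdb batch command line, appending any user-supplied commands."""
        args = ['gdb', '--batch', '-ex', 'set pagination 0']
        for cmd in extra_cmds or []:  # e.g. read from the task's yaml
            args += ['-ex', cmd]
        return args + [binary, core]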

Signed-off-by: Jose J Palacios-Perez <perezjos@uk.ibm.com>
Signed-off-by: Jose J Palacios-Perez <perezjos@uk.ibm.com>
@perezjosibm
Author

I've tested the new code separately, as a standalone script, to verify that it correctly recognises the types of compressed dumps:
ut_coredump
Next I'm going to verify it within teuthology, together with the extensions to the unit tests.
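
For reference, the compression type can be recognised from the file's magic bytes; a sketch of such a check (not necessarily the PR's exact implementation):

    # Leading magic bytes of the compressed-core formats most likely to appear.
    MAGIC_BYTES = {
        b'\x1f\x8b': 'gzip',
        b'\x28\xb5\x2f\xfd': 'zstd',
        b'\xfd7zXZ\x00': 'xz',
    }

    def compression_type(path: str) -> str | None:
        """Return the compression format of a file, or None if unrecognised."""
        with open(path, 'rb') as f:
            head = f.read(8)
        for magic, name in MAGIC_BYTES.items():
            if head.startswith(magic):
                return name
        return None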

@perezjosibm perezjosibm marked this pull request as ready for review November 27, 2025 21:34
@perezjosibm perezjosibm requested a review from a team as a code owner November 27, 2025 21:34
@perezjosibm perezjosibm requested review from VallariAg and kamoltat and removed request for a team November 27, 2025 21:34
log.info(f'Getting backtrace from core {dump} ...')
with open(gdb_output_path, 'w') as gdb_out:
gdb_proc = subprocess.Popen(
['gdb', '--batch', '-ex', 'set pagination 0',
Contributor

This looks like it introduces another dependency for the teuthology worker. Also, what is going to happen when teuthology is run in a non-Linux environment where gdb is not installed, for example macOS?

Author

@kshtsk Precisely, I wanted to ask about the dependencies: where is the correct place to ensure that the distro's gdb package gets installed? And how can we make this functionality optional, e.g. only when using a debug build?

raise RuntimeError('Stale jobs detected, aborting.')


def get_backtraces_from_coredumps(coredump_path, dump_path, dump_program, dump):
Contributor

Here and in the other method, could you please use type annotations for the arguments and return values?
A description of the arguments in the docstring would also be great to have.
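
For example, the annotated signature could look like this (the argument semantics below are my reading of the names, so adjust to the actual meaning):

    def get_backtraces_from_coredumps(coredump_path: str, dump_path: str,
                                      dump_program: str, dump: str) -> None:
        """Extract backtrace information from a coredump using gdb.

        :param coredump_path: directory holding the collected core dumps
        :param dump_path: full path of the (possibly compressed) core file
        :param dump_program: executable that produced the core
        :param dump: name of the core file being processed
        """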

@perezjosibm
Author

I think the comment in ./teuthology/task/install/__init__.py:448: extra_packages: ['samba'] suggests the way to ensure gdb is installed on the nodes, which to my understanding is part of the QA suite, not of the teuthology infrastructure, e.g.
./ceph/qa/workunits/rgw/test_rgw_datacache.py 🤔
The following suggests that gdb is already installed:
/workunits/rados/test_crash.sh:15: gdb_output=$(echo "quit" | sudo gdb /usr/bin/ceph-osd $f)
Will continue searching.
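(By analogy with the samba example, a hypothetical task override would be extra_packages: ['gdb'] in the install task's yaml.)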

@kshtsk
Contributor

kshtsk commented Dec 1, 2025

If you're looking for the place where gdb is supposed to be installed on the teuthology worker node, then it should be added to the bootstrap script, and probably to the corresponding role in the ceph-cm-ansible repo.
It may also make sense to make this functionality optional, if we really don't want to use gdb in some configurations.

@perezjosibm
Author

Mmh, I'm getting the following failure when attempting to test:

(virtualenv) jjperez@teuthology:~/teuthology$ teuthology-suite -m smithi -s crimson-rados -p 101 --force-priority -c wip-perezjos-crimson-only-26-11-2025-PR66406 -f crimson-debug --distro centos --distro-version 9 --suite-branch main --suite-repo https://github.com/ceph/ceph.git -l 3 -t wip-perezjos-gdb-core --dry-run
2025-12-01 19:43:40,031.031 INFO:teuthology.suite:Using random seed=8401
2025-12-01 19:43:40,032.032 INFO:teuthology.suite.run:Checking for expiration (None)
2025-12-01 19:43:40,032.032 INFO:teuthology.suite.run:kernel sha1: distro
2025-12-01 19:43:40,631.631 INFO:teuthology.suite.run:ceph sha1: 96738ca7c8ea02a1e8bc1ef0617ca6d109469ac3
2025-12-01 19:43:40,800.800 INFO:teuthology.suite.run:ceph version: 20.3.0-4281.g96738ca7
2025-12-01 19:43:41,110.110 INFO:teuthology.suite.run:ceph branch: main 54f2c9b819ca94bdc5a064f9962de8f97bfcf4cc
2025-12-01 19:43:41,129.129 INFO:teuthology.repo_utils:Cloning https://github.com/ceph/ceph.git main from upstream
Traceback (most recent call last):
  File "/home/jjperez/teuthology/virtualenv/bin/teuthology-suite", line 7, in <module>
    sys.exit(main())
  File "/home/jjperez/teuthology/scripts/suite.py", line 232, in main
    return teuthology.suite.main(args)
  File "/home/jjperez/teuthology/teuthology/suite/__init__.py", line 138, in main
    run = Run(conf)
  File "/home/jjperez/teuthology/teuthology/suite/run.py", line 56, in __init__
    self.base_config = self.create_initial_config()
  File "/home/jjperez/teuthology/teuthology/suite/run.py", line 111, in create_initial_config
    teuthology_branch, teuthology_sha1 = self.choose_teuthology_branch()
  File "/home/jjperez/teuthology/teuthology/suite/run.py", line 322, in choose_teuthology_branch
    util.schedule_fail(message=str(exc), name=self.name, dry_run=self.args.dry_run)
  File "/home/jjperez/teuthology/teuthology/suite/util.py", line 77, in schedule_fail
    raise ScheduleFailError(message, name)
teuthology.exceptions.ScheduleFailError: Scheduling jjperez-2025-12-01_19:43:40-crimson-rados-wip-perezjos-crimson-only-26-11-2025-PR66406-distro-crimson-debug-smithi failed: Branch 'wip-perezjos-gdb-core' not found in repo: https://git.ceph.com/teuthology!

These are my remotes in my local checkout:

# git remote -v show
origin  git@github.com:perezjosibm/teuthology (fetch)
origin  git@github.com:perezjosibm/teuthology (push)
teutho  git@github.com:ceph/teuthology.git (fetch)
teutho  git@github.com:ceph/teuthology.git (push)

I pushed like this:
git push --no-verify -vvv -f --progress teutho wip-perezjos-gdb-core
Am I missing something?

@perezjosibm
Author

Progressing (after empty commit to retrigger):
[Screenshot: test run progressing, 2025-12-02 13:15:29]


@kshtsk
Contributor

kshtsk commented Dec 2, 2025

The unit tests are failing.

@perezjosibm
Author

perezjosibm commented Dec 3, 2025

> If you're looking for the place where gdb is supposed to be installed on the teuthology worker node,

Precisely, yes (assuming a worker node runs on the remote machine -- e.g. smithi -- that originates the core when a Linux process aborts).

> then it should be added to the bootstrap script, and probably to the corresponding role in the ceph-cm-ansible repo.

OK, will look at that.

> It may also make sense to make this functionality optional, if we really don't want to use gdb in some configurations.

Yes, we would like to use it on the worker (remote) nodes that originate the coredump (e.g. smithi machines), ideally when running debug builds. Does this mean that the functionality of this PR is better suited somewhere else (e.g. ceph/qa) instead?

@kshtsk
Contributor

kshtsk commented Dec 4, 2025

The teuthology worker machine is the local node where the dispatcher and supervisor run; nodes like smithi are the remote test targets. The code in this PR currently runs gdb locally (via Popen). In order to run gdb on the remote (i.e. test target) nodes, you need to call the Remote.sh() or Remote.run() methods. To make sure the remote nodes have gdb installed, the corresponding role in the ceph-cm-ansible repo should be updated, unless it already carries the gdb dependency: https://github.com/ceph/ceph-cm-ansible/tree/main/roles/testnode
To make this functionality optional, one might want to call gdb -v to check whether gdb is present, and which version, before proceeding with the core dumps.
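
A rough sketch of that probe, using the Remote API mentioned above (the exact keyword arguments may differ from the real signature):

    def remote_has_gdb(remote) -> bool:
        """Check whether gdb is available on a test node before processing cores."""
        proc = remote.run(args=['gdb', '--version'], check_status=False)
        return proc.exitstatus == 0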

@perezjosibm
Author

perezjosibm commented Dec 8, 2025

Ok, rewriting the approach:

  • I'll wrap the gdb invocation in a bash script that is executed on the remote; it will perform the relevant checks, such as which type of compression is used. On success the script produces a text file with the backtrace and locals from gdb, placed in the same path as the cores (see the sketch after this list).
  • The unit tests will simply mock the invocation of the script.
  • The Python code added to decompress the core is no longer required and will be removed.
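
A rough sketch of how the teuthology side could drive that script (the helper script's name and location are placeholders, not final):

    def collect_remote_backtraces(remote, core_dir: str) -> None:
        """Run the gdb helper script on the remote; it writes a text file with
        backtrace and locals next to each core found under core_dir."""
        remote.run(args=['bash', 'gdb_core_helper.sh', core_dir],  # hypothetical helper name
                   check_status=False)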

Signed-off-by: Jose J Palacios-Perez <perezjos@uk.ibm.com>