[BUG] Segmentation fault seems related to librdkafka-8fb48086.so.1 with GDB backtrace #44442

@orange-kao

Agent version

7.72.2

Bug Report

We are Aiven, a database-as-a-service provider. We noticed that datadog-agent was repeatedly crashing with a segmentation fault and restarting on one of our customers' Kafka services.

The crash seems related to librdkafka-8fb48086.so.1, located at /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka.libs/librdkafka-8fb48086.so.1 (build ID ddd8a17481071521b7a5b9e0edc9fff3ec594879).

We generated the GDB backtrace below using the debug package datadog-agent-dbg-7.72.2-1.x86_64.rpm.

This only affects a small number of nodes, but it would be great if it could be fixed in a future agent version.

Thank you.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/datadog-agent/bin/agent/agent run -p /opt/datadog-agent/run/agent.pid'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fee849f02a1 in rd_kafka_cgrp_op () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
[Current thread is 1 (Thread 0x7fee337e66c0 (LWP 71662))]
warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts
of file /opt/datadog-agent/.debug/opt/datadog-agent/bin/agent/agent.dbg.
Use `info auto-load python-scripts [REGEXP]' to list them.
Missing rpms, try: dnf --enablerepo='*debug*' install datadog-agent-debuginfo-7.72.2-1.x86_64
(gdb) backtrace
#0  0x00007fee849f02a1 in rd_kafka_cgrp_op () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#1  0x00007fee849c6419 in rd_kafka_handle_OffsetFetch () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#2  0x00007fee84a58830 in rd_kafka_ListConsumerGroupOffsetsResponse_parse () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#3  0x00007fee84a84f67 in rd_kafka_admin_worker () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#4  0x00007fee84a5e15d in rd_kafka_admin_handle_response () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#5  0x00007fee849ad104 in rd_kafka_buf_callback () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#6  0x00007fee849b3ed0 in rd_kafka_op_handle_std () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#7  0x00007fee849b3f78 in rd_kafka_op_handle () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#8  0x00007fee849b08a4 in rd_kafka_q_serve0[localalias] () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#9  0x00007fee849b0c2b in rd_kafka_q_serve () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#10 0x00007fee8497cf53 in rd_kafka_thread_main () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#11 0x00007fee84a30747 in _thrd_wrapper_function () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#12 0x00007feed0147f54 in start_thread (arg=<optimized out>) at pthread_create.c:448
#13 0x00007feed01cb32c in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
(gdb) info threads
  Id   Target Id                         Frame 
* 1    Thread 0x7fee337e66c0 (LWP 71662) 0x00007fee849f02a1 in rd_kafka_cgrp_op () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
  2    Thread 0x7fee8909f6c0 (LWP 69933) runtime.futex () at runtime/sys_linux_amd64.s:558
  3    Thread 0x7fee3a7f46c0 (LWP 71686) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
  4    Thread 0x7fee8809d6c0 (LWP 69935) runtime.futex () at runtime/sys_linux_amd64.s:558
  5    Thread 0x7fee6f9e86c0 (LWP 70011) runtime.futex () at runtime/sys_linux_amd64.s:558
  6    Thread 0x7fee39ff36c0 (LWP 71663) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
  7    Thread 0x7fee6cf6c6c0 (LWP 70199) internal/runtime/syscall.Syscall6 () at internal/runtime/syscall/asm_linux_amd64.s:36
  8    Thread 0x7fee3c7f86c0 (LWP 71675) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
  9    Thread 0x7fee3effd6c0 (LWP 71689) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
  10   Thread 0x7fee867da6c0 (LWP 69942) internal/runtime/syscall.Syscall6 () at internal/runtime/syscall/asm_linux_amd64.s:36
  11   Thread 0x7fee8705b6c0 (LWP 69941) runtime.futex () at runtime/sys_linux_amd64.s:558
  12   Thread 0x7fee59ffb6c0 (LWP 70223) runtime.futex () at runtime/sys_linux_amd64.s:558
  13   Thread 0x7fee5bfff6c0 (LWP 70200) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
  14   Thread 0x7fee6e4146c0 (LWP 70023) runtime.futex () at runtime/sys_linux_amd64.s:558
  15   Thread 0x7fee3cff96c0 (LWP 71674) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
  16   Thread 0x7fee3ffff6c0 (LWP 71671) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
  17   Thread 0x7fee5b7fe6c0 (LWP 71666) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
  18   Thread 0x7fee3f7fe6c0 (LWP 71688) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
  19   Thread 0x7feecfd40780 (LWP 69930) runtime.futex () at runtime/sys_linux_amd64.s:558
  20   Thread 0x7fee3d7fa6c0 (LWP 71676) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
  21   Thread 0x7fee8889e6c0 (LWP 69934) runtime.futex () at runtime/sys_linux_amd64.s:558
  22   Thread 0x7fee2b7d66c0 (LWP 71684) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
  23   Thread 0x7fee2afd56c0 (LWP 71685) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
  24   Thread 0x7fee3e7fc6c0 (LWP 71687) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
  25   Thread 0x7fee5affd6c0 (LWP 71667) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
  26   Thread 0x7fee5a7fc6c0 (LWP 71668) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
  27   Thread 0x7fee597fa6c0 (LWP 71669) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
  28   Thread 0x7fee58ff96c0 (LWP 71670) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
  29   Thread 0x7fee33fe76c0 (LWP 71661) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
  30   Thread 0x7fee8789c6c0 (LWP 69936) runtime.futex () at runtime/sys_linux_amd64.s:558
  31   Thread 0x7fee6d86d6c0 (LWP 70198) __strlen_avx2_rtm () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:77
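
For triage, the `info threads` listing above can be grouped by top frame, which makes it easy to see that only one thread was inside librdkafka when the process crashed. The following is a minimal sketch, not part of the agent or of GDB: the helper name `summarize_threads` and the regular expression are assumptions for illustration, tuned only to the line format shown in this report.

```python
import re
from collections import Counter

# Matches GDB "info threads" rows such as:
#   "* 1    Thread 0x7fee337e66c0 (LWP 71662) 0x00007fee849f02a1 in rd_kafka_cgrp_op () from ..."
#   "  2    Thread 0x7fee8909f6c0 (LWP 69933) runtime.futex () at runtime/sys_linux_amd64.s:558"
# The "0x... in " prefix is optional because GDB omits it when source info is available.
THREAD_LINE = re.compile(
    r"^\*?\s*\d+\s+Thread 0x[0-9a-f]+ \(LWP \d+\)\s+"
    r"(?:0x[0-9a-f]+ in )?(?P<func>\S+) \("
)


def summarize_threads(gdb_output: str) -> Counter:
    """Count threads by the function name of their top frame (hypothetical helper)."""
    counts = Counter()
    for line in gdb_output.splitlines():
        match = THREAD_LINE.match(line)
        if match:  # header rows and unrelated GDB output are skipped
            counts[match.group("func")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (assumed workflow): save the GDB output to a file and pipe it in.
    for func, count in summarize_threads(sys.stdin.read()).most_common():
        print(f"{count:4d}  {func}")
```

Run against the listing above, this should show most threads parked in `runtime.futex` or `__syscall_cancel_arch`, with a single thread in `rd_kafka_cgrp_op`.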

We also see the following logs before the datadog-agent segmentation fault:

datadog-agent: 2025-12-15 04:38:02 UTC | CORE | ERROR | (comp/metadata/inventorychecks/inventorychecksimpl/inventorychecks.go:304 in getFilesMetadata) | could not read files metadata
datadog-agent: 2025-12-15 04:39:21 UTC | CORE | ERROR | (pkg/collector/python/datadog_agent.go:143 in LogMessage) | kafka_consumer:15c8da990f5ee968 | (kafka_consumer.py:78) | There was a problem collecting the highwater mark offsets.
datadog-agent: Traceback (most recent call last):
datadog-agent:   File "/opt/datadog-agent/embedded/lib/python3.13/site-packages/datadog_checks/kafka_consumer/kafka_consumer.py", line 70, in check
datadog-agent:     highwater_offsets, cluster_id = self.get_highwater_offsets(consumer_offsets)
datadog-agent:                                     ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
datadog-agent:   File "/opt/datadog-agent/embedded/lib/python3.13/site-packages/datadog_checks/kafka_consumer/kafka_consumer.py", line 383, in get_highwater_offsets
datadog-agent:     for topic, partition, offset in self.client.consumer_offsets_for_times(
datadog-agent:                                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
datadog-agent:         partitions=topic_partitions_to_check
datadog-agent:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
datadog-agent:     ):
datadog-agent:     ^
datadog-agent:   File "/opt/datadog-agent/embedded/lib/python3.13/site-packages/datadog_checks/kafka_consumer/client.py", line 105, in consumer_offsets_for_times
datadog-agent:     for tp in self._consumer.offsets_for_times(
datadog-agent:               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
datadog-agent:         partitions=topicpartitions_for_querying, timeout=self.config._request_timeout
datadog-agent:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
datadog-agent:     )
datadog-agent:     ^
datadog-agent: cimpl.KafkaException: KafkaError{code=FENCED_LEADER_EPOCH,val=74,str="Failed to get offsets: Broker: Leader epoch is older than broker epoch"}
datadog-agent: 2025-12-15 04:41:46 UTC | JMX | WARN | jmxfetch-recoveryPool-1 | Instance | Cannot get attributes or class name for bean kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec,topic=publication_delivery-api-connector_DLT:
datadog-agent: java.lang.NullPointerException: Cannot invoke "org.datadog.jmxfetch.Connection.getMBeanInfo(javax.management.ObjectName)" because "this.connection" is null
datadog-agent: 	at org.datadog.jmxfetch.Instance.getMatchingAttributes(Instance.java:583)
datadog-agent: 	at org.datadog.jmxfetch.Instance.init(Instance.java:455)
datadog-agent: 	at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:15)
datadog-agent: 	at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:3)
datadog-agent: 	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
datadog-agent: 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
datadog-agent: 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
datadog-agent: 	at java.base/java.lang.Thread.run(Thread.java:1583)
datadog-agent: 2025-12-15 04:41:46 UTC | JMX | WARN | jmxfetch-recoveryPool-1 | Instance | Cannot get attributes or class name for bean kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec,topic=push-topic_playout-preparator_DLT:
datadog-agent: java.lang.NullPointerException: Cannot invoke "org.datadog.jmxfetch.Connection.getMBeanInfo(javax.management.ObjectName)" because "this.connection" is null
datadog-agent: 	at org.datadog.jmxfetch.Instance.getMatchingAttributes(Instance.java:583)
datadog-agent: 	at org.datadog.jmxfetch.Instance.init(Instance.java:455)
datadog-agent: 	at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:15)
datadog-agent: 	at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:3)
datadog-agent: 	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
datadog-agent: 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
datadog-agent: 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
datadog-agent: 	at java.base/java.lang.Thread.run(Thread.java:1583)
datadog-agent: 2025-12-15 04:41:46 UTC | JMX | WARN | jmxfetch-recoveryPool-1 | Instance | Cannot get attributes or class name for bean kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=ReplicaFetcherThread-0-31,topic=show_push-notification-service-connector_DLT,partition=1:
datadog-agent: java.lang.NullPointerException: Cannot invoke "org.datadog.jmxfetch.Connection.getMBeanInfo(javax.management.ObjectName)" because "this.connection" is null
datadog-agent: 	at org.datadog.jmxfetch.Instance.getMatchingAttributes(Instance.java:583)
datadog-agent: 	at org.datadog.jmxfetch.Instance.init(Instance.java:455)
datadog-agent: 	at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:15)
datadog-agent: 	at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:3)
datadog-agent: 	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
datadog-agent: 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
datadog-agent: 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
datadog-agent: 	at java.base/java.lang.Thread.run(Thread.java:1583)
datadog-agent: 2025-12-15 04:41:46 UTC | JMX | WARN | jmxfetch-recoveryPool-1 | Instance | Cannot get attributes or class name for bean kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec,topic=asset-fusion-service.event-livestream.order.v2:
datadog-agent: java.lang.NullPointerException: Cannot invoke "org.datadog.jmxfetch.Connection.getMBeanInfo(javax.management.ObjectName)" because "this.connection" is null
datadog-agent: 	at org.datadog.jmxfetch.Instance.getMatchingAttributes(Instance.java:583)
datadog-agent: 	at org.datadog.jmxfetch.Instance.init(Instance.java:455)
datadog-agent: 	at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:15)
datadog-agent: 	at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:3)
datadog-agent: 	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
datadog-agent: 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
datadog-agent: 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
datadog-agent: 	at java.base/java.lang.Thread.run(Thread.java:1583)
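
To check whether the same Kafka error precedes the crash on other affected nodes, the `KafkaError{...}` fields can be pulled out of the agent log. This is a minimal sketch that assumes the log format shown above; `extract_kafka_errors` and `KafkaLogError` are hypothetical names, not part of the agent or of confluent-kafka.

```python
import re
from typing import NamedTuple


class KafkaLogError(NamedTuple):
    code: str     # symbolic error name, e.g. FENCED_LEADER_EPOCH
    val: int      # numeric error code as printed in the log
    message: str  # human-readable error string


# Matches the repr printed in the traceback above, e.g.
#   KafkaError{code=FENCED_LEADER_EPOCH,val=74,str="Failed to get offsets: ..."}
KAFKA_ERROR = re.compile(
    r'KafkaError\{code=(?P<code>\w+),val=(?P<val>-?\d+),str="(?P<msg>[^"]*)"\}'
)


def extract_kafka_errors(log_text: str) -> list[KafkaLogError]:
    """Return every KafkaError found in the log text, in order of appearance."""
    return [
        KafkaLogError(m["code"], int(m["val"]), m["msg"])
        for m in KAFKA_ERROR.finditer(log_text)
    ]
```

Whether `FENCED_LEADER_EPOCH` is actually connected to the segmentation fault is not established by this report; the sketch only helps compare logs across nodes.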

Reproduction Steps

No response

Agent configuration

No response

Operating System

Fedora 42, x86_64

Other environment details

No response

Labels

oss/0, pending, team/agent-integrations
