Open
Labels
oss/0, pending, team/agent-integrations
Description
Agent version
7.72.2
Bug Report
We are Aiven, a database-as-a-service provider. We noticed that a datadog-agent was repeatedly crashing with a segmentation fault and restarting on one of our customers' Kafka services.
The crash appears to originate in librdkafka-8fb48086.so.1, located at /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka.libs/librdkafka-8fb48086.so.1 (build ID ddd8a17481071521b7a5b9e0edc9fff3ec594879).
We generated the GDB backtrace below using datadog-agent-dbg-7.72.2-1.x86_64.rpm.
This only affects a small number of nodes, but it would be great if it could be fixed in a future agent version.
Thank you.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/datadog-agent/bin/agent/agent run -p /opt/datadog-agent/run/agent.pid'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007fee849f02a1 in rd_kafka_cgrp_op () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
[Current thread is 1 (Thread 0x7fee337e66c0 (LWP 71662))]
warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts
of file /opt/datadog-agent/.debug/opt/datadog-agent/bin/agent/agent.dbg.
Use `info auto-load python-scripts [REGEXP]' to list them.
Missing rpms, try: dnf --enablerepo='*debug*' install datadog-agent-debuginfo-7.72.2-1.x86_64
(gdb) backtrace
#0 0x00007fee849f02a1 in rd_kafka_cgrp_op () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#1 0x00007fee849c6419 in rd_kafka_handle_OffsetFetch () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#2 0x00007fee84a58830 in rd_kafka_ListConsumerGroupOffsetsResponse_parse () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#3 0x00007fee84a84f67 in rd_kafka_admin_worker () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#4 0x00007fee84a5e15d in rd_kafka_admin_handle_response () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#5 0x00007fee849ad104 in rd_kafka_buf_callback () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#6 0x00007fee849b3ed0 in rd_kafka_op_handle_std () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#7 0x00007fee849b3f78 in rd_kafka_op_handle () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#8 0x00007fee849b08a4 in rd_kafka_q_serve0[localalias] () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#9 0x00007fee849b0c2b in rd_kafka_q_serve () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#10 0x00007fee8497cf53 in rd_kafka_thread_main () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#11 0x00007fee84a30747 in _thrd_wrapper_function () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
#12 0x00007feed0147f54 in start_thread (arg=<optimized out>) at pthread_create.c:448
#13 0x00007feed01cb32c in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7fee337e66c0 (LWP 71662) 0x00007fee849f02a1 in rd_kafka_cgrp_op () from /opt/datadog-agent/embedded/lib/python3.13/site-packages/confluent_kafka/../confluent_kafka.libs/librdkafka-8fb48086.so.1
2 Thread 0x7fee8909f6c0 (LWP 69933) runtime.futex () at runtime/sys_linux_amd64.s:558
3 Thread 0x7fee3a7f46c0 (LWP 71686) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
4 Thread 0x7fee8809d6c0 (LWP 69935) runtime.futex () at runtime/sys_linux_amd64.s:558
5 Thread 0x7fee6f9e86c0 (LWP 70011) runtime.futex () at runtime/sys_linux_amd64.s:558
6 Thread 0x7fee39ff36c0 (LWP 71663) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
7 Thread 0x7fee6cf6c6c0 (LWP 70199) internal/runtime/syscall.Syscall6 () at internal/runtime/syscall/asm_linux_amd64.s:36
8 Thread 0x7fee3c7f86c0 (LWP 71675) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
9 Thread 0x7fee3effd6c0 (LWP 71689) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
10 Thread 0x7fee867da6c0 (LWP 69942) internal/runtime/syscall.Syscall6 () at internal/runtime/syscall/asm_linux_amd64.s:36
11 Thread 0x7fee8705b6c0 (LWP 69941) runtime.futex () at runtime/sys_linux_amd64.s:558
12 Thread 0x7fee59ffb6c0 (LWP 70223) runtime.futex () at runtime/sys_linux_amd64.s:558
13 Thread 0x7fee5bfff6c0 (LWP 70200) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
14 Thread 0x7fee6e4146c0 (LWP 70023) runtime.futex () at runtime/sys_linux_amd64.s:558
15 Thread 0x7fee3cff96c0 (LWP 71674) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
16 Thread 0x7fee3ffff6c0 (LWP 71671) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
17 Thread 0x7fee5b7fe6c0 (LWP 71666) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
18 Thread 0x7fee3f7fe6c0 (LWP 71688) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
19 Thread 0x7feecfd40780 (LWP 69930) runtime.futex () at runtime/sys_linux_amd64.s:558
20 Thread 0x7fee3d7fa6c0 (LWP 71676) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
21 Thread 0x7fee8889e6c0 (LWP 69934) runtime.futex () at runtime/sys_linux_amd64.s:558
22 Thread 0x7fee2b7d66c0 (LWP 71684) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
23 Thread 0x7fee2afd56c0 (LWP 71685) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
24 Thread 0x7fee3e7fc6c0 (LWP 71687) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
25 Thread 0x7fee5affd6c0 (LWP 71667) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
26 Thread 0x7fee5a7fc6c0 (LWP 71668) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
27 Thread 0x7fee597fa6c0 (LWP 71669) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
28 Thread 0x7fee58ff96c0 (LWP 71670) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
29 Thread 0x7fee33fe76c0 (LWP 71661) __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
30 Thread 0x7fee8789c6c0 (LWP 69936) runtime.futex () at runtime/sys_linux_amd64.s:558
31 Thread 0x7fee6d86d6c0 (LWP 70198) __strlen_avx2_rtm () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:77
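To compare this crash across the affected nodes, the backtrace can be reduced to a short signature of the innermost function names. The following is only a minimal sketch, assuming the gdb output has been saved as text; `parse_frames` and `signature` are hypothetical helper names, not part of gdb or the agent, and the sample is shortened to the first three frames with the library path abbreviated to its file name.

```python
import re

# Matches gdb frame lines such as:
#   #0 0x00007fee849f02a1 in rd_kafka_cgrp_op () from librdkafka-8fb48086.so.1
#   #12 0x00007feed0147f54 in start_thread (arg=<optimized out>) at pthread_create.c:448
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-fA-F]+)\s+in\s+(\S+)\s+\(")


def parse_frames(gdb_text):
    """Return (frame_number, address, function) tuples found in gdb backtrace text."""
    frames = []
    for line in gdb_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((int(match.group(1)), match.group(2), match.group(3)))
    return frames


def signature(frames, depth=3):
    """Join the innermost `depth` function names into a crash signature."""
    return " <- ".join(name for _, _, name in frames[:depth])


# First three frames of the backtrace above (path shortened for the example).
sample = """\
#0 0x00007fee849f02a1 in rd_kafka_cgrp_op () from librdkafka-8fb48086.so.1
#1 0x00007fee849c6419 in rd_kafka_handle_OffsetFetch () from librdkafka-8fb48086.so.1
#2 0x00007fee84a58830 in rd_kafka_ListConsumerGroupOffsetsResponse_parse () from librdkafka-8fb48086.so.1
"""

frames = parse_frames(sample)
print(signature(frames))
```

If core dumps from other nodes produce the same signature, that would suggest they share one crash path through the admin ListConsumerGroupOffsets response handling.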
We also see the following logs before the Datadog agent hits the segmentation fault:
datadog-agent: 2025-12-15 04:38:02 UTC | CORE | ERROR | (comp/metadata/inventorychecks/inventorychecksimpl/inventorychecks.go:304 in getFilesMetadata) | could not read files metadata
datadog-agent: 2025-12-15 04:39:21 UTC | CORE | ERROR | (pkg/collector/python/datadog_agent.go:143 in LogMessage) | kafka_consumer:15c8da990f5ee968 | (kafka_consumer.py:78) | There was a problem collecting the highwater mark offsets.
datadog-agent: Traceback (most recent call last):
datadog-agent: File "/opt/datadog-agent/embedded/lib/python3.13/site-packages/datadog_checks/kafka_consumer/kafka_consumer.py", line 70, in check
datadog-agent: highwater_offsets, cluster_id = self.get_highwater_offsets(consumer_offsets)
datadog-agent: ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
datadog-agent: File "/opt/datadog-agent/embedded/lib/python3.13/site-packages/datadog_checks/kafka_consumer/kafka_consumer.py", line 383, in get_highwater_offsets
datadog-agent: for topic, partition, offset in self.client.consumer_offsets_for_times(
datadog-agent: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
datadog-agent: partitions=topic_partitions_to_check
datadog-agent: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
datadog-agent: ):
datadog-agent: ^
datadog-agent: File "/opt/datadog-agent/embedded/lib/python3.13/site-packages/datadog_checks/kafka_consumer/client.py", line 105, in consumer_offsets_for_times
datadog-agent: for tp in self._consumer.offsets_for_times(
datadog-agent: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
datadog-agent: partitions=topicpartitions_for_querying, timeout=self.config._request_timeout
datadog-agent: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
datadog-agent: )
datadog-agent: ^
datadog-agent: cimpl.KafkaException: KafkaError{code=FENCED_LEADER_EPOCH,val=74,str="Failed to get offsets: Broker: Leader epoch is older than broker epoch"}
datadog-agent: 2025-12-15 04:41:46 UTC | JMX | WARN | jmxfetch-recoveryPool-1 | Instance | Cannot get attributes or class name for bean kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec,topic=publication_delivery-api-connector_DLT:
datadog-agent: java.lang.NullPointerException: Cannot invoke "org.datadog.jmxfetch.Connection.getMBeanInfo(javax.management.ObjectName)" because "this.connection" is null
datadog-agent: at org.datadog.jmxfetch.Instance.getMatchingAttributes(Instance.java:583)
datadog-agent: at org.datadog.jmxfetch.Instance.init(Instance.java:455)
datadog-agent: at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:15)
datadog-agent: at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:3)
datadog-agent: at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
datadog-agent: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
datadog-agent: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
datadog-agent: at java.base/java.lang.Thread.run(Thread.java:1583)
datadog-agent: 2025-12-15 04:41:46 UTC | JMX | WARN | jmxfetch-recoveryPool-1 | Instance | Cannot get attributes or class name for bean kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec,topic=push-topic_playout-preparator_DLT:
datadog-agent: java.lang.NullPointerException: Cannot invoke "org.datadog.jmxfetch.Connection.getMBeanInfo(javax.management.ObjectName)" because "this.connection" is null
datadog-agent: at org.datadog.jmxfetch.Instance.getMatchingAttributes(Instance.java:583)
datadog-agent: at org.datadog.jmxfetch.Instance.init(Instance.java:455)
datadog-agent: at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:15)
datadog-agent: at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:3)
datadog-agent: at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
datadog-agent: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
datadog-agent: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
datadog-agent: at java.base/java.lang.Thread.run(Thread.java:1583)
datadog-agent: 2025-12-15 04:41:46 UTC | JMX | WARN | jmxfetch-recoveryPool-1 | Instance | Cannot get attributes or class name for bean kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=ReplicaFetcherThread-0-31,topic=show_push-notification-service-connector_DLT,partition=1:
datadog-agent: java.lang.NullPointerException: Cannot invoke "org.datadog.jmxfetch.Connection.getMBeanInfo(javax.management.ObjectName)" because "this.connection" is null
datadog-agent: at org.datadog.jmxfetch.Instance.getMatchingAttributes(Instance.java:583)
datadog-agent: at org.datadog.jmxfetch.Instance.init(Instance.java:455)
datadog-agent: at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:15)
datadog-agent: at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:3)
datadog-agent: at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
datadog-agent: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
datadog-agent: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
datadog-agent: at java.base/java.lang.Thread.run(Thread.java:1583)
datadog-agent: 2025-12-15 04:41:46 UTC | JMX | WARN | jmxfetch-recoveryPool-1 | Instance | Cannot get attributes or class name for bean kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec,topic=asset-fusion-service.event-livestream.order.v2:
datadog-agent: java.lang.NullPointerException: Cannot invoke "org.datadog.jmxfetch.Connection.getMBeanInfo(javax.management.ObjectName)" because "this.connection" is null
datadog-agent: at org.datadog.jmxfetch.Instance.getMatchingAttributes(Instance.java:583)
datadog-agent: at org.datadog.jmxfetch.Instance.init(Instance.java:455)
datadog-agent: at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:15)
datadog-agent: at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:3)
datadog-agent: at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
datadog-agent: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
datadog-agent: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
datadog-agent: at java.base/java.lang.Thread.run(Thread.java:1583)
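To see at a glance which components were logging errors in the minutes before the crash, the log lines above can be tallied by component and level. This is a rough sketch, assuming the `datadog-agent: <timestamp> UTC | <COMPONENT> | <LEVEL> | ...` layout seen in the excerpt; `summarize` is a hypothetical helper, and the JMX sample line has its message shortened.

```python
import re
from collections import Counter

# Header lines look like:
#   datadog-agent: 2025-12-15 04:38:02 UTC | CORE | ERROR | ... | message
# Traceback and stack-trace continuation lines do not match and are skipped.
LOG_RE = re.compile(
    r"^datadog-agent: (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC \| (\w+) \| (\w+) \| (.*)$"
)


def summarize(log_text):
    """Count log header lines per (component, level) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOG_RE.match(line)
        if match:
            _, component, level, _ = match.groups()
            counts[(component, level)] += 1
    return counts


sample_log = """\
datadog-agent: 2025-12-15 04:38:02 UTC | CORE | ERROR | (comp/metadata/inventorychecks/inventorychecksimpl/inventorychecks.go:304 in getFilesMetadata) | could not read files metadata
datadog-agent: Traceback (most recent call last):
datadog-agent: 2025-12-15 04:41:46 UTC | JMX | WARN | jmxfetch-recoveryPool-1 | Instance | Cannot get attributes or class name for bean
"""

for (component, level), count in sorted(summarize(sample_log).items()):
    print(component, level, count)
```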
Reproduction Steps
No response
Agent configuration
No response
Operating System
Fedora 42, x86_64
Other environment details
No response
aiven-anton