Description
System Info
- OS
  - Ubuntu 22.04
  - Windows
- ROS version and installation type
  - rolling/source
- RTI Connext DDS version and installation type
  - Connext Pro 6.0.1
- RMW version or commit hash
  - rolling branch (a6053be as of this writing)
Bug Description
With no load on the machine, the gtest_subscription__rmw_connextdds test from the test_rclcpp package passes essentially every time, and it also appears to pass consistently in CI. However, with load on the machine it fails very often, roughly 75% of the time.
Expected Behavior
The test always passes, even with a lot of load on the machine.
How to Reproduce
In terminal 1, put a lot of stress on the machine. In my case:
$ stress --cpu 16
In terminal 2, run the test:
$ colcon test --event-handlers console_direct+ --packages-select test_rclcpp --ctest-args -R gtest_subscription__rmw_connextdds
You may have to adjust the amount of stress on the machine, and you may have to run the test a few times, but it should fail fairly quickly.
Workarounds
Mark this test as xfail.
Additional context
This same test works fine on Fast-DDS (gtest_subscription__rmw_fastrtps_cpp) and Cyclone DDS (gtest_subscription__rmw_cyclonedds_cpp), even with load on the machine.
This has come up now because we are about to merge ros2/rclcpp#2142, which seems to exacerbate the problem. However, I can reproduce this entirely with rolling packages as of today, so it is not the fault of that PR.
I did some additional debugging to try to track this down. When it fails, the executor is blocked in rmw_wait, waiting on a condition variable for new data to come in:
    timedout = !this->condition.wait_for(lock, n, on_condition_active);
That condition variable should be triggered when new data comes in via the on_data_available callback in the subscriber:
    RMW_Connext_DataReaderListener_on_data_available(
    listener.on_data_available =
Finally, I've verified that the publisher side is indeed writing the data out via DDS_DataWriter_write_untypedI in message (
    DDS_DataWriter_write_untypedI(
So as far as I can tell, those are all of the pieces necessary to get this working, and it does work sometimes. But under load, it seems to fail. I could use some advice on how to debug this further.
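For reference, the wait/notify handshake described above boils down to something like the following. This is a standalone sketch using only the C++ standard library, not the actual rmw_connextdds code: WaitSetSketch, data_ready_, and run_round are made-up names, and the real listener and wait set do considerably more. It only illustrates that a predicate-based wait_for should not miss a notification as long as the flag is set under the same mutex.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the wait set; NOT the rmw_connextdds class.
class WaitSetSketch {
public:
  // Plays the role of the on_data_available listener callback:
  // set the flag under the mutex, then wake any waiter.
  void on_data_available() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      data_ready_ = true;
    }
    condition_.notify_all();
  }

  // Plays the role of the wait inside rmw_wait.
  // Returns true if the wait timed out without data becoming available.
  bool wait(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    auto on_condition_active = [this]() { return data_ready_; };
    // wait_for with a predicate checks the predicate before blocking,
    // so a notification that arrived earlier is not lost.
    bool timedout = !condition_.wait_for(lock, timeout, on_condition_active);
    data_ready_ = false;  // consume the event
    return timedout;
  }

private:
  std::mutex mutex_;
  std::condition_variable condition_;
  bool data_ready_ = false;
};

// One publish/wait round: a second thread "publishes" after publish_delay
// while the calling thread waits for up to timeout.
// Returns true if the waiter timed out.
bool run_round(std::chrono::milliseconds publish_delay,
               std::chrono::milliseconds timeout) {
  WaitSetSketch ws;
  std::thread publisher([&ws, publish_delay]() {
    std::this_thread::sleep_for(publish_delay);
    ws.on_data_available();
  });
  bool timedout = ws.wait(timeout);
  publisher.join();
  return timedout;
}
```

Calling run_round(std::chrono::milliseconds(10), std::chrono::milliseconds(5000)) in a loop while stress is running is one way to check whether the pattern itself is sensitive to load. If this sketch never times out but the real test does, that would point at the callback not firing (or the flag not being set) rather than at the wait itself.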
In the meantime, I'm going to propose a PR to mark that particular test as xfail so we can make progress on the other PRs.