-
Notifications
You must be signed in to change notification settings - Fork 593
HDDS-14432. SnapshotDeletingService incorrectly checks Ratis Buffer Limit #9656
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes incorrect Ratis buffer limit checking in SnapshotDeletingService. The old implementation would add all snapshot-related entries to a request builder before checking if the message size exceeded the buffer limit, then attempted to recover by truncating the lists. This approach was flawed as it could create oversized messages.
Changes:
- Replaced the flawed size-checking logic with progressive batch building that checks size limits before adding each entry
- Introduced proper batching across deletedKeys, renamedKeys, and deletedDirs with early termination on failures
- Improved error handling by returning the count of successfully submitted entries for accurate retry tracking
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| long batchBytes = 0; | ||
| int totalSubmitted = 0; | ||
| int batchCount = 0; | ||
|
|
||
| for (SnapshotMoveKeyInfos key : deletedKeys) { | ||
| int keySize = key.getSerializedSize(); | ||
|
|
||
| // If adding this key would exceed the limit, flush the current batch first | ||
| if (batchBytes + keySize > ratisByteLimit && !currentDeletedKeys.isEmpty()) { | ||
| batchCount++; | ||
| LOG.debug("Submitting batch {} for snapshot {} with {} deletedKeys, {} renamedKeys, {} deletedDirs, " + | ||
| "size: {} bytes", batchCount, snapInfo.getTableKey(), currentDeletedKeys.size(), | ||
| currentRenamedKeys.size(), currentDeletedDirs.size(), batchBytes); | ||
|
|
||
| if (!submitSingleSnapshotMoveBatch(snapInfo, currentDeletedKeys, currentRenamedKeys, currentDeletedDirs)) { | ||
| return totalSubmitted; | ||
| } | ||
|
|
||
| totalSubmitted += currentDeletedKeys.size(); | ||
| currentDeletedKeys.clear(); | ||
| batchBytes = 0; | ||
| } | ||
|
|
||
| currentDeletedKeys.add(key); | ||
| batchBytes += keySize; | ||
| } |
Copilot
AI
Jan 21, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The batch size calculation sums the serialized sizes of individual entries but doesn't account for the protocol buffer message overhead from SnapshotMoveTableKeysRequest and OMRequest wrappers. While the 10% safety margin in ratisByteLimit (line 113) may provide some buffer, consider explicitly accounting for base message overhead or documenting this limitation to ensure batches reliably stay under the limit even with many small entries.
| } | ||
|
|
||
| totalSubmitted += currentDeletedKeys.size(); | ||
| currentDeletedKeys.clear(); |
Copilot
AI
Jan 21, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After flushing a batch during deletedKeys processing, only currentDeletedKeys is cleared (line 333), while currentRenamedKeys and currentDeletedDirs are not explicitly cleared. Although they should be empty at this point, for consistency and defensive programming, consider clearing all three lists after every flush to maintain a consistent state. This pattern is followed correctly in the later flush points (lines 357-359 and 382-385).
| currentDeletedKeys.clear(); | |
| currentDeletedKeys.clear(); | |
| currentRenamedKeys.clear(); | |
| currentDeletedDirs.clear(); |
What changes were proposed in this pull request?
SnapshotDeletingServiceincorrectly checks the Ratis Buffer Limit when we invokesubmitSnapshotMoveDeletedKeys, it adds all the keys to the builder and then checks the size of the message. Later, it again adds moredeletedKeys,deletedDirsandrenamedKeyson it, The message size is already over the ratis buffer limit, instead of creating a new builder we re-use the same builder. Ideally we should stop iterating through the snapshot-related tables when we hit the buffer limit even before going tosubmitSnapshotMoveDeletedKeysWhat is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14432
How was this patch tested?
Existing Tests