-
Notifications
You must be signed in to change notification settings - Fork 251
file_util: improve rmtree performance #1672
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Optimize `rmtree` to significantly reduce cleanup time, especially for large buildroots (e.g., from ~13 minutes for a ~2M-file tree down to ~1 minute). Profiling showed that a substantial amount of time was spent in the trace decorator invoked on every recursive `rmtree` call. The repeated logic has been moved into an internal helper without the decorator, cutting the call stack depth roughly in half and eliminating redundant `os.path.islink` checks and `if path in exclude` lookups. `os.listdir` was also replaced with `os.scandir`, improving memory efficiency and reducing `os.stat` calls.
| _recursive_rmtree(path, selinux, exclude) | ||
|
|
||
|
|
||
| def _recursive_rmtree(path, selinux, exclude): |
Check warning
Code scanning / vcs-diff-lint
_recursive_rmtree: Too many statements (51/50) Warning
| @@ -0,0 +1,334 @@ | |||
| import os | |||
Check warning
Code scanning / vcs-diff-lint
Missing module docstring Warning test
| universal_newlines=True) | ||
|
|
||
|
|
||
| def chattr_works_or_skip(path: Path): |
Check warning
Code scanning / vcs-diff-lint
chattr_works_or_skip: Missing function or method docstring Warning test
| try: | ||
| set_immutable(path, True) | ||
| set_immutable(path, False) | ||
| except subprocess.CalledProcessError as e: |
Check warning
Code scanning / vcs-diff-lint
TestRmtree.test_rmtree_regular_directory: Redefining name 'temp_dir' from outer scope (line 14) Warning test
| "nested": {} | ||
| } | ||
| } | ||
| create_dir_structure(temp_dir, struct) |
Check warning
Code scanning / vcs-diff-lint
TestRmtree.test_rmtree_nonexistent_directory: Redefining name 'temp_dir' from outer scope (line 14) Warning test
| file_util.rmtree(str(temp_dir)) | ||
| assert (readonly_dir / "file.txt").exists() | ||
|
|
||
| # Return write permission on readonly_dir |
Check warning
Code scanning / vcs-diff-lint
TestRmtree.test_rmtree_symlink_itself: Redefining name 'temp_dir' from outer scope (line 14) Warning test
| link = temp_dir / "link" | ||
| real_dir.mkdir() | ||
| os.symlink(real_dir, link) | ||
|
|
Check warning
Code scanning / vcs-diff-lint
TestRmtree.test_rmtree_on_broken_symlink: Redefining name 'temp_dir' from outer scope (line 14) Warning test
| with pytest.raises(OSError, match="Cannot call rmtree on a symbolic link"): | ||
| file_util.rmtree(str(link)) | ||
|
|
||
| def test_rmtree_error_retry_simulated(self, temp_dir): |
Check warning
Code scanning / vcs-diff-lint
TestRmtree.test_rmtree_error_retry_simulated: Redefining name 'temp_dir' from outer scope (line 14) Warning test
|
|
||
| assert not temp_dir.exists() | ||
|
|
||
| def test_rmtree_long_path(self, temp_dir): |
Check warning
Code scanning / vcs-diff-lint
TestRmtree.test_rmtree_long_path: Redefining name 'temp_dir' from outer scope (line 14) Warning test
| if islonglongpath: | ||
| file_util.rmtree(str(temp_dir)) | ||
| assert not temp_dir.exists() | ||
| except OSError: |
Check warning
Code scanning / vcs-diff-lint
TestRmtree.test_rmtree_symlink_out: Redefining name 'temp_dir' from outer scope (line 14) Warning test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request significantly improves the performance of rmtree by refactoring the recursive logic into a helper function, avoiding a costly decorator on each recursive call. The use of os.scandir further enhances efficiency. The change is well-supported by a comprehensive new test suite, which is excellent. My review includes a few suggestions to improve the clarity and robustness of the new tests. Overall, this is a high-quality contribution that effectively addresses the performance bottleneck.
| def test_rmtree_error_retry_simulated(self, temp_dir): | ||
| """Simulate delayed deletion.""" | ||
| (temp_dir / "file.txt").write_text("will be deleted late") | ||
|
|
||
| # Monkey-patch os.remove to fail first few times | ||
| original_remove = os.remove | ||
| retries = 10 * 2 + 2 | ||
|
|
||
| def fake_remove(path): | ||
| nonlocal retries | ||
| if path == str(temp_dir / "file.txt") and retries: | ||
| retries -= 1 | ||
| if retries < 12: | ||
| raise OSError(errno.EBUSY, "tst EBUSY", path) | ||
| return | ||
| original_remove(path) | ||
|
|
||
| with patch("os.remove", fake_remove): | ||
| # Patch time.sleep to avoid long waits during retry | ||
| with patch("time.sleep"): | ||
| with pytest.raises(OSError, match="Directory not empty"): | ||
| file_util.rmtree(str(temp_dir)) | ||
| with pytest.raises(OSError, match="tst EBUSY"): | ||
| file_util.rmtree(str(temp_dir)) | ||
| file_util.rmtree(str(temp_dir)) | ||
|
|
||
| assert not temp_dir.exists() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is clever in its use of a stateful counter to test multiple scenarios in sequence. However, this approach makes the test difficult to understand and brittle, as each assertion depends on the state left by the previous one. Best practices for testing recommend that test cases be independent.
Consider refactoring this into separate, more focused tests for each scenario (ENOTEMPTY failure, EBUSY failure, and successful retry). This would improve readability and maintainability.
For example, an isolated test for the ENOTEMPTY scenario could look like this:
def test_rmtree_enotempty_retry_fails(self, temp_dir):
"""Test that rmtree fails after exhausting retries on ENOTEMPTY."""
(temp_dir / "file.txt").write_text("data")
original_remove = os.remove
def fake_remove_noop(path):
if path == str(temp_dir / "file.txt"):
return # Do nothing to simulate a persistent file
original_remove(path)
with patch("os.remove", fake_remove_noop), patch("time.sleep"):
with pytest.raises(OSError) as excinfo:
file_util.rmtree(str(temp_dir))
assert excinfo.value.errno == errno.ENOTEMPTY| except OSError: | ||
| pytest.skip("Support of path > PATH_MAX not yet implemented") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The broad except OSError: can mask actual bugs in file_util.rmtree. If rmtree were to fail with an OSError for a reason other than the filesystem's lack of support for long paths, this test would be skipped incorrectly instead of failing.
To make the test more robust, the exception handling should be more specific. The try...except block should ideally only wrap the code that is expected to fail on certain systems (i.e., create_dir_tree), and it should catch specific error codes like errno.ENAMETOOLONG. An OSError raised from file_util.rmtree should generally be considered a test failure.
| @@ -0,0 +1 @@ | |||
| The `file_util.rmtree` cleanup process has been significantly accelerated, especially for very large buildroots. The previous approach could take over 13 minutes to remove ~2 million files. A new implementation reducing cleanup times to one minute on same data. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The release note is clear, but the phrasing could be slightly more polished for the user-facing announcement. Consider this alternative for improved flow and conciseness:
"The file_util.rmtree cleanup process has been significantly accelerated, especially for very large buildroots. For example, cleanup time for a ~2M-file tree has been reduced from over 13 minutes to approximately one minute."
|
See also #1666. What if we just removed the decorator from affected calls? |
praiskup
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is simply too large (it appears to be generated) for me to provide a
comprehensive review. My time is limited, and we will likely close this soon.
We really need to discuss this with a human and have a dialogue.
Also, given the risk associated with a first-time GitHub contributor (not just Mock), we need to be careful about potential security threats (e.g., xz-like attacks).
Suggestion: If we know the root cause of the problem, let's implement a
minimal, few-line fix (avoiding the trace decorator) and evaluate the actual
benefits of such a large-scale change. The time spent on cleanups is not a
critical bottleneck—we often use tmpfs, minimal buildroots are truly minimal,
and source tarballs rarely contain millions of files. There is even an environment
variable available to experiment with this approach.
|
Ah yes, the modern AI-powered world - “it will make you more productive”, they said. This PR consists of a performance fix for The actual business-logic changes affect only 19 lines of code, mostly due to Python formatting. If we exclude moved code, the real changes amount to 11 lines, most of which are related to replacing Yes, the test skeleton was generated, but the polishing was done manually. The changes in Regarding the “dialog with a human” comment, if #1672 (comment) was addressed to me, then I didn’t realize that - apologies. My answer would be: how am I supposed to know how a change to the public API (decorator removing) of utility code will affect downstream code with whom I am not familiar? The statement about this being my first GitHub contribution is technically incorrect. While I contribute infrequently, this is not my first PR. Unfortunately, the size of the buildroot is caused not so much by sources as by build artifacts, this is a production environment. To summarize, before proceeding further I’d like clarification on the following points:
|
Optimize
rmtreeto significantly reduce cleanup time, especially for large buildroots (e.g., from ~13 minutes for a ~2M-file tree down to ~1 minute).Profiling showed that a substantial amount of time was spent in the trace decorator invoked on every recursive
rmtreecall.The repeated logic has been moved into an internal helper without the decorator, cutting the call stack depth roughly in half and eliminating redundant
os.path.islinkchecks andif path in excludelookups.os.listdirwas also replaced withos.scandir, improving memory efficiency and reducingos.statcalls.New implementation spends most of its time in syscalls.