Speed up check_kubernetes_service_replication#4020
Merged
Conversation
Member
Author
|
grr, i forgot that make test doesn't run mypy for some reason |
Member
Author
|
huh, what are our tests doing - mypy is showing me that this should have definitely broken some of the other checkers (e.g., the flink one) |
At some point this ran in <1min, and we have >1 things assuming this is still running in <1min...but the current version of this script actually takes multiple minutes in most of our clusters. I grabbed a couple flamegraphs of the current version of this (which I've since lost in my scrollback/browser history, but I can re-gen pretty easily) and noticed a couple obvious things: * we spend a significant amount of time getting pods - let's parallelize that! Note: we do this with multiprocessing and not multithreading since there's a lot of serialization happening of k8s objects that would likely hold onto the GIL * we were spending an obscene amount of time in filter_pods_by_service_instance() since we were calling it over and over again - let's group pods once in a smarter way and pass the grouping around instead :) * this one is kinda ???: socket.gethostbyaddr() was pretty prominent in the flamegraphs - we don't actually use the hostnames in this check, so let's add a way to skip that hostname resolution I also deleted the --additional-namespaces code since it's entirely unused now and it made things slightly cleaner.
6f67448 to
e09c741
Compare
EvanKrall
previously approved these changes
Mar 18, 2025
Member
EvanKrall
left a comment
There was a problem hiding this comment.
I'd prefer grouped_pods to be something more descriptive like pods_by_service_instance or something
…92-speedup-check_kubernetes_service_replication
Member
Author
|
I also tested the flink replication checker on infrastage after mypy alerted me that I had broken it :) |
Member
Author
@EvanKrall done! |
jfongatyelp
approved these changes
Apr 1, 2025
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
At some point this ran in <1min, and we have >1 things assuming this is still running in <1min...but the current version of this script actually takes multiple minutes in most of our clusters.
I grabbed a couple flamegraphs of the current version of this (which I've since lost in my scrollback/browser history, but I can re-gen pretty easily) and noticed a couple obvious things:
I also deleted the --additional-namespaces code since it's entirely unused now and it made things slightly cleaner.
EDIT: I regenerated the flamegraphs: