Conversation
@YLJALDC could you please address the comments above?
I have addressed the comments. Do you have any further suggestions? |
Will review tomorrow and see whether more changes are needed. |
    # At AdaptDL mode, when the worker pass through this before
    # the chief has created the strategy, this should returns
    # nothing. Later, when the chief has created the strategy,
    # it can load it.
Still not quite sure about the purpose of load and what this comment means. In L162-L163 load is always true when IS_ADAPTDL is true. Could you explain more?
It is kind of subtle. Previously, the Autodist chief ran first and generated the strategy; it spawned the worker instances only after it had built the strategy, set up the cluster, etc. Now every instance runs through _build and therefore calls _build_or_load_strategy. The first time, the worker gets None from this function. The second time, the worker gets the strategy from the chief. This is because Kubernetes launches the instances in parallel. By the second time the worker calls load, the chief is guaranteed to have generated the strategy, because there are several blocking collective calls in between.
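The two-pass behavior described above could be sketched roughly as follows. This is only an illustration, not the code in this PR: the in-memory `_STRATEGY_STORE`, the `build_or_load_strategy` helper, and its parameters are hypothetical stand-ins for wherever the chief actually persists the strategy.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in for the shared location where the chief publishes
# the strategy (an assumption for illustration, not Autodist's storage).
_STRATEGY_STORE: Dict[str, str] = {}


def build_or_load_strategy(is_chief: bool,
                           build: Callable[[], str],
                           key: str = "strategy") -> Optional[str]:
    """Chief builds and publishes; a worker only loads and may get None."""
    if is_chief:
        strategy = build()
        _STRATEGY_STORE[key] = strategy
        return strategy
    # Worker path: returns None if the chief has not published anything yet.
    return _STRATEGY_STORE.get(key)


# First pass: the worker starts in parallel with the chief and finds nothing.
first_try = build_or_load_strategy(is_chief=False, build=lambda: "unused")

# The chief builds the strategy. In the flow described above, blocking
# collective calls separate the worker's first and second attempts.
chief_result = build_or_load_strategy(is_chief=True, build=lambda: "strategy-v1")

# Second pass: the worker now loads what the chief published.
second_try = build_or_load_strategy(is_chief=False, build=lambda: "unused")
```

In this sketch the ordering is enforced simply by the order of the calls; in the real setup that guarantee would have to come from the blocking collective calls mentioned in the reply.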
        self._coordinator.launch_clients()
    else:
        if IS_AUTODIST_CHIEF:
            self._coordinator = Coordinator(strategy=strategy, cluster=self._cluster)
Would it be better if we create different Coordinator classes based on the cluster mode?
Good suggestion. I tried a structure similar to the one you suggest, but I think the current version is more readable in autodist.py, even though it is longer. It is also easier to maintain this way, since autodist.py is the first file people look at.
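For comparison, the alternative raised in the question might look something like the sketch below. All names here are assumptions made for illustration: `BaseCoordinator`, `AdaptDLCoordinator`, and `make_coordinator` are hypothetical and are not classes from this PR, and the method bodies only return descriptive strings.

```python
class BaseCoordinator:
    """Hypothetical stand-in for a coordinator in the default cluster mode."""

    def __init__(self, strategy, cluster):
        self.strategy = strategy
        self.cluster = cluster

    def launch_clients(self):
        # Default mode: the chief is responsible for starting the workers.
        return "chief launches workers"


class AdaptDLCoordinator(BaseCoordinator):
    """Hypothetical variant for a mode where Kubernetes starts every instance."""

    def launch_clients(self):
        # Nothing to spawn: the instances are already running.
        return "workers already launched by Kubernetes"


def make_coordinator(strategy, cluster, is_adaptdl):
    """Pick the coordinator class from the cluster mode in one place."""
    cls = AdaptDLCoordinator if is_adaptdl else BaseCoordinator
    return cls(strategy=strategy, cluster=cluster)


default_coord = make_coordinator("strategy-v1", "cluster", is_adaptdl=False)
adaptdl_coord = make_coordinator("strategy-v1", "cluster", is_adaptdl=True)
```

The trade-off matches the reply above: a factory keeps the mode check out of the call site, while explicit branches in autodist.py keep the whole flow visible in one file.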
autodist/cluster.py
Outdated
    if IS_ADAPTDL:
        hostname = socket.gethostname()
        local_ip = socket.gethostbyname(hostname)
        return local_ip
Since there is already a class named ADAPTDLCluster that inherits from Cluster, is it necessary to put ADAPTDL-related code in the base class?
Thanks for pointing this out. I have moved this code into the ADAPTDLCluster class.
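A minimal sketch of that kind of refactor, assuming the lookup lives in an overridable method: the method name `get_local_address` and the base-class return value are hypothetical, since the actual method is not shown in this thread. Only the `socket` calls are taken from the snippet above.

```python
import socket


class Cluster:
    """Simplified stand-in for the base class, free of mode-specific branches."""

    def get_local_address(self):
        # Placeholder default for illustration only.
        return "127.0.0.1"


class ADAPTDLCluster(Cluster):
    """The AdaptDL-specific lookup lives in the subclass instead."""

    def get_local_address(self):
        # Resolve this instance's own hostname to an IP address.
        hostname = socket.gethostname()
        return socket.gethostbyname(hostname)
```

With the override in place, the base class no longer needs an `if IS_ADAPTDL:` check; callers get the right behavior from whichever class was instantiated.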