zhisbug suggested changes on May 12, 2021
```diff
  checkpoint_suffix = 'c10'
  checkpoint_name = checkpoint_dir + checkpoint_suffix
- if IS_AUTODIST_CHIEF:
+ if IS_AUTODIST_CHIEF():
```
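The parentheses in the suggestion matter: if `IS_AUTODIST_CHIEF` is a callable (as the suggested change implies), referencing it without calling it is always truthy, so the guarded branch would run on every replica. A minimal stdlib-only illustration (the `is_chief` name here is a hypothetical stand-in):

```python
def is_chief():
    """Stand-in for a chief-detection helper; always False in this toy."""
    return False

# A bare function reference is a truthy object, so this branch ALWAYS runs,
# regardless of what the function would return:
ran_without_call = False
if is_chief:
    ran_without_call = True

# Calling the function evaluates the actual condition:
ran_with_call = False
if is_chief():
    ran_with_call = True

print(ran_without_call, ran_with_call)  # -> True False
```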
Contributor
could you add a test case (e.g. case c11) that uses the linear regression code above with the Ray backend, so the CI can exercise it every time a new case is added? You might want to add it to both the single-node multi-GPU tests and the distributed tests.
```python
def spawn_replica(replica_host, strategy_builder, strategy=None, env=None):
    # Enforce actor placement on the provided host
    runner = ray.remote(resources={f"node:{replica_host}": 0.01},
```
Contributor
I believe this requires a custom resource specification when you run `ray up` to start the Ray cluster?
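For context: Ray automatically attaches a `node:<ip_address>` resource to every node in the cluster, so the `resources={f"node:{replica_host}": 0.01}` placement trick should work without extra configuration; declaring resources in the `ray up` cluster YAML is only needed for user-defined resource names. A hedged sketch of where such custom resources would go, assuming the standard cluster-launcher format (node-type names are illustrative):

```yaml
# Sketch of a cluster-launcher config with a custom resource. Ray itself
# auto-creates a `node:<ip>` resource for every node, so that one does
# not need to be declared here.
cluster_name: autodist-test
available_node_types:
  head_node:
    resources: {}
    node_config: {}
  worker_node:
    min_workers: 2
    resources: {"custom_slot": 1}
    node_config: {}
```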
This PR adds a RaySGD-style API to AutoDist, enabling it to train models on a Ray cluster. The API defines a `TFTrainer` class that takes a model creator, a data creator, a train step, and a strategy builder, and runs the training job on a distributed Ray cluster. It follows the RaySGD API and is compatible with Ray Tune.

Internally it implements a `TFRunner` class that represents a replica. All communication between the master and worker replicas happens through Ray's in-memory object store, so there is no dependence on remote file-system locations or access rights, and SSH is not needed. Moreover, the client code executed by each worker is also replicated via Ray, eliminating the need to copy the model code to remote file systems on each node. Users can run the example by installing Ray and running `$ python linear_regression_ray.py`.

Reference: https://docs.ray.io/en/master/raysgd/raysgd_tensorflow.html
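The creator-function pattern described above (the trainer receives factory callables and invokes them on each replica, rather than shipping live model/data objects across processes) can be illustrated with a stdlib-only toy. `MiniTrainer` and the toy linear model below are hypothetical stand-ins for illustration, not the PR's actual `TFTrainer`:

```python
class MiniTrainer:
    """Toy trainer that, like the described API, takes creator callables."""

    def __init__(self, model_creator, data_creator, train_step):
        # In the real distributed setting, each replica would call the
        # creators locally; here we have a single in-process "replica".
        self.model = model_creator()
        self.data = data_creator()
        self.train_step = train_step

    def train(self, epochs=1):
        history = []
        for _ in range(epochs):
            history.append(self.train_step(self.model, self.data))
        return history


# Toy linear regression: learn w in y = w * x from data generated with w = 2.
def model_creator():
    return {"w": 0.0}

def data_creator():
    return [(1.0, 2.0), (2.0, 4.0)]

def train_step(model, data):
    lr = 0.1
    loss = 0.0
    for x, y in data:
        pred = model["w"] * x
        grad = 2 * (pred - y) * x  # d/dw of squared error
        model["w"] -= lr * grad
        loss += (pred - y) ** 2
    return loss / len(data)

trainer = MiniTrainer(model_creator, data_creator, train_step)
losses = trainer.train(epochs=20)
print(round(trainer.model["w"], 3))  # -> 2.0 (converges to the true weight)
```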
Fixes #57