Agit provides multi-machine training environments. A simple way to utilize these machines is through the Ray interface; the Ray framework is configured in Agit by default. Several simple examples are shown below.
In Ray, we can create and compute on objects. We refer to these objects as remote objects, and we use object IDs
to refer to them. Remote objects are stored in shared-memory object stores, and there is one object store per node in the cluster. In the cluster setting, we may not actually know which machine each object lives on.
Ray examples/ray_core-objects_in_Ray is an example that shows how to operate on objects in Ray.
Ray enables arbitrary Python functions to be executed asynchronously. These asynchronous Ray functions are called “remote functions” (tasks). The standard way to turn a Python function into a remote function is to add the @ray.remote decorator.
Ray examples/ray_core-remote_functions is an example of Ray tasks.
Actors extend the Ray API from functions (tasks) to classes. The @ray.remote decorator indicates that instances of the class will be actors. An actor is essentially a stateful worker. Each actor runs in its own Python process.
Ray examples/ray_core-remote_classes is an example of Ray actors.
To specify a task’s CPU and GPU requirements, pass the num_cpus and num_gpus arguments into the remote decorator, like @ray.remote(num_cpus=2, num_gpus=0.5). The task will only run on a machine if there are enough CPU and GPU (and other custom) resources available to execute the task. Ray can also handle arbitrary custom resources.
You can specify resource requirements for actors too. When an actor decorated with @ray.remote(num_cpus=2, num_gpus=1) is instantiated, it will be placed on a node that has at least 1 GPU, and the GPU will be reserved for the actor for the duration of its lifetime (even if the actor is not executing tasks). The GPU resources are released when the actor terminates.
Ray examples/ray_core-specify_resources is an example of specifying CPU and GPU resources for Ray tasks and actors.
Tune is a Python library for experiment execution and hyperparameter tuning at any scale.
It is easy to use and supports any machine learning framework, including PyTorch, XGBoost, MXNet, and Keras.
Tune natively integrates with optimization libraries such as ZOOpt, Bayesian Optimization, and Facebook Ax.
This example runs a small grid search to train a convolutional neural network using PyTorch and Tune.
Tune’s search algorithms are wrappers around open-source optimization libraries for efficient hyperparameter selection. Each library has a specific way of defining the search space - please refer to their documentation for more details.
Ray examples/tune-ZOOpt is an example of using ZOOpt (a library for derivative-free optimization) to provide trial suggestions.
RLlib is an open-source library for reinforcement learning that offers both high scalability and a unified API for a variety of applications.
RLlib natively supports TensorFlow, TensorFlow Eager, and PyTorch, but most of its internals are framework agnostic.
At a high level, RLlib provides a Trainer class which holds a policy for environment interaction. Through the trainer interface, the policy can be trained and checkpointed, and actions can be computed.
In addition, RLlib ships trainers for common reinforcement learning algorithms.
Ray examples/rllib-ppo is an example of training a PPO trainer.
All RLlib trainers are compatible with the Tune API. This enables them to be easily used in experiments with Tune.
Ray examples/rllib-with_tune is a simple hyperparameter sweep of PPO.