
# Distributed Computing Using the Ray Framework

Agit provides multi-machine training environments. A simple way to utilize these machines is through the Ray interface; the Ray framework is configured in Agit by default. Several simple examples follow.

## Objects in Ray

In Ray, we can create and compute on objects. We refer to these objects as remote objects, and we use object IDs to refer to them. Remote objects are stored in shared-memory object stores, and there is one object store per node in the cluster. In the cluster setting, we may not actually know which machine each object lives on.

`Ray_examples/ray_core-objects_in_Ray` is an example that shows how to work with objects in Ray.
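
As a minimal sketch of the API this example builds on (standard Ray calls; the full script is in the repository), `ray.put` stores a value in the object store and `ray.get` fetches it back:

```python
import ray

ray.init()  # connect to the Ray cluster (or start a local one)

# Put a Python object into the shared-memory object store.
# ray.put returns an object reference (object ID) to the stored value.
obj_ref = ray.put([1, 2, 3])

# Fetch the value behind the reference; this blocks until it is available.
assert ray.get(obj_ref) == [1, 2, 3]
```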

## Remote functions (Tasks)

Ray enables arbitrary Python functions to be executed asynchronously. These asynchronous Ray functions are called "remote functions" (tasks). The standard way to turn a Python function into a remote function is to add the `@ray.remote` decorator.

`Ray_examples/ray_core-remote_functions` is an example of Ray tasks.
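
A minimal task sketch using the standard Ray API (a toy `square` function, not the repository's exact code):

```python
import ray

ray.init()

# The @ray.remote decorator turns a plain Python function into a Ray task.
@ray.remote
def square(x):
    return x * x

# .remote() schedules the task asynchronously and immediately returns
# an object reference; the computation runs somewhere in the cluster.
futures = [square.remote(i) for i in range(4)]

# ray.get blocks until all results are ready.
print(ray.get(futures))  # [0, 1, 4, 9]
```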

## Remote classes (Actors)

Actors extend the Ray API from functions (tasks) to classes. The `@ray.remote` decorator indicates that instances of the class will be actors. An actor is essentially a stateful worker. Each actor runs in its own Python process.

`Ray_examples/ray_core-remote_classes` is an example of Ray actors.
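
A minimal actor sketch (a toy `Counter` class, not the repository's exact code):

```python
import ray

ray.init()

# Decorating a class makes its instances actors: stateful workers,
# each running in its own Python process.
@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

counter = Counter.remote()  # instantiate the actor (starts its process)
refs = [counter.increment.remote() for _ in range(3)]
print(ray.get(refs))  # [1, 2, 3]; state persists across method calls
```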

## Specifying resources

To specify a task's CPU and GPU requirements, pass the `num_cpus` and `num_gpus` arguments to the remote decorator, e.g. `@ray.remote(num_cpus=2, num_gpus=0.5)`. The task will only run on a machine with enough CPU and GPU (and other custom) resources available to execute it. Ray can also handle arbitrary custom resources.

You can specify resource requirements for actors too. When an actor decorated with `@ray.remote(num_cpus=2, num_gpus=1)` is instantiated, it is placed on a node that has at least 1 GPU, and that GPU is reserved for the actor's entire lifetime (even while the actor is not executing tasks). The GPU resources are released when the actor terminates.

`Ray_examples/ray_core-specify_resources` is an example of specifying CPU and GPU resources for Ray tasks and actors.
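
For illustration, the decorator arguments look like this (the task and actor bodies below are toy placeholders; running the GPU variants requires a node that actually has GPUs):

```python
import ray

ray.init()

# This task is only scheduled on a node with 2 CPUs and 0.5 GPUs free.
@ray.remote(num_cpus=2, num_gpus=0.5)
def train_step():
    return ray.get_gpu_ids()  # the GPU IDs assigned to this task

# This actor reserves 1 GPU for its entire lifetime.
@ray.remote(num_cpus=2, num_gpus=1)
class Trainer:
    def gpu_ids(self):
        return ray.get_gpu_ids()

trainer = Trainer.remote()
print(ray.get(trainer.gpu_ids.remote()))
```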

## Ray Tune

Tune is a Python library for experiment execution and hyperparameter tuning at any scale.

It is easy to use and supports any machine learning framework, including PyTorch, XGBoost, MXNet, and Keras.

Tune natively integrates with optimization libraries such as ZOOpt, Bayesian Optimization, and Facebook Ax.

### Tune with the grid search algorithm

This example runs a small grid search to train a convolutional neural network using PyTorch and Tune.

`Ray_examples/tune-grid_search` contains this example.
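
The shape of such a grid search looks like the following sketch (a toy objective instead of the repository's PyTorch CNN; this uses the `tune.run` API of Ray 1.x, which newer Ray versions replace with a `Tuner` API):

```python
from ray import tune

# A toy trainable: Tune calls it once per hyperparameter combination
# and collects the metrics reported via tune.report.
def objective(config):
    loss = (config["lr"] * 100 - 1) ** 2 + (1 - config["momentum"])
    tune.report(loss=loss)

analysis = tune.run(
    objective,
    config={
        # grid_search enumerates every combination: 3 x 2 = 6 trials.
        "lr": tune.grid_search([0.001, 0.01, 0.1]),
        "momentum": tune.grid_search([0.8, 0.9]),
    },
)
print(analysis.get_best_config(metric="loss", mode="min"))
```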

### Tune with advanced search algorithms

Tune's search algorithms are wrappers around open-source optimization libraries for efficient hyperparameter selection. Each library has its own way of defining the search space; refer to its documentation for details.

`Ray_examples/tune-ZOOpt` is an example of using ZOOpt (a library for derivative-free optimization) to suggest trials.
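
A hedged sketch of the ZOOpt integration (the import path `ray.tune.suggest.zoopt` matches Ray 1.x and has moved in later versions; the objective is a toy function, not the repository's code):

```python
from ray import tune
from ray.tune.suggest.zoopt import ZOOptSearch  # Ray 1.x module path
from zoopt import ValueType

def objective(config):
    tune.report(loss=(config["height"] - 3) ** 2 + config["width"])

# ZOOpt's own search-space format: (value type, range, resolution/order flag).
dim_dict = {
    "height": (ValueType.CONTINUOUS, [-10, 10], 1e-2),
    "width": (ValueType.DISCRETE, [0, 10], True),
}

zoopt_search = ZOOptSearch(
    algo="Asracos",   # the racing-based algorithm this integration supports
    budget=20,        # total number of trials ZOOpt may propose
    dim_dict=dim_dict,
    metric="loss",
    mode="min",
)

tune.run(objective, search_alg=zoopt_search, num_samples=20)
```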

## Ray RLlib

RLlib is an open-source library for reinforcement learning that offers both high scalability and a unified API for a variety of applications.

RLlib natively supports TensorFlow, TensorFlow Eager, and PyTorch, but most of its internals are framework agnostic.

### RLlib with the PPO algorithm

At a high level, RLlib provides a Trainer class that holds a policy for environment interaction. Through the trainer interface, the policy can be trained, checkpointed, or used to compute an action.

In addition, trainers for common reinforcement learning algorithms are built into RLlib.

`Ray_examples/rllib-ppo` is an example of training a PPO trainer.
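
A minimal sketch of driving a PPO trainer directly (the `ray.rllib.agents.ppo` import path matches Ray 1.x; newer versions rename these classes):

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer  # Ray 1.x import path

ray.init()

# The trainer holds the PPO policy and its rollout workers.
trainer = PPOTrainer(env="CartPole-v0", config={"num_workers": 2})

for i in range(3):
    result = trainer.train()  # one training iteration
    print(i, result["episode_reward_mean"])

checkpoint_path = trainer.save()  # checkpoint the policy
print("checkpoint saved at", checkpoint_path)
```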

### RLlib with Tune

All RLlib trainers are compatible with the Tune API. This enables them to be easily used in experiments with Tune.

`Ray_examples/rllib-with_tune` is a simple hyperparameter sweep of PPO.
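
Such a sweep can be expressed in a few lines (Ray 1.x `tune.run` style; `"PPO"` is the trainer's registered name, and the stopping threshold below is an arbitrary illustration):

```python
from ray import tune

# Tune looks up the RLlib trainer by its registered name ("PPO") and
# runs one trial per grid-searched learning rate.
tune.run(
    "PPO",
    stop={"episode_reward_mean": 150},
    config={
        "env": "CartPole-v0",
        "num_workers": 1,
        "lr": tune.grid_search([1e-4, 5e-5]),
    },
)
```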