Agit is a Git service and beyond: it provides collaboration from coding to model training.
This repository contains basic Python examples that run in the Agit training environment.
Note that Agit currently offers a Python 3.7 environment.
Everything printed to the console can be viewed by clicking the console in the training page. After the training, the console content is saved to the file agit_console.log in the training results folder. The folder is listed under the storage tab -> results tab.
basicIO/1-print_to_console is a single-line Python example.
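A minimal equivalent of that single-line example (the message text is an assumption, not the repository's):

```python
# Anything printed to stdout appears in the training page's console and is
# saved to agit_console.log after the run.
print("Hello from the Agit training environment!")
```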
When the program runs in the Agit training environment, the current directory is the same as the directory in the Git repository. All files in the repository can be directly accessed.
basicIO/2-read_repo_file is a simple example of reading a repository file.
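Since the working directory is the repository root, repository files can be opened with plain relative paths. A sketch along the lines of that example (the helper name is ours):

```python
def read_repo_file(path="repo_file.txt"):
    # The current directory is the repository root in the Agit training
    # environment, so relative paths resolve to repository files.
    with open(path) as f:
        return f.read()
```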
In the Agit training environment, all files written to the path /root/.agit/ are automatically moved into the storage space after the training. Note that a maximum of 2GB of storage is supported in the above path; more than 2GB will be supported later. After the training, the files are stored in a training folder named with the training name and ID. The folder is listed under the storage tab -> results tab.
basicIO/3-write_file.py is a simple example of writing a file to the storage.
NOTE: files not in the path /root/.agit/ are discarded after the training.
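A small sketch of writing a result file under /root/.agit/ (the helper name and file name are ours):

```python
import os

# Only files under this path survive the training and are moved to storage.
RESULT_DIR = "/root/.agit"

def write_result(name, text, result_dir=RESULT_DIR):
    os.makedirs(result_dir, exist_ok=True)
    path = os.path.join(result_dir, name)
    with open(path, "w") as f:
        f.write(text)
    return path
```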
Data files uploaded to the storage tab -> datasets tab can be accessed in the Agit training environment. Agit currently provides a simple way to read storage files: an open method (from agit import open) that treats agit:// as the root directory of the storage space. basicIO/4-read_dataset.py is a simple example of reading a file from the storage.
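A sketch of reading from the storage. The agit module is only available inside the Agit environment, so this falls back to the built-in open elsewhere; the exact agit:// path layout is an assumption:

```python
# `from agit import open` only works inside the Agit training environment;
# fall back to the built-in open so the sketch stays runnable elsewhere.
try:
    from agit import open  # Agit's open resolves agit:// storage paths
except ImportError:
    pass  # outside Agit, the plain built-in open is used

def read_storage_file(path):
    # Inside Agit, `path` can be an agit:// URL rooted at the storage space
    # (the dataset file layout here is hypothetical).
    with open(path) as f:
        return f.read()
```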
A built-in TensorBoard can track the training progress. To use TensorBoard, the program needs to specify the log path to be /root/.agit. In the training page, the TensorBoard can be viewed by clicking tensorboard. After the training, the TensorBoard log file is moved to the storage.
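A sketch of pointing a TensorBoard writer at /root/.agit, here using PyTorch's SummaryWriter (any TensorBoard-compatible writer should work; the helper name is ours):

```python
import os

# Agit picks up TensorBoard logs written under this path.
LOG_DIR = "/root/.agit"

def make_writer(log_dir=LOG_DIR):
    # SummaryWriter comes from PyTorch, assumed available in the training env.
    from torch.utils.tensorboard import SummaryWriter
    os.makedirs(log_dir, exist_ok=True)
    return SummaryWriter(log_dir=log_dir)

# Typical use during training:
#   writer = make_writer()
#   writer.add_scalar("loss", loss_value, step)
```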
Agit provides multi-machine training environments. A simple way to utilize these machines is through the Ray interface. The Ray framework has been configured in Agit by default. Several simple examples are shown as follows.
In Ray, we can create and compute on objects. We refer to these objects as remote objects, and we use object IDs
to refer to them. Remote objects are stored in shared-memory object stores, and there is one object store per node in the cluster. In the cluster setting, we may not actually know which machine each object lives on.
Ray_examples/ray_core-objects_in_Ray is an example that shows how to operate on objects in Ray.
Ray enables arbitrary Python functions to be executed asynchronously. These asynchronous Ray functions are called “remote functions (tasks)”. The standard way to turn a Python function into a remote function is to add the @ray.remote
decorator.
Ray_examples/ray_core-remote_functions is an example of Ray tasks.
Actors extend the Ray API from functions (tasks) to classes. The @ray.remote
decorator indicates that instances of the class will be actors. An actor is essentially a stateful worker. Each actor runs in its own Python process.
Ray_examples/ray_core-remote_classes is an example of Ray actors.
To specify a task’s CPU and GPU requirements, pass the num_cpus
and num_gpus
arguments into the remote decorator, like @ray.remote(num_cpus=2, num_gpus=0.5)
. The task will only run on a machine if there are enough CPU and GPU (and other custom) resources available to execute the task. Ray can also handle arbitrary custom resources.
You can specify resource requirements in actors too. When an actor decorated by @ray.remote(num_cpus=2, num_gpus=1)
is instantiated, it will be placed on a node that has at least 1 GPU, and the GPU will be reserved for the actor for the duration of the actor’s lifetime (even if the actor is not executing tasks). The GPU resources will be released when the actor terminates.
Ray_examples/ray_core-specify_resources is an example of specifying CPU and GPU resources for Ray tasks and actors.
Tune is a Python library for experiment execution and hyperparameter tuning at any scale.
It is easy to use and supports any machine learning framework, including PyTorch, XGBoost, MXNet, and Keras.
Tune natively integrates with optimization libraries such as ZOOpt, Bayesian Optimization, and Facebook Ax.
This example runs a small grid search to train a convolutional neural network using PyTorch and Tune.
Tune’s search algorithms are wrappers around open-source optimization libraries for efficient hyperparameter selection. Each library has a specific way of defining the search space - please refer to their documentation for more details.
Ray_examples/tune-ZOOpt is an example of using ZOOpt (a library for derivative-free optimization) to provide trial suggestions.
RLlib is an open-source library for reinforcement learning that offers both high scalability and a unified API for a variety of applications.
RLlib natively supports TensorFlow, TensorFlow Eager, and PyTorch, but most of its internals are framework agnostic.
At a high level, RLlib provides a Trainer class which holds a policy for environment interaction. Through the trainer interface, the policy can be trained, checkpointed, or used to compute an action.
In addition, trainers for common reinforcement learning algorithms have been integrated in RLlib.
Ray_examples/rllib-ppo is an example of training a PPO trainer.
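A sketch of the trainer loop, assuming the older ray.rllib.agents.ppo API contemporary with this repo (newer Ray moved it to ray.rllib.algorithms); the config values are illustrative:

```python
# Illustrative PPO config; any registered Gym environment works for "env".
config = {
    "env": "CartPole-v0",
    "num_workers": 1,
    "framework": "torch",
}

def train_ppo(num_iterations=2):
    import ray
    from ray.rllib.agents.ppo import PPOTrainer  # heavy import; Agit env assumed

    ray.init(ignore_reinit_error=True)
    trainer = PPOTrainer(config=config)
    for _ in range(num_iterations):
        result = trainer.train()  # one training iteration
        print(result["episode_reward_mean"])
    return trainer
```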
All RLlib trainers are compatible with the Tune API. This enables them to be easily used in experiments with Tune.
Ray_examples/rllib-with_tune is a simple hyperparameter sweep of PPO.