AgitPythonExamples

Examples of using Agit training.

Agit is a Git service and more: it supports collaboration all the way from coding to model training.

This repository contains basic Python examples that run in the Agit training environment.

Note that Agit currently offers a Python 3.7 environment.

Basic IO Operations

Print to console

Everything printed to the console can be viewed by clicking Console on the training page.

After the training, the console content is saved to the file agit_console.log in the training results folder. The folder is listed under the Storage -> Results tab.

basicIO/1-print_to_console is a single-line Python example.

Read files

When a program runs in the Agit training environment, the current directory is the root of the Git repository, so all files in the repository can be accessed directly.

basicIO/2-read_repo_file is a simple example of reading a repository file.
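As a sketch of the same idea, a repository file can be read with a plain relative path (the helper name and the file name used in the usage note are hypothetical, not part of the repository):

```python
from pathlib import Path


def read_repo_file(relative_path):
    """Return the text content of a file in the repository.

    In the Agit training environment the working directory is the
    repository root, so relative paths resolve to repository files.
    """
    return Path(relative_path).read_text(encoding="utf-8")
```

For example, `read_repo_file("data/sample.txt")` would return the contents of a (hypothetical) `data/sample.txt` committed to the repository.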

Write files

In the Agit training environment, all files written to the path /root/.agit/ are automatically moved to Storage -> Results after the training.

After the training, the files are stored in a folder named after the training name and ID. The folder is listed under the Storage -> Results tab.

basicIO/3-write_file.py is a simple example of writing a file to the storage.

NOTE: Files outside the path /root/.agit/ are discarded after the training.

NOTE: The current storage soft limit is 1 GB; once it is reached, new trainings cannot be started. The hard limit is 2 GB; once it is reached, file writes raise an OSError in Python code. A storage expansion service will be available in a future release.
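A minimal sketch of writing a result file that survives the training, following the /root/.agit/ convention above. The helper name is hypothetical, and the results_dir parameter exists only so the snippet can be tried outside the Agit environment:

```python
import os

# Files written here are moved to Storage -> Results after the training;
# anything written elsewhere is discarded.
AGIT_RESULTS = "/root/.agit"


def save_result(filename, text, results_dir=AGIT_RESULTS):
    """Write a text file under the results directory and return its path."""
    os.makedirs(results_dir, exist_ok=True)
    path = os.path.join(results_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path
```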

Read datasets

Data files uploaded under the Storage -> Datasets tab can be accessed in the Agit training environment.

Agit currently provides a simple way to read storage files:

  • First, override the built-in open method: from agit import open.
  • Second, use the path agit:// as the root directory of the storage space.

basicIO/4-read_dataset.py is a simple example of reading a file from the storage.
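The two steps above can be sketched as follows. The helper name and the root parameter are hypothetical, and the ImportError fallback exists only so the snippet stays runnable outside the Agit environment:

```python
try:
    # Inside Agit: this open() understands agit:// storage paths.
    from agit import open
except ImportError:
    pass  # outside Agit, fall back to the built-in open


def read_dataset(name, root="agit://"):
    """Read a file uploaded under Storage -> Datasets."""
    with open(root + name, "r") as f:
        return f.read()
```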

Using TensorBoard

A built-in TensorBoard can track the training progress.

To use TensorBoard, the program needs to be configured to write its logs under /root/.agit. The following file gives a simple example:

basicIO/5-tensorboard.py

On the training page, the TensorBoard can be viewed by clicking TensorBoard. After the training, the TensorBoard log files are moved to the storage.
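A minimal sketch of such a configuration, assuming PyTorch's SummaryWriter is available in the environment (the helper name and metric are hypothetical; the log_dir parameter is what points the logs at /root/.agit):

```python
from torch.utils.tensorboard import SummaryWriter


def log_metrics(log_dir="/root/.agit/tensorboard"):
    """Write dummy scalar metrics where the built-in TensorBoard reads them.

    Logs written under /root/.agit are visible to the built-in TensorBoard
    during training and preserved in storage afterwards.
    """
    writer = SummaryWriter(log_dir=log_dir)
    for step in range(100):
        loss = 1.0 / (step + 1)  # placeholder for a real training metric
        writer.add_scalar("loss", loss, step)
    writer.close()
    return log_dir
```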

Distributed Computing Using Ray Framework

Agit provides multi-machine training environments. A simple way to utilize these machines is through the Ray interface; the Ray framework is configured in Agit by default. Several simple examples are shown below.

Objects in Ray

In Ray, we can create and compute on objects. We refer to these objects as remote objects, and we use object IDs to refer to them. Remote objects are stored in shared-memory object stores, and there is one object store per node in the cluster. In the cluster setting, we may not actually know which machine each object lives on.

Ray_examples/ray_core-objects_in_Ray is an example that shows how to operate on objects in Ray.
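In the same spirit, a minimal sketch of putting and getting a remote object (on Agit, Ray is already configured; ignore_reinit_error only makes the snippet safe to rerun elsewhere):

```python
import ray

ray.init(ignore_reinit_error=True)

# ray.put stores an object in the shared-memory object store and
# returns an object ID that refers to it.
obj_id = ray.put([1, 2, 3])

# ray.get fetches the object behind the ID, on whichever node it lives.
value = ray.get(obj_id)
print(value)  # [1, 2, 3]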

Remote functions (Tasks)

Ray enables arbitrary Python functions to be executed asynchronously. These asynchronous Ray functions are called “remote functions (tasks)”. The standard way to turn a Python function into a remote function is to add the @ray.remote decorator.

Ray_examples/ray_core-remote_functions is an example of Ray tasks.

Remote classes (Actors)

Actors extend the Ray API from functions (tasks) to classes. The @ray.remote decorator indicates that instances of the class will be actors. An actor is essentially a stateful worker. Each actor runs in its own Python process.

Ray_examples/ray_core-remote_classes is an example of Ray actors.

Specify resources

To specify a task’s CPU and GPU requirements, pass the num_cpus and num_gpus arguments into the remote decorator, like @ray.remote(num_cpus=2, num_gpus=0.5). The task will only run on a machine if there are enough CPU and GPU (and other custom) resources available to execute the task. Ray can also handle arbitrary custom resources.

You can specify resource requirements in actors too. When an actor decorated by @ray.remote(num_cpus=2, num_gpus=1) is instantiated, it will be placed on a node that has at least 1 GPU, and the GPU will be reserved for the actor for the duration of the actor’s lifetime (even if the actor is not executing tasks). The GPU resources will be released when the actor terminates.

Ray_examples/ray_core-specify_resources is an example of specifying CPU and GPU resources for Ray tasks and actors.

Ray Tune

Tune is a Python library for experiment execution and hyperparameter tuning at any scale.

It is easy to use and supports any machine learning framework, including PyTorch, XGBoost, MXNet, and Keras.

Tune natively integrates with optimization libraries such as ZOOpt, Bayesian Optimization, and Facebook Ax.

Tune with grid search algorithm

This example runs a small grid search to train a convolutional neural network using PyTorch and Tune.

Ray_examples/tune-grid_search

Tune with advanced search algorithms

Tune’s search algorithms are wrappers around open-source optimization libraries for efficient hyperparameter selection. Each library has a specific way of defining the search space - please refer to their documentation for more details.

Ray_examples/tune-ZOOpt is an example of using ZOOpt (a library for derivative-free optimization) to provide trial suggestions.

Ray RLlib

RLlib is an open-source library for reinforcement learning that offers both high scalability and a unified API for a variety of applications.

RLlib natively supports TensorFlow, TensorFlow Eager, and PyTorch, but most of its internals are framework agnostic.

RLlib with PPO algorithm

At a high level, RLlib provides a Trainer class which holds a policy for environment interaction. Through the trainer interface, the policy can be trained, checkpointed, or used to compute an action.

In addition, trainers for common reinforcement learning algorithms are already integrated into RLlib.

Ray_examples/rllib-ppo is an example of training a PPO trainer.

RLlib with Tune

All RLlib trainers are compatible with the Tune API. This enables them to be easily used in experiments with Tune.

Ray_examples/rllib-with_tune is a simple hyperparameter sweep of PPO.