# RLs

**Repository Path**: rayufo/RLs

## Basic Information

- **Project Name**: RLs
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-12-16
- **Last Updated**: 2021-11-02

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# RLs

:evergreen_tree::evergreen_tree::evergreen_tree: Reinforcement learning algorithms based on TensorFlow 2.x.

This project includes state-of-the-art and classic reinforcement learning (RL) algorithms for training agents that interact with Unity through [ml-agents](https://github.com/Unity-Technologies/ml-agents/tree/release_8) Release 8 or with [gym](https://github.com/openai/gym). The goal of this framework is to provide stable implementations of standard RL algorithms while enabling fast prototyping of new methods.

![](./pics/framework.jpg)

## About

It aims to fill the need for a small, easily grokked codebase in which users can freely experiment with wild ideas (speculative research).

### Characteristics

- Runs on Windows, Linux, and macOS
- Faithful reimplementations with performance competitive with the original papers
- Reusable modules
- Clear hierarchical structure and easy code control
- Compatible with OpenAI Gym and Unity3D ML-Agents
- Resume training from where it stopped, retrain on a new task, or fine-tune
- Initialize parameters from another training task's model by specifying `--load`

### Supports

This project supports:

- Unity3D ml-agents.
- Gym (MuJoCo, [PyBullet](https://github.com/bulletphysics/bullet3), [gym_minigrid](https://github.com/maximecb/gym-minigrid)); for now only two space types are compatible: `[Box, Discrete]`. 99.65% of Gym environment settings are supported (the exceptions are `Blackjack-v0`, `KellyCoinflip-v0`, and `KellyCoinflipGeneralized-v0`). Parallel training with gym environments is supported; just set `--copys` to the number of agents you want to train in parallel.
  - Discrete -> Discrete (observation type -> action type)
  - Discrete -> Box
  - Box -> Discrete
  - Box -> Box
  - Box/Discrete -> Tuple(Discrete, Discrete, Discrete)
- Multi-agent training: one group controls multiple agents.
- Multi-brain training: the brains must use the same algorithm or share the same learning schedule (per step or per episode).
- Multi-image input (ml-agents only): images are resized to the same shape, e.g. `[84, 84, 3]`, before being stored in the replay buffer.
- Four types of replay buffer, the default being ER (a minimal sketch of the n-step idea follows this list):
  - ER
  - n-step ER
  - [Prioritized ER](https://arxiv.org/abs/1511.05952)
  - n-step Prioritized ER
- [Noisy Net](https://arxiv.org/abs/1706.10295) for better exploration.
- [Intrinsic Curiosity Module](https://arxiv.org/abs/1705.05363) for almost all of the implemented off-policy algorithms.
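The n-step buffer variants replace the single-step reward of a stored transition with a discounted n-step return before the transition is written to the buffer. The snippet below is a minimal, framework-agnostic sketch of that accumulation step under the usual definitions; it is illustrative only and is not the replay-buffer code used in this repository.

```python
from collections import deque

# Minimal sketch of n-step experience accumulation (illustrative only; it is
# not the replay-buffer implementation used in this repository).
class NStepAccumulator:
    def __init__(self, n=4, gamma=0.99):
        self.n, self.gamma = n, gamma
        self.queue = deque(maxlen=n)

    def push(self, obs, action, reward, next_obs, done):
        """Add a 1-step transition; return an n-step transition once ready."""
        self.queue.append((obs, action, reward, next_obs, done))
        if len(self.queue) < self.n and not done:
            return None
        # Fold the queued rewards into one discounted n-step return.
        obs0, act0 = self.queue[0][0], self.queue[0][1]
        ret, n_next, n_done = 0.0, self.queue[-1][3], self.queue[-1][4]
        for i, (_, _, r, nx, d) in enumerate(self.queue):
            ret += (self.gamma ** i) * r
            if d:                        # episode ended inside the window
                n_next, n_done = nx, True
                break
        if n_done:
            self.queue.clear()           # do not bootstrap across episodes
        else:
            self.queue.popleft()         # slide the window by one step
        return (obs0, act0, ret, n_next, n_done)
```

The transition returned here would then be stored in a plain ER buffer, or in a prioritized one to obtain the n-step Prioritized ER variant listed above.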
### Advantages

- Parallel training of multiple scenes for Gym
- Unified environment data format shared by ml-agents and gym
- Implementing another algorithm only requires writing a single file (algorithms share a similar structure)
- Many controllable factors and adjustable parameters

## Installation

```bash
$ git clone https://github.com/StepNeverStop/RLs.git
$ cd RLs
$ conda create -n rls python=3.6
$ conda activate rls
# Windows
$ pip install -e .[windows]
# Linux or Mac OS
$ pip install -e .
```

If using ml-agents:

```bash
$ pip install -e .[unity]
```

If using atari:

```bash
$ pip install -e .[atari]
```

You can pull a prebuilt Docker image from [here](https://hub.docker.com/r/keavnn/rls):

```bash
$ docker pull keavnn/rls:latest
```

## Implemented Algorithms

For now, these algorithms are available:

- Single-agent training algorithms (some algorithms that only support continuous action spaces, e.g. DDPG, use the Gumbel-softmax trick to implement discrete versions; a minimal illustration of the trick follows this list):
  - Q-Learning, Sarsa, Expected Sarsa
  - :bug:Policy Gradient, PG
  - :bug:Actor Critic, AC
  - Advantage Actor Critic, A2C
  - [Trust Region Policy Optimization, TRPO](https://arxiv.org/abs/1502.05477)
  - :boom:[Proximal Policy Optimization, PPO](https://arxiv.org/abs/1707.06347)
  - [Deterministic Policy Gradient, DPG](https://hal.inria.fr/file/index/docid/938992/filename/dpg-icml2014.pdf)
  - [Deep Deterministic Policy Gradient, DDPG](https://arxiv.org/abs/1509.02971)
  - :fire:[Soft Actor Critic, SAC](https://arxiv.org/abs/1812.05905)
  - [Tsallis Actor Critic, TAC](https://arxiv.org/abs/1902.00137)
  - :fire:[Twin Delayed Deep Deterministic Policy Gradient, TD3](https://arxiv.org/abs/1802.09477)
  - Deep Q-learning Network, DQN, [2013](https://arxiv.org/pdf/1312.5602.pdf), [2015](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf)
  - [Double Deep Q-learning Network, DDQN](https://arxiv.org/abs/1509.06461)
  - [Dueling Double Deep Q-learning Network, DDDQN](https://arxiv.org/abs/1511.06581)
  - [Deep Recurrent Q-learning Network, DRQN](https://arxiv.org/abs/1507.06527)
  - [Deep Recurrent Double Q-learning, DRDQN](https://arxiv.org/abs/1908.06040)
  - [Category 51, C51](https://arxiv.org/abs/1707.06887)
  - [Quantile Regression DQN, QR-DQN](https://arxiv.org/abs/1710.10044)
  - [Implicit Quantile Networks, IQN](https://arxiv.org/abs/1806.06923)
  - [Rainbow DQN](https://arxiv.org/abs/1710.02298)
  - [MaxSQN](https://github.com/createamind/DRL/blob/master/spinup/algos/maxsqn/maxsqn.py)
  - [Soft Q-Learning, SQL](https://arxiv.org/abs/1702.08165)
  - [Bootstrapped DQN](http://arxiv.org/abs/1602.04621)
  - [Contrastive Unsupervised RL, CURL](http://arxiv.org/abs/2004.04136)
- Hierarchical training algorithms:
  - [Option-Critic, OC](http://arxiv.org/abs/1609.05140)
  - [Asynchronous Advantage Option-Critic, A2OC](http://arxiv.org/abs/1709.04571)
  - [PPO Option-Critic, PPOC](http://arxiv.org/abs/1712.00004)
  - [Interest-Option-Critic, IOC](http://arxiv.org/abs/2001.00271)
  - [HIerarchical Reinforcement learning with Off-policy correction, HIRO](http://arxiv.org/abs/1805.08296)
- Multi-agent training algorithms (*Unity3D only; visual input is not supported yet*):
  - [Multi-Agent Deep Deterministic Policy Gradient, MADDPG](https://arxiv.org/abs/1706.02275)
- Safe reinforcement learning algorithms (*not stable yet*):
  - [Primal-Dual Deep Deterministic Policy Gradient, PD-DDPG](http://arxiv.org/abs/1802.06480)
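The Gumbel-softmax trick mentioned above draws a differentiable, approximately one-hot sample from a categorical distribution, which is how deterministic continuous-control methods such as DDPG can be adapted to discrete actions. Below is a minimal, generic TensorFlow 2.x illustration of the sampling step; the function name and temperature value are illustrative, and this is not the repository's exact implementation.

```python
import tensorflow as tf

def gumbel_softmax_sample(logits, tau=1.0, eps=1e-20):
    """Differentiable, approximately one-hot sample from a categorical
    distribution (generic sketch of the Gumbel-softmax trick)."""
    # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1).
    u = tf.random.uniform(tf.shape(logits), minval=0.0, maxval=1.0)
    g = -tf.math.log(-tf.math.log(u + eps) + eps)
    # The temperature tau controls how close the sample is to a hard one-hot.
    return tf.nn.softmax((logits + g) / tau, axis=-1)

# Example: relax a 4-way discrete action for a deterministic-policy learner.
logits = tf.constant([[2.0, 0.5, -1.0, 0.1]])
soft_action = gumbel_softmax_sample(logits, tau=0.5)                 # sums to 1
hard_action = tf.one_hot(tf.argmax(soft_action, axis=-1), depth=4)   # env-side action
```

In a straight-through setup the gradient flows through `soft_action` while the environment receives `hard_action`.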
| Algorithms (29) | Discrete | Continuous | Image | RNN | Command parameter |
| :-----------------------------: | :------: | :--------: | :---: | :--: | :---------------: |
| Q-Learning/Sarsa/Expected Sarsa | √ | | | | qs |
| PG | √ | √ | √ | | pg |
| AC | √ | √ | √ | √ | ac |
| A2C | √ | √ | √ | | a2c |
| TRPO | √ | √ | √ | | trpo |
| PPO | √ | √ | √ | | ppo |
| DQN | √ | | √ | √ | dqn |
| Double DQN | √ | | √ | √ | ddqn |
| Dueling Double DQN | √ | | √ | √ | dddqn |
| Bootstrapped DQN | √ | | √ | √ | bootstrappeddqn |
| Soft Q-Learning | √ | | √ | √ | sql |
| C51 | √ | | √ | √ | c51 |
| QR-DQN | √ | | √ | √ | qrdqn |
| IQN | √ | | √ | √ | iqn |
| Rainbow | √ | | √ | √ | rainbow |
| DPG | √ | √ | √ | √ | dpg |
| DDPG | √ | √ | √ | √ | ddpg |
| PD-DDPG | √ | √ | √ | √ | pd_ddpg |
| TD3 | √ | √ | √ | √ | td3 |
| SAC (has V network) | √ | √ | √ | √ | sac_v |
| SAC | √ | √ | √ | √ | sac |
| TAC | √ | √ | √ | √ | tac |
| MaxSQN | √ | | √ | √ | maxsqn |
| MADDPG | | √ | | √ | maddpg |
| OC | √ | √ | √ | √ | oc |
| AOC | √ | √ | √ | √ | aoc |
| PPOC | √ | √ | √ | √ | ppoc |
| IOC | √ | √ | √ | √ | ioc |
| HIRO | √ | √ | | | hiro |
| CURL | √ | √ | √ | | curl |

## Getting started

```python
"""
Usage:
    python [options]

Options:
    -h,--help               show this help message
    -a,--algorithm=         specify the training algorithm [default: ppo]
    -c,--copys=             number of environment copies that collect data in parallel [default: 1]
    -e,--env=               specify the path of the built UNITY3D training environment [default: None]
    -g,--graphic            whether to show the graphic interface when using UNITY3D [default: False]
    -i,--inference          run inference with the trained model instead of training policies [default: False]
    -m,--models=            specify the number of trials that use different random seeds [default: 1]
    -n,--name=              specify the name of this training task [default: None]
    -p,--port=              specify the port used to communicate with the UNITY3D training environment [default: 5005]
    -r,--rnn                whether to use an RNN model (GRU, LSTM, ...) [default: False]
    -s,--save-frequency=    specify the interval for saving model checkpoints [default: None]
    -t,--train-step=        specify the number of training steps for optimizing the policy model [default: None]
    -u,--unity              whether to train with the UNITY3D editor [default: False]
    --apex=                 i.e. "learner"/"worker"/"buffer"/"evaluator" [default: None]
    --unity-env=            specify the name of the UNITY3D training environment [default: None]
    --config-file=          specify the path of the training configuration file [default: None]
    --store-dir=            specify the directory that stores models, logs, and other data [default: None]
    --seed=                 specify the global random seed for the random, numpy, and tensorflow modules [default: 42]
    --unity-env-seed=       specify the random seed of the UNITY3D environment [default: 42]
    --max-step=             specify the maximum number of steps per episode [default: None]
    --train-episode=        specify the maximum number of training episodes [default: None]
    --train-frame=          specify the maximum number of training steps interacting with the environment [default: None]
    --load=                 specify the name of the pre-trained model to load [default: None]
    --prefill-steps=        specify the number of experiences to collect before training starts, used for off-policy algorithms [default: None]
    --prefill-choose        whether to choose pre-fill (no_op) actions with the model instead of randomly [default: False]
    --gym                   whether to train with a gym environment [default: False]
    --gym-env=              specify the gym environment name [default: CartPole-v0]
    --gym-env-seed=         specify the random seed of the gym environment [default: 42]
    --render-episode=       specify from which episode the gym environment starts rendering [default: None]
    --info=                 additional information describing this training task, wrapped in double quotes [default: None]
    --use-wandb             whether to upload the training log to WandB [default: False]
    --hostname              whether to append the hostname to the training name [default: False]
    --no-save               specify whether to save models/logs/summaries while training or not [default: False]

Example:
    gym:
        python run.py --gym -a dqn --gym-env CartPole-v0 -c 12 -n dqn_cartpole --no-save
    unity:
        python run.py -u -a ppo -n run_with_unity
        python run.py -e /root/env/3dball.app -a sac -n run_with_execution_file
"""
```

If you specify **gym**, **unity**, and an **environment executable file path** simultaneously, the following priority is applied: gym > unity > unity_env.
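Only `Box` and `Discrete` spaces are supported on the gym side (see Supports above), so before pointing `--gym-env` at a new environment it can help to confirm its space types. The following is an optional, generic check written against the plain gym API; it is not part of this repository's code.

```python
import gym

# Optional pre-check (plain gym code, not part of RLs): verify that an
# environment's observation and action spaces are among the supported types.
env = gym.make("CartPole-v0")
supported = (gym.spaces.Box, gym.spaces.Discrete)
print("observation space:", env.observation_space,
      "| supported:", isinstance(env.observation_space, supported))
print("action space:", env.action_space,
      "| supported:", isinstance(env.action_space, supported))
env.close()
```

If both checks pass, the environment can be trained with a command such as `python run.py --gym -a dqn --gym-env CartPole-v0` from the example above.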
## Notes

1. Logs, models, training parameter configurations, and data are stored in `C:\RLData` on Windows, or `$HOME/RLData` on Linux/macOS
2. You may need to use `su` or `sudo` to run on Linux/macOS
3. The record directory format is `RLData/Environment/Algorithm/Group name(for ml-agents)/Training name/config&log&model`
4. Make sure the number of brains is > 1 when specifying `ma*` algorithms such as maddpg
5. Multi-agent algorithms don't support visual input or PER for now
6. **Implementing a new algorithm takes 3 steps**
    1. Write a `.py` file in the `rls/algos/{single/multi/hierarchical}` directory and make the policy inherit from `Policy`, `On_Policy`, `Off_Policy`, or another super-class defined in `rls/algos/base`
    2. Write the default configuration in `rls/algos/config.yaml`
    3. Register the new algorithm in the *algos* dictionary in `rls/algos/__init__.py`, making sure the registered name matches the algorithm class name
7. Set algorithms' hyper-parameters in [rls/algos/config.yaml](https://github.com/StepNeverStop/RLs/blob/master/rls/algos/config.yaml)
8. Set the default training configuration in [config.yaml](https://github.com/StepNeverStop/RLs/blob/master/config.yaml)
9. Change the neural network structure in [rls/nn/models.py](https://github.com/StepNeverStop/RLs/blob/master/rls/nn/models.py)
10. MADDPG is currently only suitable for Unity3D ML-Agents. The group name in the training scene should be set as `{number of agents this group controls per environment copy}#{others}`, e.g. `2#Agents` means one group controls two identical agents in one environment copy.

## Ongoing things

- DARQN
- ACER
- Ape-X
- R2D2
- ~~ACKTR~~

## Giving credit

If using this repository for your research, please cite:

```
@misc{RLs,
  author = {Keavnn},
  title = {RLs: Reinforcement Learning research framework for Unity3D and Gym},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/StepNeverStop/RLs}},
}
```

## Issues

For any questions or errors concerning this project, please let me know [here](https://github.com/StepNeverStop/RLs/issues/new).