# RL-Benchmark

An RL benchmark. Easy to use, experiment with, visualize and compare results, and extend.

# Supported Algorithms

- [ ] Model-free
  - [x] Value-based
    - [x] DQN (2013)
    - [x] Double DQN (2015)
    - [x] Dueling DQN (2015)
  - [ ] Policy-based
    - [x] Reinforce (1992)
    - [x] Actor-Critic (2000)
    - [ ] TRPO (2015)
    - [ ] DDPG (2015)
    - [ ] A2C (2016)
    - [ ] A3C (2016)
    - [ ] ACKTR (2017)
    - [ ] PPO (2017)
    - [ ] SAC (2018)
    - [ ] TD3 (2018)
- [ ] Model-based

# Usage

Use `python run.py -h` to see the available parameters and get help.

To run a **demo**, simply run `python run.py`.

# Benchmark Results

## Value-based

![](./result/DQN_CartPole-v1.png)
![](./result/DoubleDQN_CartPole-v1.png)
![](./result/DuelingDQN_CartPole-v1.png)

## Policy-based

![](./result/Reinforce_CartPole-v1.png)
![](./result/ActorCritic_CartPole-v1.png)

# Implementation Details and Tricks

Minimal PyTorch sketches illustrating several of these tricks are collected in the appendix at the end of this README.

## DQN

1. Replay buffer: implemented with a `deque`.
2. Target network: hard update (load the state dict every N iterations).
3. Only one hidden layer.
4. The DQN update uses `gather` in PyTorch to select the Q-values of the taken actions.
5. Data types: `torch.float` and `np.float`.
6. Fixed epsilon.
7. A target-update interval of 100 performs worse than an interval of 10.

Params:

```
Namespace(batch_size=64, benchmark='DQN', device='cuda:0', env='CartPole-v1', epoch=500, epsilon=0.01, gamma=0.95, hidden=128, lr=0.002, max_capacity=1000, plot=True, save=True, seed=0, target_update=10)
```

## Double DQN

1. Only the way the Q target is computed changes: instead of evaluating the target network's own greedy action, evaluate the greedy action of the online (original) network:

$$r + \gamma\, Q_{\text{target}}(s', a_{\text{target}}) \;\rightarrow\; r + \gamma\, Q_{\text{target}}(s', a_{\text{origin}}),$$

where $a_{\text{target}} = \arg\max_a Q_{\text{target}}(s', a)$ and $a_{\text{origin}} = \arg\max_a Q_{\text{origin}}(s', a)$.

Params:

```
Namespace(batch_size=64, benchmark='DoubleDQN', device='cuda:0', env='CartPole-v1', epoch=500, epsilon=0.01, gamma=0.95, hidden=128, lr=0.001, max_capacity=1000, plot=True, save=True, seed=0, target_update=10)
```

## Dueling DQN

1. Change the network: add an advantage-function head and a value-function head, and combine them into Q-values.

Params:

```
Namespace(batch_size=64, benchmark='DuelingDQN', device='cuda:0', env='CartPole-v1', epoch=500, epsilon=0.01, gamma=0.95, hidden=128, lr=0.001, max_capacity=1000, plot=True, save=True, seed=0, target_update=10)
```

## Reinforce

1. Use `action_dist = torch.distributions.Categorical(probs)` to sample actions.

Params:

```
Namespace(batch_size=64, benchmark='Reinforce', device='cuda:0', env='CartPole-v1', epoch=1000, epsilon=0.01, gamma=0.98, hidden=128, lr=0.001, max_capacity=1000, plot=True, save=True, seed=0, target_update=10)
```

## Actor-Critic

1. Use the TD error as the weighting factor of the policy-gradient objective.
2. Use `detach()` to cut off backpropagation through the TD target and TD error. (Very important!)
3. Two-timescale update: `1e-3` for the actor, `1e-2` for the critic. (This choice strongly affects performance.)

Params:

```
Namespace(actor_lr=0.001, batch_size=64, benchmark='ActorCritic', critic_lr=0.01, device='cuda:0', env='CartPole-v1', epoch=1000, epsilon=0.01, gamma=0.99, hidden=128, lr=0.001, max_capacity=1000, plot=True, save=True, seed=0, target_update=10)
```

# Reference

1. [动手学强化学习 (Hands-on Reinforcement Learning)](https://hrl.boyuai.com/)
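# Appendix: Illustrative Code Sketches

The snippets below are minimal, hypothetical PyTorch sketches of the tricks listed in the implementation notes. All names (`q_net`, `target_net`, `policy_net`, `actor`, `critic`, and the batch tensors) are assumptions made for illustration and are not taken from this repository's code.

## DQN: `gather`-based update

A sketch of DQN tricks 2 and 4: `gather` selects Q(s, a) for the actions actually taken, and a hard-updated target network provides the bootstrap target.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.95):
    # batch: tensors sampled from the replay buffer (the deque of trick 1).
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions that were actually taken:
    # gather picks, per row, the column given by the action index.
    q_values = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # Bootstrap target from the (hard-updated) target network.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)

    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```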
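## Double DQN: target computation

A sketch of the single change Double DQN makes: the greedy action in the target is chosen by the online (original) network and then evaluated by the target network.

```python
import torch

def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.95):
    with torch.no_grad():
        # a_origin: greedy action according to the online network...
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # ...evaluated by the target network.
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
    return rewards + gamma * next_q * (1.0 - dones)
```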
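## Dueling DQN: network change

A sketch of the dueling architecture: a shared hidden layer followed by separate value and advantage heads, recombined as Q = V + A - mean(A).

```python
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # V(s)
        self.advantage = nn.Linear(hidden, action_dim)  # A(s, a)

    def forward(self, x):
        h = self.feature(x)
        v = self.value(h)
        a = self.advantage(h)
        # Subtract the mean advantage so V and A remain identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```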
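## Reinforce: sampling from a Categorical distribution

A sketch of drawing actions with `torch.distributions.Categorical`; the policy network is assumed to output normalized action probabilities.

```python
import torch

def sample_action(policy_net, state):
    probs = policy_net(state)                   # shape (action_dim,), sums to 1
    action_dist = torch.distributions.Categorical(probs)
    action = action_dist.sample()
    log_prob = action_dist.log_prob(action)     # kept for the policy-gradient loss
    return action.item(), log_prob
```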
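## Actor-Critic: TD error, `detach()`, and two learning rates

A sketch of a single-transition Actor-Critic update, assuming `actor(state)` returns action probabilities and `critic(state)` returns a scalar state value: the TD error weights the policy gradient, `detach()` keeps gradients from flowing through the TD target and from the actor loss into the critic, and the two optimizers use different learning rates.

```python
import torch

def actor_critic_update(actor, critic, actor_opt, critic_opt,
                        state, action, reward, next_state, done, gamma=0.99):
    # state / next_state: 1-D float tensors; action: int; reward, done: floats.
    td_target = reward + gamma * critic(next_state) * (1.0 - done)
    # detach() cuts the TD target out of the graph (trick 2, very important).
    td_error = td_target.detach() - critic(state)

    critic_loss = td_error.pow(2).mean()                  # regress V(s) toward the TD target
    log_prob = torch.log(actor(state)[action])
    actor_loss = -(log_prob * td_error.detach()).mean()   # TD error weights the policy gradient

    actor_opt.zero_grad()
    critic_opt.zero_grad()
    (actor_loss + critic_loss).backward()
    actor_opt.step()   # e.g. lr = 1e-3 for the actor
    critic_opt.step()  # e.g. lr = 1e-2 for the critic
    return td_error.item()
```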