# Reinforcement Learning and Game Theory Midterm Project

| student number | name |
| --- | --- |
| 20337117 | 王俊艺 |
| 20337138 | 许展浩 |

## 1 Introduction

### 1.1 Breakout Game Description

In a Breakout game:

* The player is given a paddle that can be moved horizontally.
* At the beginning of each turn, a ball drops automatically from somewhere on the screen.
* The paddle can be used to bounce the ball back.
* There are layers of bricks in the upper part of the screen.
* The player is rewarded for destroying as many bricks as possible by hitting them with the bouncing ball.
* The player is given 5 turns in each game.

### 1.2 Play the Breakout Game with DQN

The algorithm used is Nature DQN.

![DQN](DQN.png)

## 2 Details of the Base Implementation

[Here](https://gitee.com/goluke/dqn-breakout) is the base implementation.

### 2.1 main.py

Several parameters are defined at the top of this file.

```python
GAMMA = 0.99  # discount factor
GLOBAL_SEED = 0
MEM_SIZE = 100_000  # total memory size
RENDER = False
SAVE_PREFIX = "./models"
STACK_SIZE = 4

# parameters of the epsilon-greedy policy
EPS_START = 1.
EPS_END = 0.1
EPS_DECAY = 1000000

BATCH_SIZE = 32
POLICY_UPDATE = 4  # the frequency of updating the policy network
TARGET_UPDATE = 10_000  # the frequency of updating the target network
WARM_STEPS = 50_000  # the number of warm-up steps
MAX_STEPS = 50_000_000  # total learning steps
EVALUATE_FREQ = 100_000  # the frequency of evaluating the reward
```

The three instances below are created from classes defined in `utils_env.py`, `utils_drl.py` and `utils_memory.py`, respectively.

```python
env = MyEnv(device)
agent = Agent(
    env.get_action_dim(),
    device,
    GAMMA,
    new_seed(),
    EPS_START,
    EPS_END,
    EPS_DECAY,
)
memory = ReplayMemory(STACK_SIZE + 1, MEM_SIZE, device)
```

Then comes the main training loop.

```python
#### Training ####
obs_queue: deque = deque(maxlen=5)
done = True

progressive = tqdm(range(MAX_STEPS), total=MAX_STEPS,
                   ncols=50, leave=False, unit="b")
for step in progressive:
    if done:
        observations, _, _ = env.reset()
        for obs in observations:
            obs_queue.append(obs)

    training = len(memory) > WARM_STEPS
    state = env.make_state(obs_queue).to(device).float()
    action = agent.run(state, training)
    obs, reward, done = env.step(action)
    obs_queue.append(obs)
    memory.push(env.make_folded_state(obs_queue), action, reward, done)

    if step % POLICY_UPDATE == 0 and training:
        agent.learn(memory, BATCH_SIZE)

    if step % TARGET_UPDATE == 0:
        agent.sync()

    if step % EVALUATE_FREQ == 0:
        avg_reward, frames = env.evaluate(obs_queue, agent, render=RENDER)
        with open("rewards.txt", "a") as fp:
            fp.write(f"{step//EVALUATE_FREQ:3d} {step:8d} {avg_reward:.1f}\n")
        if RENDER:
            prefix = f"eval_{step//EVALUATE_FREQ:03d}"
            os.mkdir(prefix)
            for ind, frame in enumerate(frames):
                with open(os.path.join(prefix, f"{ind:06d}.png"), "wb") as fp:
                    frame.save(fp, format="png")
        agent.save(os.path.join(
            SAVE_PREFIX, f"model_{step//EVALUATE_FREQ:03d}"))
        done = True
```

1. Check whether the game is done. If so, the environment is reset and returns observations that are pushed into the observation queue.
2. The agent then chooses an action according to the current state. After executing the action we get the newest observation, the reward, and a bool value indicating whether the episode has terminated.
3. Append the observation to the observation queue and push the folded state, action, reward, and done flag into the replay memory (see the sketch after this list).
4. Update the policy network and the target network at their respective intervals (`POLICY_UPDATE` and `TARGET_UPDATE`).
5. When it is time to evaluate the agent, compute and save the average reward obtained by letting the agent play the game several times. Render the game frames if needed, and save the agent's network weights.
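The observation queue holds five frames while the network consumes only four because the replay memory stores a *folded* state of `STACK_SIZE + 1` consecutive frames, from which both the current state and the next state can be recovered. Below is a minimal sketch of that idea; the tensor shapes and the exact slicing performed inside `utils_memory.py` are our assumptions for illustration, not the repository's actual code.

```python
from collections import deque

import torch

# Five consecutive (dummy) 84x84 frames, as held by obs_queue.
obs_queue: deque = deque(maxlen=5)
for t in range(5):
    obs_queue.append(torch.full((84, 84), float(t)))

folded = torch.stack(tuple(obs_queue))  # shape (5, 84, 84)
state = folded[:4]        # the four older frames form the current state
next_state = folded[1:]   # the four newer frames form the next state
print(state.shape, next_state.shape)  # torch.Size([4, 84, 84]) twice
```

Storing one folded stack per transition avoids keeping two nearly identical 4-frame stacks in memory.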
### 2.2 utils_env.py

This file defines the class `MyEnv`, which wraps the environment of the Breakout game. Its major functions are listed below.

1. `reset()`

```python
def reset(
    self,
    render: bool = False,
) -> Tuple[List[TensorObs], float, List[GymImg]]:
    """reset resets and initializes the underlying gym environment."""
    self.__env.reset()
    init_reward = 0.
    observations = []
    frames = []
    for _ in range(5):  # no-op
        obs, reward, done = self.step(0)
        observations.append(obs)
        init_reward += reward
        if done:
            return self.reset(render)
        if render:
            frames.append(self.get_frame())

    return observations, init_reward, frames
```

This function resets and initializes the underlying gym environment. After five no-op steps, it returns the observations, the accumulated initial reward, and the captured frames if rendering is enabled.

2. `step()`

```python
def step(self, action: int) -> Tuple[TensorObs, int, bool]:
    """step forwards an action to the environment and returns the newest
    observation, the reward, and a bool value indicating whether the
    episode is terminated."""
    action = action + 1 if not action == 0 else 0
    obs, reward, done, _ = self.__env.step(action)
    return self.to_tensor(obs), reward, done
```

This function forwards an action to the environment and returns the newest observation, the reward, and a bool value indicating whether the episode has terminated. A sketch of the action remapping on the first line is given after this section.

3. `evaluate()`

```python
def evaluate(
        self,
        obs_queue: deque,
        agent: Agent,
        num_episode: int = 3,
        render: bool = False,
) -> Tuple[
        float,
        List[GymImg],
]:
    """evaluate uses the given agent to run the game for a few episodes and
    returns the average reward and the captured frames."""
    self.__env = self.__env_eval
    ep_rewards = []
    frames = []
    for _ in range(self.get_eval_lives() * num_episode):
        observations, ep_reward, _frames = self.reset(render=render)
        for obs in observations:
            obs_queue.append(obs)
        if render:
            frames.extend(_frames)
        done = False

        while not done:
            state = self.make_state(obs_queue).to(self.__device).float()
            action = agent.run(state, testing=True)
            obs, reward, done = self.step(action)

            ep_reward += reward
            obs_queue.append(obs)
            if render:
                frames.append(self.get_frame())

        ep_rewards.append(ep_reward)

    self.__env = self.__env_train
    return np.sum(ep_rewards) / num_episode, frames
```

This function evaluates the performance of the agent by letting it play the game for several episodes and returns the average reward and the captured frames.
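The line `action = action + 1 if not action == 0 else 0` in `step()` maps the agent's reduced action space onto the underlying Atari action set. A minimal sketch of that mapping is shown below; the assumption that the ALE Breakout action set is `[NOOP, FIRE, RIGHT, LEFT]` and that FIRE is deliberately skipped is ours, not something stated in the repository.

```python
# Hypothetical illustration of the remapping in MyEnv.step(), assuming the
# underlying ALE Breakout action set is [NOOP, FIRE, RIGHT, LEFT].
ALE_ACTIONS = ["NOOP", "FIRE", "RIGHT", "LEFT"]


def remap(agent_action: int) -> int:
    """Map the agent's 3-action space {0, 1, 2} to ALE indices, skipping FIRE."""
    return agent_action + 1 if agent_action != 0 else 0


for a in range(3):
    print(a, "->", ALE_ACTIONS[remap(a)])
# 0 -> NOOP, 1 -> RIGHT, 2 -> LEFT
```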
### 2.3 utils_drl.py

This file defines the class `Agent`, which has four major functions.

1. `run()`

```python
def run(self, state: TensorStack4, training: bool = False,
        testing: bool = False) -> int:
    """run suggests an action for the given state."""
    if training:
        self.__eps -= \
            (self.__eps_start - self.__eps_final) / self.__eps_decay
        self.__eps = max(self.__eps, self.__eps_final)

    if testing or self.__r.random() > self.__eps:
        with torch.no_grad():
            return self.__policy(state).max(1).indices.item()
    return self.__r.randint(0, self.__action_dim - 1)
```

The function `run()` suggests an action for the given state according to the epsilon-greedy policy; the agent's `__eps` parameter decays linearly at each training step until it reaches its final value.

2. `learn()`

```python
def learn(self, memory: ReplayMemory, batch_size: int) -> float:
    """learn trains the value network via TD-learning."""
    state_batch, action_batch, reward_batch, next_batch, done_batch = \
        memory.sample(batch_size)

    values = self.__policy(state_batch.float()).gather(1, action_batch)
    values_next = self.__target(next_batch.float()).max(1).values.detach()
    expected = (self.__gamma * values_next.unsqueeze(1)) * \
        (1. - done_batch) + reward_batch
    loss = F.smooth_l1_loss(values, expected)

    self.__optimizer.zero_grad()
    loss.backward()
    for param in self.__policy.parameters():
        param.grad.data.clamp_(-1, 1)
    self.__optimizer.step()

    return loss.item()
```

This function samples a batch of transitions from the replay memory, consisting of states, actions, rewards, next states, and done flags. Feeding `state_batch` into the policy network yields `values`; feeding `next_batch` into the target network yields `values_next`, from which the TD target `expected` is computed. The smooth L1 loss between `values` and `expected` is then backpropagated, with gradients clipped to [-1, 1], to update the parameters of the policy network.

3. `sync()`

```python
def sync(self) -> None:
    """sync synchronizes the weights from the policy network to the target
    network."""
    self.__target.load_state_dict(self.__policy.state_dict())
```

This function synchronizes the weights from the policy network to the target network.

4. `save()`

```python
def save(self, path: str) -> None:
    """save saves the state dict of the policy network."""
    torch.save(self.__policy.state_dict(), path)
```

This function simply saves the state dict of the policy network.

### 2.4 utils_memory.py

The class `ReplayMemory` implements the replay memory pool with a simple deque structure: `push()` stores the agent's experiences and `sample()` returns a random batch of transitions.

### 2.5 utils_model.py

This file defines the class `DQN`, which implements the deep Q-network with `torch`: three convolutional layers followed by two fully connected layers, with ReLU activations in between.
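For reference, below is a minimal PyTorch sketch of the architecture described above. The layer sizes follow the standard Nature DQN network (8x8/4, 4x4/2, 3x3/1 convolutions and a 512-unit hidden layer); the exact hyperparameters in `utils_model.py` may differ, so treat this as an illustration rather than the repository's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DQNSketch(nn.Module):
    """Three conv layers followed by two fully connected layers, with ReLU
    in between, mapping a stack of four 84x84 frames to one Q-value per action."""

    def __init__(self, action_dim: int):
        super().__init__()
        self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc1 = nn.Linear(64 * 7 * 7, 512)
        self.fc2 = nn.Linear(512, action_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.fc1(x.flatten(start_dim=1)))
        return self.fc2(x)


# Example: a batch of 32 stacked states -> Q-values for each action.
q = DQNSketch(action_dim=3)(torch.zeros(32, 4, 84, 84))
print(q.shape)  # torch.Size([32, 3])
```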
## 3 Our Improvements

In this experiment, we use Double DQN to improve the performance. [Here](https://gitee.com/jenduke/sysu-rl-mid-term-project.git) is our implementation.

Q-learning uses $\max_{a} Q(s_{t+1}, a)$ to update action values, which leads to "maximization bias", i.e. overestimated action values. The Double DQN algorithm is designed to solve this overestimation problem in DQN: because of the maximization operation in Q-learning, the value function estimated by the network tends to be larger than the true value function, which introduces a large bias into the final model. Double DQN (DDQN) removes the overestimation by decoupling the selection of the action used in the target Q-value from the evaluation of that action's Q-value.

Compared with DQN, DDQN only changes the way the target value is computed, specifically to address DQN's overestimation; in all other respects DDQN is identical to DQN. The difference lies in how the target Q-value is calculated.

DQN:

![1](pic/1.png)

* The next state is fed directly into the target (old) network and the maximum Q-value is taken: $y_t = r_t + \gamma \max_{a} Q(s_{t+1}, a; \theta^-)$.

Double DQN:

![2](pic/2.png)

* A second network is used to prevent overestimation.
* The next state is fed into the policy (new) network to find the action with the maximum Q-value; that action is then used as the index into the Q-values produced by the target (old) network: $y_t = r_t + \gamma\, Q\big(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta); \theta^-\big)$, where $\theta$ are the policy network parameters and $\theta^-$ are the target network parameters.

As shown above, Double DQN requires two action-value functions, one to select the action and the other to evaluate it. Since the DQN algorithm already maintains two networks, an evaluation (policy) network and a target network, DDQN simply uses the evaluation network to select the action and the target network to evaluate it when computing the return, instead of building a new network. The algorithm is as follows:

![3](pic/3.png)

```python
def learn(self, memory: ReplayMemory, batch_size: int) -> float:
    """learn trains the value network via TD-learning."""
    state_batch, action_batch, reward_batch, next_batch, done_batch = \
        memory.sample(batch_size)

    # Double DQN
    values = self.__policy(state_batch.float()).gather(1, action_batch)
    # select the next action with the policy network ...
    max_batch = self.__policy(next_batch.float()).argmax(1).unsqueeze(1)
    # ... but evaluate it with the target network
    values_next = self.__target(next_batch.float()).gather(1, max_batch)
    expected = (self.__gamma * values_next) * \
        (1. - done_batch) + reward_batch
    loss = F.smooth_l1_loss(values, expected)

    self.__optimizer.zero_grad()
    loss.backward()
    for param in self.__policy.parameters():
        param.grad.data.clamp_(-1, 1)
    self.__optimizer.step()

    return loss.item()
```

## 4 Experiment Results

After more than 30,000,000 training steps, our best Double DQN model reached a reward of up to 431.

![result](result.png)

We also compared the rewards during the training process.

![comparision](comparision.png)

We can see that Double DQN performs better most of the time once training has run long enough.

## 5 Summary

Here is our authorship matrix.

| Member | Ideas (%) | Coding (%) | Writing (%) |
| --- | --- | --- | --- |
| 王俊艺 | 50% | 40% | 60% |
| 许展浩 | 50% | 60% | 40% |