# Reinforcement Learning and Game Theory Midterm Project

| student number | name |
| --- | --- |
| 20337117 | 王俊艺 |
| 20337138 | 许展浩 |

## 1 Introduction

### 1.1 Breakout Game Description

In a Breakout game:

* The player is given a paddle that can be moved horizontally.
* At the beginning of each turn, a ball drops automatically from somewhere on the screen.
* The paddle can be used to bounce the ball back.
* There are layers of bricks in the upper part of the screen.
* The player is rewarded for destroying as many bricks as possible by hitting them with the bouncing ball.
* The player is given 5 turns in each game.

### 1.2 Play the Breakout Game with DQN

The algorithm used is Nature DQN.

![DQN](DQN.png)

## 2 Details of the Base Implementation

[Here](https://gitee.com/goluke/dqn-breakout) is the base implementation.

### 2.1 main.py

Several parameters are defined at the top of this file.

```python
GAMMA = 0.99  # discount factor
GLOBAL_SEED = 0
MEM_SIZE = 100_000  # total memory size
RENDER = False
SAVE_PREFIX = "./models"
STACK_SIZE = 4

# parameters of the epsilon-greedy policy
EPS_START = 1.
EPS_END = 0.1
EPS_DECAY = 1000000

BATCH_SIZE = 32
POLICY_UPDATE = 4  # the frequency of updating the policy network
TARGET_UPDATE = 10_000  # the frequency of updating the target network
WARM_STEPS = 50_000  # the number of warm-up steps
MAX_STEPS = 50_000_000  # total learning steps
EVALUATE_FREQ = 100_000  # the frequency of evaluating the reward
```

The three instances below are created from classes defined in `utils_env.py`, `utils_drl.py` and `utils_memory.py`, respectively.

```python
env = MyEnv(device)
agent = Agent(
    env.get_action_dim(),
    device,
    GAMMA,
    new_seed(),
    EPS_START,
    EPS_END,
    EPS_DECAY,
)
memory = ReplayMemory(STACK_SIZE + 1, MEM_SIZE, device)
```

Then comes the main training loop.

```python
#### Training ####
obs_queue: deque = deque(maxlen=5)
done = True

progressive = tqdm(range(MAX_STEPS), total=MAX_STEPS,
                   ncols=50, leave=False, unit="b")
for step in progressive:
    if done:
        observations, _, _ = env.reset()
        for obs in observations:
            obs_queue.append(obs)

    training = len(memory) > WARM_STEPS
    state = env.make_state(obs_queue).to(device).float()
    action = agent.run(state, training)
    obs, reward, done = env.step(action)
    obs_queue.append(obs)
    memory.push(env.make_folded_state(obs_queue), action, reward, done)

    if step % POLICY_UPDATE == 0 and training:
        agent.learn(memory, BATCH_SIZE)

    if step % TARGET_UPDATE == 0:
        agent.sync()

    if step % EVALUATE_FREQ == 0:
        avg_reward, frames = env.evaluate(obs_queue, agent, render=RENDER)
        with open("rewards.txt", "a") as fp:
            fp.write(f"{step//EVALUATE_FREQ:3d} {step:8d} {avg_reward:.1f}\n")
        if RENDER:
            prefix = f"eval_{step//EVALUATE_FREQ:03d}"
            os.mkdir(prefix)
            for ind, frame in enumerate(frames):
                with open(os.path.join(prefix, f"{ind:06d}.png"), "wb") as fp:
                    frame.save(fp, format="png")
        agent.save(os.path.join(
            SAVE_PREFIX, f"model_{step//EVALUATE_FREQ:03d}"))
        done = True
```

1. Check whether the game is done. If so, the environment is reset and returns observations that are pushed into the observation queue.
2. The agent then chooses an action according to the current state. After executing the action we get the newest observation, the reward, and a bool value indicating whether the episode has terminated.
3. Append the observation to the observation queue and push the folded state, action, reward, and done flag into the replay memory (see the sketch after this list).
4. Update the policy network and the target network at their respective intervals (`POLICY_UPDATE` and `TARGET_UPDATE`).
5. When it is time to evaluate the agent, compute and save the average reward obtained by letting the agent play the game several times. Render the game frames if needed, and save the agent's network weights.
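The observation queue holds five frames while the network consumes only four because the replay memory stores a *folded* state of `STACK_SIZE + 1` consecutive frames, from which both the current state and the next state can be recovered. Below is a minimal sketch of that idea; the tensor shapes and the exact slicing performed inside `utils_memory.py` are our assumptions for illustration, not the repository's actual code.

```python
from collections import deque

import torch

# Five consecutive (dummy) 84x84 frames, as held by obs_queue.
obs_queue: deque = deque(maxlen=5)
for t in range(5):
    obs_queue.append(torch.full((84, 84), float(t)))

folded = torch.stack(tuple(obs_queue))  # shape (5, 84, 84)
state = folded[:4]        # the four older frames form the current state
next_state = folded[1:]   # the four newer frames form the next state
print(state.shape, next_state.shape)  # torch.Size([4, 84, 84]) twice
```

Storing one folded stack per transition avoids keeping two nearly identical 4-frame stacks in memory.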
### 2.2 utils_env.py

This file defines the class `MyEnv`, which wraps the environment of the Breakout game. Its major functions are listed below.

1. `reset()`

```python
def reset(
    self,
    render: bool = False,
) -> Tuple[List[TensorObs], float, List[GymImg]]:
    """reset resets and initializes the underlying gym environment."""
    self.__env.reset()
    init_reward = 0.
    observations = []
    frames = []
    for _ in range(5):  # no-op
        obs, reward, done = self.step(0)
        observations.append(obs)
        init_reward += reward
        if done:
            return self.reset(render)
        if render:
            frames.append(self.get_frame())

    return observations, init_reward, frames
```

This function resets and initializes the underlying gym environment. After five no-op steps, it returns the observations, the accumulated initial reward, and the captured frames if rendering is enabled.

2. `step()`

```python
def step(self, action: int) -> Tuple[TensorObs, int, bool]:
    """step forwards an action to the environment and returns the newest
    observation, the reward, and a bool value indicating whether the
    episode is terminated."""
    action = action + 1 if not action == 0 else 0
    obs, reward, done, _ = self.__env.step(action)
    return self.to_tensor(obs), reward, done
```

This function forwards an action to the environment and returns the newest observation, the reward, and a bool value indicating whether the episode has terminated. A sketch of the action remapping on the first line is given after this section.

3. `evaluate()`

```python
def evaluate(
        self,
        obs_queue: deque,
        agent: Agent,
        num_episode: int = 3,
        render: bool = False,
) -> Tuple[
        float,
        List[GymImg],
]:
    """evaluate uses the given agent to run the game for a few episodes and
    returns the average reward and the captured frames."""
    self.__env = self.__env_eval
    ep_rewards = []
    frames = []
    for _ in range(self.get_eval_lives() * num_episode):
        observations, ep_reward, _frames = self.reset(render=render)
        for obs in observations:
            obs_queue.append(obs)
        if render:
            frames.extend(_frames)
        done = False

        while not done:
            state = self.make_state(obs_queue).to(self.__device).float()
            action = agent.run(state, testing=True)
            obs, reward, done = self.step(action)

            ep_reward += reward
            obs_queue.append(obs)
            if render:
                frames.append(self.get_frame())

        ep_rewards.append(ep_reward)

    self.__env = self.__env_train
    return np.sum(ep_rewards) / num_episode, frames
```

This function evaluates the performance of the agent by letting it play the game for several episodes and returns the average reward and the captured frames.
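The line `action = action + 1 if not action == 0 else 0` in `step()` maps the agent's reduced action space onto the underlying Atari action set. A minimal sketch of that mapping is shown below; the assumption that the ALE Breakout action set is `[NOOP, FIRE, RIGHT, LEFT]` and that FIRE is deliberately skipped is ours, not something stated in the repository.

```python
# Hypothetical illustration of the remapping in MyEnv.step(), assuming the
# underlying ALE Breakout action set is [NOOP, FIRE, RIGHT, LEFT].
ALE_ACTIONS = ["NOOP", "FIRE", "RIGHT", "LEFT"]


def remap(agent_action: int) -> int:
    """Map the agent's 3-action space {0, 1, 2} to ALE indices, skipping FIRE."""
    return agent_action + 1 if agent_action != 0 else 0


for a in range(3):
    print(a, "->", ALE_ACTIONS[remap(a)])
# 0 -> NOOP, 1 -> RIGHT, 2 -> LEFT
```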
### 2.3 utils_drl.py

This file defines the class `Agent`, which has four major functions.

1. `run()`

```python
def run(self, state: TensorStack4, training: bool = False,
        testing: bool = False) -> int:
    """run suggests an action for the given state."""
    if training:
        self.__eps -= \
            (self.__eps_start - self.__eps_final) / self.__eps_decay
        self.__eps = max(self.__eps, self.__eps_final)

    if testing or self.__r.random() > self.__eps:
        with torch.no_grad():
            return self.__policy(state).max(1).indices.item()
    return self.__r.randint(0, self.__action_dim - 1)
```

The function `run()` suggests an action for the given state according to the epsilon-greedy policy; the agent's `__eps` parameter decays linearly at each training step until it reaches its final value.

2. `learn()`

```python
def learn(self, memory: ReplayMemory, batch_size: int) -> float:
    """learn trains the value network via TD-learning."""
    state_batch, action_batch, reward_batch, next_batch, done_batch = \
        memory.sample(batch_size)

    values = self.__policy(state_batch.float()).gather(1, action_batch)
    values_next = self.__target(next_batch.float()).max(1).values.detach()
    expected = (self.__gamma * values_next.unsqueeze(1)) * \
        (1. - done_batch) + reward_batch
    loss = F.smooth_l1_loss(values, expected)

    self.__optimizer.zero_grad()
    loss.backward()
    for param in self.__policy.parameters():
        param.grad.data.clamp_(-1, 1)
    self.__optimizer.step()

    return loss.item()
```

This function samples a batch of transitions from the replay memory, consisting of states, actions, rewards, next states, and done flags. Feeding `state_batch` into the policy network yields `values`; feeding `next_batch` into the target network yields `values_next`, from which the TD target `expected` is computed. The smooth L1 loss between `values` and `expected` is then backpropagated, with gradients clipped to [-1, 1], to update the parameters of the policy network.

3. `sync()`

```python
def sync(self) -> None:
    """sync synchronizes the weights from the policy network to the target
    network."""
    self.__target.load_state_dict(self.__policy.state_dict())
```

This function synchronizes the weights from the policy network to the target network.

4. `save()`

```python
def save(self, path: str) -> None:
    """save saves the state dict of the policy network."""
    torch.save(self.__policy.state_dict(), path)
```

This function simply saves the state dict of the policy network.

### 2.4 utils_memory.py

The class `ReplayMemory` implements the replay memory pool with a simple deque structure: `push()` stores the agent's experiences and `sample()` returns a random batch of transitions.

### 2.5 utils_model.py

This file defines the class `DQN`, which implements the deep Q-network with `torch`: three convolutional layers followed by two fully connected layers, with ReLU activations in between.
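For reference, below is a minimal PyTorch sketch of the architecture described above. The layer sizes follow the standard Nature DQN network (8x8/4, 4x4/2, 3x3/1 convolutions and a 512-unit hidden layer); the exact hyperparameters in `utils_model.py` may differ, so treat this as an illustration rather than the repository's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DQNSketch(nn.Module):
    """Three conv layers followed by two fully connected layers, with ReLU
    in between, mapping a stack of four 84x84 frames to one Q-value per action."""

    def __init__(self, action_dim: int):
        super().__init__()
        self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc1 = nn.Linear(64 * 7 * 7, 512)
        self.fc2 = nn.Linear(512, action_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.fc1(x.flatten(start_dim=1)))
        return self.fc2(x)


# Example: a batch of 32 stacked states -> Q-values for each action.
q = DQNSketch(action_dim=3)(torch.zeros(32, 4, 84, 84))
print(q.shape)  # torch.Size([32, 3])
```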
## 3 Our Improvements

In this experiment, we use Double DQN to improve the performance. [Here](https://gitee.com/jenduke/sysu-rl-mid-term-project.git) is our implementation.

Q-learning uses $\max_{a} Q(s_{t+1}, a)$ to update action values, which leads to "maximization bias", i.e. overestimated action values. The Double DQN algorithm is designed to solve this overestimation problem in DQN: because of the maximization operation in Q-learning, the value function estimated by the network tends to be larger than the true value function, which introduces a large bias into the final model. Double DQN (DDQN) removes the overestimation by decoupling the selection of the action used in the target Q-value from the evaluation of that action's Q-value.

Compared with DQN, DDQN only changes the way the target value is computed, specifically to address DQN's overestimation; in all other respects DDQN is identical to DQN. The difference lies in how the target Q-value is calculated.

DQN:

![1](pic/1.png)

* The next state is fed directly into the target (old) network and the maximum Q-value is taken: $y_t = r_t + \gamma \max_{a} Q(s_{t+1}, a; \theta^-)$.

Double DQN:

![2](pic/2.png)

* A second network is used to prevent overestimation.
* The next state is fed into the policy (new) network to find the action with the maximum Q-value; that action is then used as the index into the Q-values produced by the target (old) network: $y_t = r_t + \gamma\, Q\big(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta); \theta^-\big)$, where $\theta$ are the policy network parameters and $\theta^-$ are the target network parameters.

As shown above, Double DQN requires two action-value functions, one to select the action and the other to evaluate it. Since the DQN algorithm already maintains two networks, an evaluation (policy) network and a target network, DDQN simply uses the evaluation network to select the action and the target network to evaluate it when computing the return, instead of building a new network. The algorithm is as follows:

![3](pic/3.png)

```python
def learn(self, memory: ReplayMemory, batch_size: int) -> float:
    """learn trains the value network via TD-learning."""
    state_batch, action_batch, reward_batch, next_batch, done_batch = \
        memory.sample(batch_size)

    # Double DQN
    values = self.__policy(state_batch.float()).gather(1, action_batch)
    # select the next action with the policy network ...
    max_batch = self.__policy(next_batch.float()).argmax(1).unsqueeze(1)
    # ... but evaluate it with the target network
    values_next = self.__target(next_batch.float()).gather(1, max_batch)
    expected = (self.__gamma * values_next) * \
        (1. - done_batch) + reward_batch
    loss = F.smooth_l1_loss(values, expected)

    self.__optimizer.zero_grad()
    loss.backward()
    for param in self.__policy.parameters():
        param.grad.data.clamp_(-1, 1)
    self.__optimizer.step()

    return loss.item()
```

## 4 Experiment Results

After more than 30,000,000 training steps, our best Double DQN model reached a reward of up to 431.

![result](result.png)

We also compared the rewards during the training process.

![comparision](comparision.png)

We can see that Double DQN performs better most of the time once training has run long enough.

## 5 Summary

Here is our authorship matrix.

| Member | Ideas (%) | Coding (%) | Writing (%) |
| --- | --- | --- | --- |
| 王俊艺 | 50% | 40% | 60% |
| 许展浩 | 50% | 60% | 40% |