# spartanbin_chess

**Repository Path**: xush_2020/spartanbin_chess

## Basic Information

- **Project Name**: spartanbin_chess
- **Description**: Teach a machine to play chess with two reinforcement learning methods, DQN and PPO.
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2022-08-04
- **Last Updated**: 2022-08-04

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Reinforcement Learning chess

Use reinforcement learning optimization algorithms to let the machine learn to play chess.

## Abstract

This project uses two reinforcement learning methods, DQN and PPO, to teach the machine how to play chess. The deep learning model, action design, and feature engineering are almost the same as in David Silver's [AlphaZero](https://www.science.org/doi/10.1126/science.aar6404). However, this project does not use Monte Carlo tree search; instead, we design a reward function to let the machine learn how to play chess. In addition:

- Thanks to [gym-chess](https://github.com/iamlucaswolf/gym-chess), a good chess simulation environment that already implements the AlphaZero state feature engineering.
- Thanks to [stable-baselines3](https://github.com/DLR-RM/stable-baselines3), which provided a lot of inspiration and reference for the PPO implementation.
- The goal of this project is to beat [stockfish 1](https://stockfishchess.org/); so far this has not been achieved.

## Requirements

- Win 10 or Win 11 (the bundled stockfish engine is the Windows build; swapping in the Linux build makes the project run on Linux as well)
- numpy
- torch
- gym==0.17.3
- gym-chess==0.1.1
- stockfish==3.17.0

## Implemented Algorithms

| **Name** | Deep Convolutional Network | Learn from stockfish | Self-play learning | Remarks |
| -------- | -------------------------- | -------------------- | ------------------ | ------- |
| DQN      | :heavy_check_mark:         | :heavy_check_mark:   | :heavy_check_mark: | Training was successful. Can defeat rookie human players. |
| PPO      | :heavy_check_mark:         | :heavy_check_mark:   | :heavy_check_mark: | Training was successful. Not yet tested against human players. |

## Example

**DQN**

```python
from policies.DQN.DQN_algorithm import DQN_chess

# Run DQN for chess
rl_chess = DQN_chess(
    device='cuda',
)

# Load saved params
rl_chess.load_params()

# Sample and learn
rl_chess.learn(
    num_episodes=100000,
    method='by_teacher',
    batch_size=1024,
)

# Sample and save experiences from the teacher (stockfish)
rl_chess.sampling_and_saving_teacher_experiences(
    num_of_2000_episodes=50,
)

# Learn from saved experiences
# (assumes project_path points to the repository root directory)
rl_chess.learning_from_saving_teacher_experiences(
    teacher_experiences_path=project_path + '/teacher_experiences/'
)
```

**PPO**

```python
from policies.PPO.PPO_algorithm import PPO_chess

# Run PPO for chess
rl_chess = PPO_chess(
    learning_rate=0.0001,
    batch_size=2048,
    n_epochs=10,
    gamma=1,
    gae_lambda=0.95,
    clip_range=0.2,
    clip_range_vf=None,
    ent_coef=0.1,
    vf_coef=0.5,
    max_grad_norm=0.5,
    seed=4000,
    device="cuda",
)

# Sample and learn
rl_chess.learn(
    num_episodes=100000,
    num_episodes_to_train=100,
    method='by_teacher',
)
```

## Filespec

- [game_with_man.ipynb](./jupyter_notebook/game_with_man.ipynb): DQN plays a game against a human (you).
- [policies](./policies/): RL policy models and the stockfish program.
- [simulation_environment.py](./simulation_environment.py): The environment used to play chess and train the RL policy models.
- [main.py](./main.py): Run this script to train the RL policy models.
- DQN_params.pickle: Parameters of the saved DQN model, saved with Python pickle; decompressed from [DQN_params.zip](DQN_params.zip).
- PPO_params.pickle: Parameters of the saved PPO model, saved with Python pickle; decompressed from [PPO_params.zip](PPO_params.zip).

## Method

**Reward** (a minimal code sketch of this reward table appears at the end of this README)

- Capture the Queen: +5
- Capture a Bishop or Knight: +3
- Capture a Rook: +2
- Capture a Pawn: +1
- Win the game: +100
- Draw: 0
- Lose the Queen: -5
- Lose a Bishop or Knight: -3
- Lose a Rook: -2 (I find that stockfish rarely uses the Rook, so I give it only 2 points.)
- Lose a Pawn: -1
- Lose the game: -100

**Training data**

Training data for the RL agents (DQN and PPO) is currently collected from 100,000 stockfish vs. stockfish games.

## Results

**DQN**

Can beat everyone around me (all new to chess), but cannot beat stockfish 1. Increasing the amount of training and data, and then adding more self-play learning, may eventually defeat stockfish 1.

**PPO**

- As training continues, the parameters of the neural network may contain null values, which causes errors; the cause is unknown.
- Not yet tested against humans or against the DQN agent.
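## Reward function sketch

For illustration, the per-move reward scheme listed in the Method section can be written as a simple lookup. This is a minimal, hypothetical sketch: the function name `move_reward`, the piece-name strings, and the `outcome` argument are illustrative and are not taken from this repository's code.

```python
# Hypothetical sketch of the reward scheme from the Method section.
# Piece values: Queen 5, Bishop/Knight 3, Rook 2, Pawn 1;
# terminal rewards: win +100, draw 0, loss -100.

PIECE_VALUE = {
    'queen': 5,
    'bishop': 3,
    'knight': 3,
    'rook': 2,   # deliberately low: stockfish rarely uses the Rook
    'pawn': 1,
}

TERMINAL_REWARD = {'win': 100, 'draw': 0, 'loss': -100}


def move_reward(captured=None, lost=None, outcome=None):
    """Reward for one move: plus the value of the opponent piece captured,
    minus the value of the own piece lost, plus a terminal bonus if the game ended."""
    reward = 0
    if captured is not None:
        reward += PIECE_VALUE[captured]
    if lost is not None:
        reward -= PIECE_VALUE[lost]
    if outcome is not None:
        reward += TERMINAL_REWARD[outcome]
    return reward


# Example: capturing a bishop while losing a pawn on the same move -> +2
assert move_reward(captured='bishop', lost='pawn') == 2
```

In the actual training loop the captured and lost pieces would be derived from the board state returned by the simulation environment after each move; the sketch above only encodes the point values listed in the Method section.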