# RL

Reinforcement learning

- Notes from self-studying reinforcement learning
- GitHub Pages: [https://scchy.github.io/RL](https://scchy.github.io/RL)
- Practical reinforcement learning projects
- The reinforcement learning framework lives in the [src directory](./src/)

# Framework Overview

Directory layout:

```text
src
├── LLMRL
│   ├── DataWhale-R1.yaml
│   ├── train_DataWhale-R1.py
│   └── train_DataWhale-R1.sh
├── RAGRL
│   ├── api_chat_test.py
│   ├── data
│   ├── rag.md
│   ├── rag_utils.py
│   └── simple_rl_rag.py
├── requestment.txt
├── RLAlgo
│   ├── A2C.py
│   ├── _base_net.py
│   ├── batchRL
│   │   └── cql.py
│   ├── DDPG.py
│   ├── DQN.py
│   ├── grad_ana.py
│   ├── ICM.py
│   ├── PPO2_old.py
│   ├── PPO2.py
│   ├── PPO.py
│   ├── SAC.py
│   ├── SoftQNew.py
│   ├── SoftQ.py
│   └── TD3.py
├── RLUtils
│   ├── batchRL
│   │   ├── trainer.py
│   │   └── utils.py
│   ├── config.py
│   ├── env_wrapper.py
│   ├── __init__.py
│   ├── memory.py
│   ├── state_util.py
│   └── trainer.py
├── setup.py
├── test
│   ├── border_detector.py
│   ├── README.md
│   ├── test_ac.py
│   ├── test_cql.py
│   ├── test_ddpg.py
│   ├── test_dqn.py
│   ├── test_env_explore.ipynb
│   ├── test_models
│   ├── test_ppo_atari.py
│   ├── test_ppo_new.py
│   ├── test_ppo.py
│   ├── test_sac.py
│   ├── test_softQ.py
│   ├── test_TD3.py
│   └── wandb
└── TODO.md
```

## Requirements

Core packages:

| Package | Version |
|--|--|
| Python | 3.10 |
| torch | 2.1.1 |
| torchvision | 0.16.1 |
| gymnasium | 0.29.1 |
| cloudpickle | 2.2.1 |

A quick way to verify these versions is sketched right after the usage example below.

## Usage Example

```python
import os

import gymnasium as gym

from RLAlgo.PPO2 import PPO2
from RLUtils import train_on_policy, random_play, play, Config, gym_env_desc

env_name = 'Hopper-v4'
gym_env_desc(env_name)
path_ = os.path.dirname(__file__)
env = gym.make(
    env_name,
    exclude_current_positions_from_observation=True,
    # healthy_reward=0
)
cfg = Config(
    env,
    # environment
    save_path=os.path.join(path_, "test_models", 'PPO_Hopper-v4_test2'),
    seed=42,
    # network sizes
    actor_hidden_layers_dim=[256, 256, 256],
    critic_hidden_layers_dim=[256, 256, 256],
    # agent
    actor_lr=1.5e-4,
    critic_lr=5.5e-4,
    gamma=0.99,
    # training
    num_episode=12500,
    off_buffer_size=512,
    off_minimal_size=510,
    max_episode_steps=500,
    PPO_kwargs={
        'lmbda': 0.9,
        'eps': 0.25,
        'k_epochs': 4,
        'sgd_batch_size': 128,
        'minibatch_size': 12,
        'actor_bound': 1,
        'dist_type': 'beta'
    }
)
agent = PPO2(
    state_dim=cfg.state_dim,
    actor_hidden_layers_dim=cfg.actor_hidden_layers_dim,
    critic_hidden_layers_dim=cfg.critic_hidden_layers_dim,
    action_dim=cfg.action_dim,
    actor_lr=cfg.actor_lr,
    critic_lr=cfg.critic_lr,
    gamma=cfg.gamma,
    PPO_kwargs=cfg.PPO_kwargs,
    device=cfg.device,
    reward_func=None
)
agent.train()
train_on_policy(env, agent, cfg, wandb_flag=False, train_without_seed=True,
                test_ep_freq=1000, online_collect_nums=cfg.off_buffer_size,
                test_episode_count=5)
agent.load_model(cfg.save_path)
agent.eval()
env_ = gym.make(
    env_name,
    exclude_current_positions_from_observation=True,
    # render_mode='human'
)
play(env_, agent, cfg, episode_count=3, play_without_seed=True, render=False)
```
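Before running the example, it can help to confirm that the installed packages match the requirements table. Below is a minimal sketch using only the Python standard library; the `EXPECTED` mapping is copied from the table above and is not something this repo ships:

```python
# Hypothetical version check -- not part of this repo. Compares installed
# core packages against the versions in the requirements table.
from importlib.metadata import PackageNotFoundError, version

EXPECTED = {
    "torch": "2.1.1",
    "torchvision": "0.16.1",
    "gymnasium": "0.29.1",
    "cloudpickle": "2.2.1",
}

for pkg, want in EXPECTED.items():
    try:
        got = version(pkg)
    except PackageNotFoundError:
        got = "not installed"
    status = "OK" if got == want else f"expected {want}"
    print(f"{pkg}: {got} ({status})")
```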
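Note that `Hopper-v4` also needs a MuJoCo backend, which the requirements table does not list; installing the `gymnasium[mujoco]` extra is one way to get it, but treat that as an assumption about your setup. A framework-free smoke test of the example environment:

```python
# Framework-free smoke test for the example environment. Requires the MuJoCo
# backend (e.g. `pip install "gymnasium[mujoco]"`), which is not listed in
# the requirements table above.
import gymnasium as gym

env = gym.make('Hopper-v4', exclude_current_positions_from_observation=True)
obs, info = env.reset(seed=42)
print('state shape :', obs.shape)               # (11,), as in the table below
print('action shape:', env.action_space.shape)  # (3,)

# A few random steps; gymnasium's step() returns a 5-tuple.
for _ in range(5):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```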
## Training Results

| Environment (state, action space) | Config & test function | Result |
|-|-|-|
| Hopper-v4 (state: (11,), action: (3,), continuous in [-1.0, 1.0]) | [Hopper_v4_ppo2_test](./src/test/test_ppo.py) | ![PPO2_Hopper-v4](./docs/pic/PPO2_Hopper-v4.gif) |
| Humanoid-v4 (state: (376,), action: (17,), continuous in [-0.4, 0.4]) | [Humanoid_v4_ppo2_test](./src/test/test_ppo.py) | ![PPO2_Humanoid-v4](./docs/pic/PPO2_Humanoid-v4-simple.gif) |
| ALE/DemonAttack-v5 (state: (210, 160, 3), action: 6, discrete) | [DemonAttack_v5_ppo2_test](./src/test/test_ppo_atari.py) | ![PPO2_DemonAttack_v5](./docs/pic/PPO2_DemonAttack_v5.gif) |
| ALE/AirRaid-v5 (state: (250, 160, 3), action: 6, discrete) | [AirRaid_v5_ppo2_test](./src/test/test_ppo_atari.py) | ![PPO2_AirRaid_v5](./docs/pic/PPO2_AirRaid_v5.gif) |
| ALE/Alien-v5 (state: (210, 160, 3), action: 18, discrete) | [Alien_v5_ppo2_test](./src/test/test_ppo_atari.py) | ![PPO2_Alien_v5](./docs/pic/PPO2_Alien_v5.gif) |
| Walker2d-v4 (state: (17,), action: (6,), continuous in [-1.0, 1.0]) | [Walker2d_v4_ppo2_test](./src/test/test_ppo.py) | ![PPO2_Walker2d_v4](./docs/pic/PPO2_Walker2d_v4.gif) |
| HumanoidStandup-v4 (state: (376,), action: (17,), continuous in [-0.4, 0.4]) | [HumanoidStandup_v4_ppo2_test](./src/test/test_ppo.py) | ![PPO2_HumanoidStandup_v4](./docs/pic/PPO2_HumanoidStandup_v4.gif) |
| CartPole-v1 (state: (4,), action: 2, discrete) | [duelingDQN: dqn_test](./src/test/test_dqn.py) | ![duelingDQN_CartPole](./docs/pic/duelingDQN_CartPole-v1.gif) |
| MountainCar-v0 (state: (2,), action: 3, discrete) | [duelingDQN: dqn_test](./src/test/test_dqn.py) | ![duelingDQN_MountainCar](./docs/pic/duelingDQN_MountainCar-v0.gif) |
| Acrobot-v1 (state: (6,), action: 3, discrete) | [duelingDQN: Acrobot_dqn_test](./src/test/test_dqn.py) | ![duelingDQN_Acrobot](./docs/pic/DQN_Acrobot-v1.gif) |
| LunarLander-v2 (state: (8,), action: 4, discrete) | [duelingDQN: LunarLander_dqn_test](./src/test/test_dqn.py) | ![duelingDQN_LunarLander](./docs/pic/duelingDQN_LunarLander-v2.gif) |
| ALE/DemonAttack-v5 (state: (210, 160, 3), action: 6, discrete) | [doubleDQN: DemonAttack_v5_dqn_new_test](./src/test/test_dqn.py) | ![doubleDQN_DemonAttack](./docs/pic/DQN_DemonAttack-v5.gif) |
| BipedalWalker-v3 (state: (24,), action: (4,), continuous in [-1.0, 1.0]) | [BipedalWalker_ddpg_test](./src/test/test_ddpg.py) | ![DDPG_BipedalWalker](./docs/pic/DDPG_BipedalWalker.gif) |
| BipedalWalkerHardcore-v3 (state: (24,), action: (4,), continuous in [-1.0, 1.0]) | [BipedalWalkerHardcore_TD3_test](./src/test/test_TD3.py) | ![TD3](./docs/pic/TD3_perf_new.gif) |
| Reacher-v4 (state: (11,), action: (2,), continuous in [-1.0, 1.0]) | [sac_Reacher_v4_test](./src/test/test_sac.py) | ![SAC_Reacher-v4](./docs/pic/SAC_Reacher-v4.gif) |
| Pusher-v4 (state: (23,), action: (7,), continuous in [-2.0, 2.0]) | [sac_Pusher_v4_test](./src/test/test_sac.py) | ![SAC_Pusher-v4](./docs/pic/SAC_Pusher-v4.gif) |
| CarRacing-v2 (state: (96, 96, 3), action: (3,), continuous in [-1.0, 1.0]) | [CarRacing_TD3_test](./src/test/test_TD3.py) | ![TD3_CarRacing-v2](./docs/pic/TD3_CarRacing-v2.gif) |
| InvertedPendulum-v4 (state: (4,), action: (1,), continuous in [-3.0, 3.0]) | [InvertedPendulum_TD3_test](./src/test/test_TD3.py) | ![TD3_InvertedPendulum](./docs/pic/TD3_InvertedPendulum-v4.gif) |
| HalfCheetah-v4 (state: (17,), action: (6,), continuous in [-1.0, 1.0]) | [HalfCheetah_v4_ppo_test](./src/test/test_ppo.py) | ![PPO_HalfCheetah-v4](./docs/pic/PPO_HalfCheetah-v4.gif) |
| ALE/Breakout-v5 (state: (210, 160, 3), action: 4, discrete) | [Breakout_v5_ppo2_test](./src/test/test_ppo_atari.py) | ![PPO2_Breakout_v5](./docs/pic/PPO2_Breakout_v5.gif) |
| ALE/DoubleDunk-v5 (state: (210, 160, 3), action: 18, discrete) | [DoubleDunk_v5_ppo2_test](./src/test/test_ppo_atari.py) | ![PPO2_DoubleDunk_v5](./docs/pic/PPO2_DoubleDunk_v5.gif) |
| ALE/Galaxian-v5 (state: (210, 160, 3), action: 6, discrete) | [Galaxian_v5_ppo2_test](./src/test/test_ppo_atari.py) | ![PPO2_Galaxian_v5](./docs/pic/PPO2_Galaxian_v5.gif) |
| Walker2d-v4 (state: (17,), action: (6,), continuous in [-1.0, 1.0]) | [cql_Walker2d_v4_simple_test](./src/test/test_cql.py) | ![CQL_Walker2d-v4](./docs/pic/CQL_Walk2d-v4.gif) |
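The clips above come from rendered rollouts. For reference, here is a minimal, framework-free sketch of recording such a GIF, assuming `imageio` is installed (it is not in the requirements table); a trained agent's action would replace `action_space.sample()`:

```python
# Hypothetical GIF-recording sketch -- not taken from this repo.
import gymnasium as gym
import imageio

env = gym.make('CartPole-v1', render_mode='rgb_array')
frames = []
obs, info = env.reset(seed=42)
for _ in range(200):
    frames.append(env.render())         # rgb_array frame, shape (H, W, 3)
    action = env.action_space.sample()  # stand-in for a trained policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
imageio.mimsave('rollout.gif', frames, fps=30)  # imageio v2-style call
```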