# AlphaGen
Automatic formulaic alpha generation with reinforcement learning.
This repository contains the code for our paper *Generating Synergistic Formulaic Alpha Collections via Reinforcement Learning*, accepted by [KDD 2023](https://kdd.org/kdd2023/), Applied Data Science (ADS) track, and publicly available on the [ACM DL](https://dl.acm.org/doi/10.1145/3580305.3599831). Some extensions of this work are also included in this repo.
## Repository Structure
- `/alphagen` contains the basic data structures and the essential modules for starting an alpha mining pipeline;
- `/alphagen_qlib` contains the qlib-specific APIs for data preparation;
- `/alphagen_generic` contains data structures and utils designed for our baselines, which largely follow the [gplearn](https://github.com/trevorstephens/gplearn) APIs, with modifications for the quant pipeline;
- `/alphagen_llm` contains LLM client abstractions and a set of prompts useful for LLM-based alpha generation, and also provides some LLM-based automatic iterative alpha-generation routines;
- `/gplearn` and `/dso` contain modified versions of our baselines;
- `/scripts` contains several scripts for running the experiments.
## Result Reproduction
Note that you can either use our built-in alpha calculation pipeline (see Choice 1) or implement an adapter to your own pipeline (see Choice 2).
### Choice 1: Stock data preparation
The built-in pipeline requires the Qlib library and locally stored stock data.
- READ THIS! We need some of the metadata (but not the actual stock price/volume data) provided by Qlib, so first follow the data preparation process in [Qlib](https://github.com/microsoft/qlib#data-preparation).
- The actual stock data we use is retrieved from [baostock](http://baostock.com/baostock/index.php/%E9%A6%96%E9%A1%B5), due to concerns about the timeliness and reliability of the data source used by Qlib.
- The data can be downloaded by running the script `data_collection/fetch_baostock_data.py`. The newly downloaded data is saved to `~/.qlib/qlib_data/cn_data_baostock_fwdadj` by default. This path can be customized to fit your needs, but make sure to use the correct path when loading the data: in `alphagen_qlib/stock_data.py`, function `StockData._init_qlib`, the path should be passed to Qlib via `qlib.init(provider_uri=path)`, as sketched below.
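For illustration, here is a minimal sketch of initializing Qlib against a customized data location (the path below is just the default from the step above):
```python
# Minimal sketch: point Qlib at the downloaded baostock data before
# constructing StockData.  Substitute your own path if you customized
# the download location.
import qlib
from qlib.config import REG_CN

data_path = "~/.qlib/qlib_data/cn_data_baostock_fwdadj"
qlib.init(provider_uri=data_path, region=REG_CN)
```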
### Choice 2: Adapt to external pipelines
If you have a better alpha calculation pipeline, you can implement an adapter for `alphagen.data.calculator.AlphaCalculator`. The interface is defined as follows:
```python
from abc import ABCMeta, abstractmethod
from typing import List, Tuple

from alphagen.data.expression import Expression


class AlphaCalculator(metaclass=ABCMeta):
    @abstractmethod
    def calc_single_IC_ret(self, expr: Expression) -> float:
        'Calculate IC between a single alpha and a predefined target.'

    @abstractmethod
    def calc_single_rIC_ret(self, expr: Expression) -> float:
        'Calculate Rank IC between a single alpha and a predefined target.'

    @abstractmethod
    def calc_single_all_ret(self, expr: Expression) -> Tuple[float, float]:
        'Calculate both IC and Rank IC between a single alpha and a predefined target.'

    @abstractmethod
    def calc_mutual_IC(self, expr1: Expression, expr2: Expression) -> float:
        'Calculate IC between two alphas.'

    @abstractmethod
    def calc_pool_IC_ret(self, exprs: List[Expression], weights: List[float]) -> float:
        'First combine the alphas linearly, '
        'then calculate IC between the linear combination and a predefined target.'

    @abstractmethod
    def calc_pool_rIC_ret(self, exprs: List[Expression], weights: List[float]) -> float:
        'First combine the alphas linearly, '
        'then calculate Rank IC between the linear combination and a predefined target.'

    @abstractmethod
    def calc_pool_all_ret(self, exprs: List[Expression], weights: List[float]) -> Tuple[float, float]:
        'First combine the alphas linearly, '
        'then calculate both IC and Rank IC between the linear combination and a predefined target.'
```
Reminder: the values evaluated from different alphas may have drastically different scales, so we recommend normalizing them before combination.
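For concreteness, here is a toy adapter sketch, not the repository's implementation. It assumes a hypothetical `evaluate(expr) -> np.ndarray` hook from your own pipeline, returning alpha values aligned with a 1-D target array, and it z-scores each alpha before combination as recommended above (rank ties are ignored for brevity):
```python
# Toy adapter sketch (NOT the repository's implementation).
from typing import Callable, List, Tuple

import numpy as np

from alphagen.data.calculator import AlphaCalculator
from alphagen.data.expression import Expression  # assumed import path


def _ic(a: np.ndarray, b: np.ndarray) -> float:
    'Pearson correlation (IC) between two aligned value arrays.'
    return float(np.corrcoef(a, b)[0, 1])


def _rank_ic(a: np.ndarray, b: np.ndarray) -> float:
    'Spearman correlation (Rank IC): Pearson IC computed on ranks.'
    def ranks(x: np.ndarray) -> np.ndarray:
        return x.argsort().argsort().astype(float)
    return _ic(ranks(a), ranks(b))


class MyCalculator(AlphaCalculator):
    def __init__(self, evaluate: Callable[[Expression], np.ndarray], target: np.ndarray):
        self._evaluate = evaluate  # your pipeline's expression evaluator
        self._target = target

    def _normalized(self, expr: Expression) -> np.ndarray:
        'Z-score the alpha so pool members share a common scale.'
        values = self._evaluate(expr)
        return (values - values.mean()) / values.std()

    def _pool(self, exprs: List[Expression], weights: List[float]) -> np.ndarray:
        'Linearly combine normalized alphas with the given weights.'
        return sum(w * self._normalized(e) for e, w in zip(exprs, weights))

    def calc_single_IC_ret(self, expr: Expression) -> float:
        return _ic(self._evaluate(expr), self._target)

    def calc_single_rIC_ret(self, expr: Expression) -> float:
        return _rank_ic(self._evaluate(expr), self._target)

    def calc_single_all_ret(self, expr: Expression) -> Tuple[float, float]:
        return self.calc_single_IC_ret(expr), self.calc_single_rIC_ret(expr)

    def calc_mutual_IC(self, expr1: Expression, expr2: Expression) -> float:
        return _ic(self._evaluate(expr1), self._evaluate(expr2))

    def calc_pool_IC_ret(self, exprs: List[Expression], weights: List[float]) -> float:
        return _ic(self._pool(exprs, weights), self._target)

    def calc_pool_rIC_ret(self, exprs: List[Expression], weights: List[float]) -> float:
        return _rank_ic(self._pool(exprs, weights), self._target)

    def calc_pool_all_ret(self, exprs: List[Expression], weights: List[float]) -> Tuple[float, float]:
        combined = self._pool(exprs, weights)
        return _ic(combined, self._target), _rank_ic(combined, self._target)
```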
### Before running
All principal components of our experiment are located in [train_maskable_ppo.py](train_maskable_ppo.py).
These parameters may help you build an `AlphaCalculator`:
- instruments (Set of instruments)
- start_time & end_time (Data range for each dataset)
- target (Target stock trend, e.g., 20d return rate)
These parameters will define an RL run (an illustrative setup for both groups of parameters is sketched after the list):
- batch_size (PPO batch size)
- features_extractor_kwargs (Arguments for LSTM shared net)
- device (PyTorch device)
- save_path (Path for checkpoints)
- tensorboard_log (Path for TensorBoard)
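As a rough illustration (the values and the expression import are assumptions for the sketch, not tuned settings from the paper), these parameters might be set up as follows:
```python
# Illustrative sketch only: names mirror the parameter lists above; the
# target expression uses alphagen's Expression DSL (assumed import).
from alphagen.data.expression import *

# Parameters for building an AlphaCalculator.
instruments = "csi300"                              # set of instruments
start_time, end_time = "2009-01-01", "2018-12-31"   # data range
close = Feature(FeatureType.CLOSE)
target = Ref(close, -20) / close - 1                # 20d return rate

# Parameters defining an RL run.
batch_size = 128                                    # PPO batch size
features_extractor_kwargs = {"n_layers": 2}         # hypothetical LSTM kwargs
device = "cuda:0"                                   # PyTorch device
save_path = "./checkpoints"                         # checkpoint directory
tensorboard_log = "./tb_logs"                       # TensorBoard log directory
```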
### Run the experiments
Please run the individual scripts from the root directory of this project as modules, i.e., `python -m scripts.NAME ARGS...`.
Use `python -m scripts.NAME -h` for information on the arguments.
- `scripts/rl.py`: Main experiments of AlphaGen/HARLA.
- `scripts/llm_only.py`: Alpha generator based solely on iterative interactions with an LLM.
- `scripts/llm_test_validity.py`: Tests on how the system prompt affects the valid alpha rate of an LLM.
### After running
- Model checkpoints and alpha pools are located in `save_path`;
- The model is compatible with [stable-baselines3](https://github.com/DLR-RM/stable-baselines3);
- Alpha pools are formatted as human-readable JSON;
- TensorBoard logs are located in `tensorboard_log`.
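A hedged sketch of loading these artifacts (the file names below are hypothetical; the training script name suggests `MaskablePPO` from [sb3-contrib](https://github.com/Stable-Baselines-Team/stable-baselines3-contrib)):
```python
# Hypothetical file names; check your save_path for the actual ones.
import json

from sb3_contrib import MaskablePPO  # assumed, based on train_maskable_ppo.py

model = MaskablePPO.load("./checkpoints/my_run/10000_steps")
with open("./checkpoints/my_run/10000_steps_pool.json") as f:
    pool = json.load(f)               # the human-readable alpha pool
print(json.dumps(pool, indent=4))
```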
## Baselines
### GP-based methods
[gplearn](https://github.com/trevorstephens/gplearn) implements genetic programming, a commonly used method for symbolic regression. We maintain a modified version of gplearn to make it compatible with our task. The corresponding experiment script is [gp.py](gp.py).
### Deep Symbolic Regression
[DSO](https://github.com/brendenpetersen/deep-symbolic-optimization) is a mature deep learning framework for symbolic optimization tasks. We maintain a minimal version of DSO to make it compatible with our task. The corresponding experiment script is [dso.py](dso.py).
## Trading (Experimental)
We implemented some trading strategies based on Qlib. See [backtest.py](backtest.py) and [trade_decision.py](trade_decision.py) for demos.
## Citing our work
```bibtex
@inproceedings{alphagen,
author = {Yu, Shuo and Xue, Hongyan and Ao, Xiang and Pan, Feiyang and He, Jia and Tu, Dandan and He, Qing},
title = {Generating Synergistic Formulaic Alpha Collections via Reinforcement Learning},
year = {2023},
doi = {10.1145/3580305.3599831},
booktitle = {Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
}
```
## Contributing
Feel free to submit issues or pull requests.
## Contributors
This work is maintained by the MLDM research group, [IIP, ICT, CAS](http://iip.ict.ac.cn/).
Maintainers include:
- [Hongyan Xue](https://github.com/xuehongyanL)
- [Shuo Yu](https://github.com/Chlorie)
Thanks to the following contributors:
- [@yigaza](https://github.com/yigaza)
Thanks to the following for their in-depth research on our project:
- *Factor-Based Stock Selection Series No. 95: DFQ Reinforcement Learning Factor Mining System* (research report, original title in Chinese)