# dist_optuna_plus_wandb
**Repository Path**: valaxkong/dist_optuna_plus_wandb
## Basic Information
- **Project Name**: dist_optuna_plus_wandb
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-02-19
- **Last Updated**: 2025-02-19
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Distributed Optuna plus WandB
This repo provides a way to integrate [Optuna](https://optuna.org/) with [WandB](https://wandb.ai/site).
You may find this repo useful if you want to:
1. log and track your experiments with [WandB](https://wandb.ai/site).
2. enjoy automated hyperparameter search with [Optuna](https://optuna.org/).
3. run multiple hyperparameter-search trials at the same time.
## Structure Overview
distributed_optuna_plus_wandb provides a trainer; the tuner is built on top of the trainer.
### Trainer Design
The following code shows the structure of `run()`.
```python
def run(self):
    self.before_train()
    for epoch in range(self.max_epoch):
        self.current_epoch = epoch
        self._iter_train_loader = iter(self.train_loader)
        for batch_idx in range(self.len_loader):
            self.current_iter = batch_idx + self.current_epoch * self.len_loader
            self.before_iter()
            # train one iter
            self.model.train()
            batch = next(self._iter_train_loader)
            self.send_batch_to_device(batch)
            self.train_one_iter(batch)
            self.model.eval()
            # valid if needed
            if self.current_iter % self.valid_freq == 0:
                with torch.no_grad():
                    self.valid()
            self.after_iter()
    self.after_train()
```
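The `before_*`/`after_*` calls are the trainer's hook points. As a hypothetical illustration (the hook class and its wiring below are assumptions, not the repo's actual API), a custom hook that prints the training loss every N iterations could look like this:

```python
# Hypothetical hook sketch: assumes the trainer forwards before_iter()/after_iter()
# to registered hooks and exposes current_iter and train_out as in run() above.
class PrintLossHook:
    def __init__(self, trainer, every_n_iters=100):
        self.trainer = trainer
        self.every_n_iters = every_n_iters

    def after_iter(self):
        it = self.trainer.current_iter
        if it % self.every_n_iters == 0:
            loss = self.trainer.train_out["loss"]
            print(f"iter {it}: loss = {float(loss):.4f}")
```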
### Tuner Design
distributed_optuna_plus_wandb mainly relies on `OptunawandbHook` and `OptunawandbBase`.
`OptunawandbBase` is a controller that runs the trainer.
`OptunawandbHook` takes over logging, pruning, and value reporting for each trial.
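Conceptually, what the per-trial reporting amounts to can be expressed with plain Optuna and WandB calls. The sketch below only illustrates that idea; the function name and the `acc` key are assumptions, not the repo's API:

```python
import optuna
import wandb

def report_and_maybe_prune(trial, step, valid_out):
    """Log validation metrics to WandB, report the target value to Optuna,
    and raise TrialPruned if the pruner decides to stop this trial."""
    wandb.log({f"valid_{k}": float(v) for k, v in valid_out.items()}, step=step)
    trial.report(float(valid_out["acc"]), step)
    if trial.should_prune():
        raise optuna.TrialPruned()
```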
## Usage
You can run an [example](https://github.com/ThisUserIsSuperCool/dist_optuna_plus_wandb/blob/master/src/main.py) to see how to tune params while training an LSTM on MNIST using distributed_optuna_plus_wandb.
### Step 1: Use the Trainer
Run `train.py` to use the trainer on its own; see the [example](https://github.com/ThisUserIsSuperCool/dist_optuna_plus_wandb/blob/master/src/trainer.py).
```python
class Trainer(BaseTrainer):
    def __init__(self, model, opt, train_loader, valid_loader, test_loader, device, max_epoch, valid_freq=1):
        super(Trainer, self).__init__(model, opt, train_loader, valid_loader, test_loader, device, max_epoch, valid_freq)

    def train_one_iter(self, batch):
        x, y = batch
        logits = self.model(x)
        loss = F.cross_entropy(logits, y)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        self.train_out = {
            "loss": loss,
        }

    def valid(self):
        ...
        self.valid_out = {
            "loss": loss,
        }


trainer = Trainer(model=model, opt=opt, train_loader=train_loader, valid_loader=valid_loader,
                  test_loader=test_loader, device=torch.device("cuda:0" if torch.cuda.is_available() else "cpu"),
                  max_epoch=5, valid_freq=10)
trainer.run()
```
The default logging is handled by the `LogHook`. The metrics you store in `self.train_out` and `self.valid_out` are automatically plotted and saved. After running the experiment, you will get a log dir like this:
```
output
└── 2023_07_13_10_23
    ├── Plots of training summary metrics.pdf
    ├── Plots of training summary metrics.png
    ├── Plots of validation summary metrics.pdf
    ├── Plots of validation summary metrics.png
    └── log.txt
```
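Any additional scalar you put into those dicts is picked up as well. For example (a sketch, assuming the `LogHook` plots every key it finds in `train_out`/`valid_out`), you could also track training accuracy inside `train_one_iter` from the example above:

```python
# sketch: extra metrics added to self.train_out are plotted automatically
# (assumption: the LogHook plots every key found in train_out / valid_out)
self.train_out = {
    "loss": loss,
    "acc": (logits.argmax(dim=1) == y).float().mean(),
}
```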
### Step 2: Use the Tuner
1. Create a class that inherits from `OptunawandbBase` and implement the methods `load_items`, `get_suggested_params`, and `get_other_cfg`.
2. Call `run()` to start searching for the best params.
```python
class ParamTune(OptunawandbBase):
    def load_items(self, **kwargs):
        """Return the trainer."""
        # load cfg defined in get_suggested_params() and get_other_cfg()
        opt = optim.Adam(model.parameters(), lr=kwargs["lr"])
        # init a trainer as in step 1
        trainer = Trainer(
            ...
        )
        return trainer

    def get_suggested_params(self, trial):
        """Define the params that you want to tune."""
        trial_params = dict(
            lr=trial.suggest_loguniform('lr', 1e-4, 1e-1),
            hidden_nodes=trial.suggest_categorical('hidden_nodes', [32, 64, 128]),
        )
        return trial_params

    def get_other_cfg(self):
        """Define other params that you don't want to tune."""
        return dict(
            input_size=28,
            num_layers=2,
        )


paramtuner = ParamTune(
    proj_name='myproj',
    exp_purpose='Distributed Optuna plus WandB: test the code.',
    do_init_wandb=True,        # whether to init wandb; set to False to run the code locally
    metric='valid_acc',        # metric to be optimized; must be a key of self.valid_out in Trainer.valid()
    n_search_trials=10,        # number of trials to find the best hyperparams
    report_last_n=5,
    do_fixed_trial=False,
    n_fixed_trials=1,
    trial_name=None,
    storage='sqlite:///optuna.db',
    n_jobs=4,                  # number of parallel jobs
    n_gpus=2,                  # number of gpus
    client=None,
    allow_prune=True,          # whether to prune trials
)

# register the given fixed trial(s) using self.study.enqueue_trial(trial_param),
# i.e. enqueue trials with given parameter values to fix the sampled
# parameters at the beginning of the search
trial_param_list = [
    dict(lr=1e-2, hidden_size=256),
    dict(lr=1e-2, hidden_size=64),
]

# start searching...
paramtuner.run(
    trial_param_list=trial_param_list,
)
```
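Under the hood, the tuner automates roughly what the following plain-Optuna sketch does. The optimization direction, the returned metric key, and the study setup here are assumptions, and the actual tuner also wires in WandB and optuna-distributed:

```python
import optuna

def objective(trial):
    # combine suggested and fixed cfg, build the trainer, train, return the metric
    params = {**paramtuner.get_suggested_params(trial), **paramtuner.get_other_cfg()}
    trainer = paramtuner.load_items(**params)
    trainer.run()
    return float(trainer.valid_out["acc"])  # assumed metric key

study = optuna.create_study(direction="maximize", storage="sqlite:///optuna.db")
study.optimize(objective, n_trials=10, n_jobs=4)
```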
## Features
#### trial_param_list: the starting point of the search
You can fix the sampled parameters at the beginning of the search.
distributed_optuna_plus_wandb will call `study.enqueue_trial`; see the [docs](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.study.Study.html#optuna.study.Study.enqueue_trial).
Define your `trial_param_list` and pass it to `run()`; the listed parameter sets become the starting points of the search.
```python
# register the given fixed trial(s) using self.study.enqueue_trial(trial_param),
# i.e. enqueue trials with given parameter values to fix the sampled
# parameters at the beginning of the search
trial_param_list = [
    dict(lr=1e-2, hidden_size=256),  # the first set of params to run
    dict(lr=1e-2, hidden_size=64),   # the second set of params to run
]

# start searching...
paramtuner.run(
    trial_param_list=trial_param_list,
)
```
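In plain Optuna, this corresponds to calling `enqueue_trial` before the search starts; a minimal standalone sketch (the study setup here is only illustrative):

```python
import optuna

study = optuna.create_study(direction="maximize")
# the first trials use these exact values before the sampler takes over
study.enqueue_trial({"lr": 1e-2, "hidden_size": 256})
study.enqueue_trial({"lr": 1e-2, "hidden_size": 64})
```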
#### Fixed trials: run your given settings N times
- `do_fixed_trial`: set to True to run a fixed number of trials (`n_fixed_trials`) with fixed params; the fixed params should be specified in `trial_param_list`, which you pass to `run()`.
- `n_fixed_trials`: number of fixed trials to run for each parameter set.

In the following example, the first set of params is run 5 times and then the second set of params is run 5 times, i.e., 10 trials in total.
For each trial/run, the seed is automatically set to 42 + trial_number when the same set of params has already been used.
```python
paramtuner = ParamTune(
    ...
    do_fixed_trial=True,
    n_fixed_trials=5,
    ...
)

trial_param_list = [
    dict(lr=1e-2, hidden_size=256, **paramtuner.get_other_cfg()),  # the first set of params to run
    dict(lr=1e-2, hidden_size=64, **paramtuner.get_other_cfg()),   # the second set of params to run
]

# start searching...
paramtuner.run(
    trial_param_list=trial_param_list,
)
```
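The per-trial seeding described above (42 + trial number) is typically done along these lines; this is a sketch of the idea, not necessarily the repo's exact implementation:

```python
import random

import numpy as np
import torch

def set_seed_for_trial(trial_number, base_seed=42):
    """Seed Python, NumPy, and PyTorch with base_seed + trial_number."""
    seed = base_seed + trial_number
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
```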
#### Define which metric to optimize
1. Save the metric that you want to optimize to `self.valid_out` (a dict) in the trainer's `valid()` method.
```python
class Trainer(BaseTrainer):
    ...
    def valid(self):
        metric_to_report = ...
        ...
        self.valid_out = {
            "loss": loss,
            "metric_to_report": metric_to_report,
        }
```
2. Tell the tuner which metric to optimize.
```python
paramtuner = ParamTune(
    ...
    metric='valid_metric_to_report',  # must be prefixed with 'valid_'
    report_last_n=5,                  # report the mean of the last five valid_metric_to_report values in the history
    ...
)
```
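For reference, `report_last_n=5` means the value handed to Optuna is the mean of the last five reported metric values (assumed behaviour, illustrated below):

```python
# sketch: average the last report_last_n entries of the metric history
history = [0.91, 0.93, 0.94, 0.95, 0.96, 0.97]
report_last_n = 5
value_for_optuna = sum(history[-report_last_n:]) / report_last_n  # 0.95
```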
#### Prune the trial
Pruning is allowed by default, and it is forced off in fixed-trial mode (`do_fixed_trial=True` and `n_fixed_trials > 1`). Set `allow_prune=False` to disable pruning.
Pruning is based on the reported value, i.e., the `metric` you set in the previous subsection.
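In Optuna, the pruner is configured on the study; as a hedged example (the repo's actual pruner choice is not specified here), a median pruner stops a trial whose reported metric is worse than the median of previous trials at the same step:

```python
import optuna

# sketch: attach a pruner to the study; pruning decisions are driven by the
# values reported via trial.report(...)
study = optuna.create_study(
    direction="maximize",
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=2),
)
```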
## Dependency
```
torch 1.11.0
optuna-distributed 0.4.0
optuna 3.1.1
wandb 0.15.0
python 3.8.12
```
## Acknowledgments
- [PyTorch](https://pytorch.org/), [Optuna](https://optuna.org/), and [WandB](https://wandb.ai/site) serve as the foundation of this template.
- [Core PyTorch Utils (CPU)](https://github.com/serend1p1ty/core-pytorch-utils) was the main reference when developing the trainer; only the trainer's basic functionality is implemented in this repo.
- [optuna-distributed](https://github.com/xadrianzetx/optuna-distributed) is used to enable distributed optimization with Optuna. See also its post on [Medium](https://medium.com/optuna/running-distributed-hyperparameter-optimization-with-optuna-distributed-17bb2f7d422d).