# keras-dsb-2-refactor-branch

**Repository Path**: meanmee/keras-dsb-2-refactor-branch

## Basic Information

- **Project Name**: keras-dsb-2-refactor-branch
- **Description**: our (team keras) solution of the Kaggle Second Annual Data Science Bowl
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: https://www.kaggle.com/c/second-annual-data-science-bowl
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2016-02-01
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

## How to use

### Data

Before creating/running experiments, the training and validation data have to be created. This can be done locally (using the Kaggle-provided data) or remotely (downloading from our S3 bucket).

To create the data locally, run:

```python
python data.py create local train path_to_kaggle_data
python data.py create local validate path_to_kaggle_data
```

To download the data from the S3 bucket, run:

```python
python data.py create s3 train
python data.py create s3 validate
```

Running either of these commands will create .npy files in the folders **dataset/train_npy_files** and **dataset/validation_npy_files**.

---

### Experiments

##### To create an experiment, run:

```python
python experiment.py create experiment_name
```

where *experiment_name* should be something unique we can agree on. I suggest *experiment_name* to be in the format **{author_initials}\_{yyyyMMdd}\_{brief_description}**. For example, if I create a new experiment, I could name it **mj_20161101_simple_cnn** or something like that.

An experiment is automatically created in the **experiments/** folder, with some basic structure and files (check the **experiment_template/** folder for more details):

* **model.py** - define your model by implementing the function **get_model()**. (NOTE: **don't compile the model yet**)
* **training.py** - define your training metadata (optimizer, loss, batch size, etc.) by implementing the function **get_training_metadata()**
* **preprocess.py** - define the pre-processing done to X (cropping, resizing, rotating, etc.) and optionally y (converting to probabilities, CDF, etc.) by implementing the functions **pre_train()** (pre-processing done on the data before training) and **per_epoch()** (pre-processing done every epoch, i.e. data augmentation)
* **postprocess.py** - define the post-processing done to y by implementing the function **to_cdf()** (which transforms the network output into the corresponding CDF)
* **custom.py** - whatever custom stuff you want to use *specifically for this experiment* (custom optimizer, layer, loss function, etc.)
* **metadata** - consists of generated data such as the model serialized to JSON (**model.json**), the saved training metadata with the training performance log (**training.json**), and the model weights (**weights_systole.hdf5** and **weights_diastole.hdf5**)

You can check the function **create()** in **experiment.py** for more details. Illustrative sketches of what **model.py**, **training.py** and **postprocess.py** could look like are shown below.
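To make the structure above concrete, here is a minimal, hypothetical sketch of what an experiment's **model.py** could look like. The layer choices and input shape are assumptions for illustration only, not the project's actual architecture — check **experiment_template/** for the real interface.

```python
# model.py -- hypothetical example, not the project's actual model
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense


def get_model():
    """Return an *uncompiled* Keras model (compilation happens later in run())."""
    model = Sequential()
    # Input shape is an assumption: 30 slices treated as channels, 64x64 crops.
    model.add(Convolution2D(32, 3, 3, activation='relu', input_shape=(30, 64, 64)))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))
    model.add(Dense(1))  # e.g. a single predicted volume
    return model
```

A matching **training.py** could then return the training metadata as a plain dictionary. The exact keys expected by **run()** are an assumption here; the real ones are defined by **experiment_template/**.

```python
# training.py -- hypothetical example; the real keys come from experiment_template/
def get_training_metadata():
    return {
        'optimizer': 'adam',
        'loss': 'mse',
        'batch_size': 32,
        'nb_epoch': 200,
        'learning_rate': 1e-3,
        'decay': 0.99,
    }
```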
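Similarly, a hypothetical **postprocess.py** might convert the raw network output into the 600-value CDF the competition expects (columns P0–P599, i.e. P(volume ≤ v) for v = 0…599). The sketch below assumes, purely for illustration, that the network outputs a 600-way probability vector per case; an experiment predicting a single volume would need a different transform.

```python
# postprocess.py -- hypothetical example assuming a 600-way probability output
import numpy as np


def to_cdf(y_pred):
    """Turn per-bin probabilities of shape (n_cases, 600) into monotonic CDFs."""
    cdf = np.cumsum(y_pred, axis=1)
    return np.clip(cdf, 0.0, 1.0)
```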
##### To run an experiment, run:

```python
python experiment.py run experiment_name
```

When you run a given experiment, the following happens:

1. Load the models (systole and diastole) by calling the **model.get_model()** function
2. Load the training metadata by calling the **training.get_training_metadata()** function
3. Compile the models using the training metadata
4. Pre-process the loaded data by calling the **preprocess.pre_train()** function
5. Train:
    1. Load a batch of data from disk
    2. Pre-process the batch by calling the **preprocess.per_epoch()** function
    3. Set the learning rate (you can set a decay factor)
    4. Fit the models
    5. Evaluate the models with the CRPS metric
    6. Update the training performance log in **training.json**
    7. Save the model weights
    8. Optionally, create a submission whenever a new best model is found (enabled with `run(inter_train_submission=True)`)

You can check the function **run()** in **experiment.py** for more details.

*NOTE: this function shouldn't be changed for a specific experiment, since it does things common to all experiments. Of course, before everyone starts using it, the function should be adjusted according to the team members' opinions (in case I missed anything). It would also be great if all changes to this function stayed backward compatible with older experiments, so we can still run them without problems.*

##### To continue an experiment, run:

```python
python experiment.py continue experiment_name
```

Based on the saved training metadata and saved weights, an experiment can be continued (if it was stopped for some reason).

##### To copy an experiment, run:

```python
python experiment.py copy experiment_name new_experiment_name
```

New experiments can be created by copying existing ones. This can be useful when you want to retry an experiment with a somewhat different model, hyper-parameters, etc.

##### To create a submission for an experiment, run:

```python
python experiment.py submission experiment_name
```

This loads the trained model, predicts values on the validation data, and then creates a submission file **submission.csv** in the experiment's **metadata/** folder.

---

### Utils

In the **utils/** folder you can find many utility functions which can be useful for *all experiments*.

* **plot_utils** allows you to build and update graphs in real time
* **s3_utils** allows uploading, downloading and syncing to AWS storage

An automated sync takes place after each epoch. To sync weights, pass `weights=True` to the `s3_sync()` method. For a manual sync, run `python utils/sync.py` from the project directory and it will automatically sync everything.

---

### Keras extras

In the **keras_extras/** folder we should add extra stuff that can be useful for *all experiments*.
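One example of the kind of shared helper that could live here (or in **utils/**) is the CRPS evaluation used in the training loop above. The sketch below is an illustrative NumPy version, not the project's actual implementation: it scores predicted CDFs against step-function CDFs built from the true volumes, following the competition's definition of the metric.

```python
# Hypothetical CRPS helper -- illustrative only, not the project's implementation
import numpy as np


def crps(true_volumes, pred_cdfs):
    """Continuous Ranked Probability Score for DSB2-style predictions.

    true_volumes : array of shape (n_cases,) with ground-truth volumes (ml)
    pred_cdfs    : array of shape (n_cases, 600) with P(volume <= v) for v = 0..599
    """
    v = np.arange(600)
    # Step-function CDF of the ground truth: 0 below the true volume, 1 at/above it.
    true_cdfs = (v[np.newaxis, :] >= np.asarray(true_volumes)[:, np.newaxis]).astype(float)
    return np.mean((pred_cdfs - true_cdfs) ** 2)
```

A Keras-friendly variant could also be added as a custom metric, but the NumPy form above is enough for the post-epoch evaluation described earlier.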