# GDP-KPRN

**Repository Path**: Lamaric/GDP-KPRN

## Basic Information

- **Project Name**: GDP-KPRN
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-03-10
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# GDP-KPRN

***Last edit : 29 July 2019***

Recommender system referencing [KPRN](https://arxiv.org/pdf/1811.04540.pdf), [original github](https://github.com/eBay/KPRN), trained using custom [MovieLens-20M](https://grouplens.org/datasets/movielens/20m/) dataset.

> The model implemented has slight difference where no pooling layer added at the end of LSTM.

# Domain of problems
*Given a path between user and an item, predict how likely the user will interact with the item*

# Contents
- `/cache` : temporary files used in training
- `/data` : contains dataset (custom ml-20m dataset where only movies shows up in Ripple-Net's knowledge graph used) used in training.
- **`/log`** : contains training result stored in single folder named after training timestamp.
- **`/test`** : contains jupyter notebook used in testing the trained models

- `KPRN-LSTM.ipynb` : notebook to train model
- `main.py` : python3 version of KPRN-LSTM.ipynb

### Note
    *italic* means this folder is ommited from git, but necessary if you need to run experiments
    **bold** means this folder has it's own README, check it for detailed information :)

# Preparing 
## Installing dependencies 

    pip3 install -r requirements.txt

# How to run
1. Unzip `ratings_re.zip` and `ratings_re.z01` in `/data`
2. To preprocess, run `Preprocess.ipynb` notebook or `preprocess.py`
    ~~~
    python3 data/preprocess.py
    ~~~

3. To train, run `KPRN-LSTM.ipynb` notebook or `main.py`
    ~~~
    python3 main.py
    ~~~

## **! Caching warning !**
To start using new dataset, or if you wish to generate new dataset, please delete all items inside `/cache`

# Training
## How to change hyper parameter
Open `KPRN-LSTM.ipynb` or `main.py` and change the model parameters

# Testing / Evaluation
## How to check training result
1. Find the training result folder inside `/log` (find the latest), copy the folder name.
2. Create copy of latest jupyter notebook inside `/test` folder.
3. Rename folder to match a folder in `/log` (for traceability purpose).
4. Replace `TESTING_CODE` at the top of the notebook.
5. Run the notebook

# Final result

### KPRN - pool_size = 1 (no pooling)
| Evaluation size    | Prec@10 | Distinct@10  |  Unique items |
|--------------------|---------|--------------|---------------|
| Eval on  10 user   | 0.12028 |  0.70000     |    70         |
| Eval on  30 user   | 0.16667 |  0.60667     |   182         |
| Eval on 100 user   | 0.17471 |  0.38600     |   386         |

### KPRN - pool_size = 3
| Evaluation size    | Prec@10 | Distinct@10  |  Unique items |
|--------------------|---------|--------------|---------------|
| Eval on  10 user   | 0.20000 |  0.32000     |    32         |
| Eval on  30 user   | 0.24333 |  0.21000     |    63         |
| Eval on 100 user   | 0.25864 |  0.13400     |   134         |

### KPRN - pool_size = 5
| Evaluation size    | Prec@10 | Distinct@10  |  Unique items |
|--------------------|---------|--------------|---------------|
| Eval on  10 user   | 0.18667 |  0.31000     |    31         |
| Eval on  30 user   | 0.18000 |  0.14667     |    44         |
| Eval on 100 user   | 0.23453 |  0.08400     |    84         |


# Findings

**KPRN relies heavily upon paths**, and those paths are *handcrafted* by using the knowledge-graph. The paths are also sampled from hundreds of million possible paths.

- To find paths  
    ```
    (user -> seed item (eg: Castle on The Hill) -> entity (eg: Ed Sheeran) -> suggestion (eg: Perfect)) 
    ```
    from each seed, we can extract millions of paths (if not sampled), even after sampled using only one relation (eg: same artist, same albums, etc per seed, it still generates around 8k-10k path per seed.

- Each user has multiple item work as seed (typically 20+), this need to be sampled again to reduce paths generated and reduce computational cost. 

- We do make sure each item in suggestion has about 4 - 7 paths

- At this point, we only evaluate on around 75 - 150 path per user, out of possible hundred million possible paths  

- That's a huge possible source of sampling bias, but at the same time, it's kinda impossible to search through all paths.

- Looking from the result of KPRN, the usage of KG might turn out to be quite promising, especially to improve the diversity of suggestion.
  
- The downside of using KPRN is that the result rely heavily on 'handcrafted' paths, which undergoes a lot of downsampling steps. 
  
- Summary compared to non-KG RecSys: **Big improvement in terms of Prec@k and distinct rate**

# Pros
- Able to incorporate Knowledge Graph as another source of information
- Able to infer why a user is given such suggestions (based on path scores)
- Able to adjust between 'exploration and optimization' by applying result pooling (the model doesn't require to be re-trained)
- During training, the model converge really fast (< 10 epochs)

# Cons
- No original implementation usable
- Relies heavily upon paths, and those paths are 'handcrafted' by using the knowledge-graph and also sampled from hundreds of million possible paths.
- Huge possible sampling bias introduced from preprocessing step and path generation step.
- The model remember the user, the model need to be re-trained for every new user and  item addition.
- Require relatively slow preprocessing
- Super slow train and prediction time
- Loss function and metric used in training is not Prec@K, instead it uses accuracy.

# Experiment notes
- At the cost of slightly different implementation, it's easier to implement using high-level libraries such as Keras instead of using original version.
- path generation & predict time : about 4k path / second
- different sampling method and sampling parameter has insignificant effect
- Using more items as path generation 'seed' (for predicting suggestion), should lead to more personalized suggestions. (i.e. consider the suggestion by using more user history)
- By pooling path prediction-score for the same items, the model should be able to give a better suggestion since it considers multiple reasons instead of just a single reason.


# Author
- Jessin Donnyson - jessinra@gmail.com

# Contributors
- Michael Julio - michael.julio@gdplabs.id
- Fallon Candra - fallon.candra@gdplabs.id