# TKT

**Repository Path**: mulinzi1996/TKT

## Basic Information

- **Project Name**: TKT
- **Description**: PyTorch implementations of various knowledge tracing models
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 3
- **Created**: 2021-07-10
- **Last Updated**: 2021-07-10

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# TKT

[![PyPI](https://img.shields.io/pypi/v/TKT.svg)](https://pypi.python.org/pypi/TKT)
[![Build Status](https://www.travis-ci.org/tswsxk/TKT.svg?branch=master)](https://www.travis-ci.org/tswsxk/TKT)
[![codecov](https://codecov.io/gh/tswsxk/TKT/branch/master/graph/badge.svg)](https://codecov.io/gh/tswsxk/TKT)

Multiple knowledge tracing models implemented in PyTorch.

For convenient dataset downloading and preprocessing for the knowledge tracing task, visit [EduData](https://github.com/bigdata-ustc/EduData) for a handy API.

Visit https://base.ustc.edu.cn for more of our works.

## Performance on Well-Known Datasets

With [`EduData`](https://pypi.python.org/pypi/EduData), we tested the models' performance. The AUC results are listed as follows:

| model name | synthetic | assistment_2009_2010 | junyi |
| ---------- | --------- | -------------------- | ----- |
| DKT        | 0.6438748958881487 | 0.7442573465541942 | 0.8305416859735839 |
| DKT+       | **0.8062221383790489** | 0.7483424087919035 | **0.8497422607539136** |
| EmbedDKT   | 0.4858168704660636 | 0.7285572301977586 | 0.8194401881889697 |
| EmbedDKT+  | 0.7340996181876187 | **0.7490900876356051** | 0.8405445812109871 |
| DKVMN      | TBA | TBA | TBA |

The F1 scores are listed as follows:

| model name | synthetic | assistment_2009_2010 | junyi |
| ---------- | --------- | -------------------- | ----- |
| DKT        | 0.5813237474584396 | 0.7134380508024369 | 0.7732850122818582 |
| DKT+       | **0.7041804463370387** | **0.7137627713343819** | **0.7928075377114897** |
| EmbedDKT   | 0.4716821311199386 | 0.7095025134079656 | 0.7681817174082963 |
| EmbedDKT+  | 0.6316953625658291 | 0.7101790604990228 | 0.7903592922756097 |
| DKVMN      | TBA | TBA | TBA |

Information about the benchmark datasets can be found in the EduData docs.

All models are trained for 20 epochs with `batch_size=16`, and the best result is reported. We use `adam` with `learning_rate=1e-3` and apply `bucketing` to accelerate training. Each sample length is limited to 200.
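The tables above report AUC and F1. As a minimal sketch of how such metrics are typically computed from a model's predicted correctness probabilities (using scikit-learn and an assumed 0.5 decision threshold; this is illustrative, not TKT's exact evaluation code):

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Toy stand-ins for ground-truth responses and predicted P(correct) per interaction
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.3, 0.6, 0.8, 0.4])

auc = roc_auc_score(y_true, y_prob)                 # AUC is computed on raw probabilities
f1 = f1_score(y_true, (y_prob >= 0.5).astype(int))  # F1 needs hard labels; 0.5 threshold assumed
print(auc, f1)
```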
The hyper-parameters are listed as follows:

| model name | synthetic - 50 | assistment_2009_2010 - 124 | junyi - 835 |
| ---------- | -------------- | -------------------------- | ----------- |
| DKT | `hidden_num=int(100);dropout=float(0.5)` | `hidden_num=int(200);dropout=float(0.5)` | `hidden_num=int(900);dropout=float(0.5)` |
| DKT+ | `lr=float(0.2);lw1=float(0.001);lw2=float(10.0)` | `lr=float(0.1);lw1=float(0.003);lw2=float(3.0)` | `lr=float(0.01);lw1=float(0.001);lw2=float(1.0)` |
| EmbedDKT | `hidden_num=int(100);latent_dim=int(35);dropout=float(0.5)` | `hidden_num=int(200);latent_dim=int(75);dropout=float(0.5)` | `hidden_num=int(900);latent_dim=int(600);dropout=float(0.5)` |
| EmbedDKT+ | `lr=float(0.2);lw1=float(0.001);lw2=float(10.0)` | `lr=float(0.1);lw1=float(0.003);lw2=float(3.0)` | `lr=float(0.01);lw1=float(0.001);lw2=float(1.0)` |
| DKVMN | `hidden_num=int(50);key_embedding_dim=int(10);value_embedding_dim=int(10);key_memory_size=int(5);key_memory_state_dim=int(10);value_memory_size=int(5);value_memory_state_dim=int(10);dropout=float(0.5)` | `hidden_num=int(50);key_embedding_dim=int(50);value_embedding_dim=int(200);key_memory_size=int(50);key_memory_state_dim=int(50);value_memory_size=int(50);value_memory_state_dim=int(200);dropout=float(0.5)` | `hidden_num=int(600);key_embedding_dim=int(50);value_embedding_dim=int(200);key_memory_size=int(20);key_memory_state_dim=int(50);value_memory_size=int(20);value_memory_state_dim=int(200);dropout=float(0.5)` |

The number after `-` in the first row indicates the number of knowledge units (`ku_num`) in the dataset.

The datasets we used can either be found at [basedata-ktbd](http://base.ustc.edu.cn/data/ktbd/) or be downloaded by:

```shell
pip install EduData
edudata download ktbd
```

### Trick

* DKT: `hidden_num` is usually set to the hundred nearest to `ku_num`
* EmbedDKT: `latent_dim` is usually set to a value less than or equal to `sqrt(hidden_num * ku_num)`
* DKVMN: `key_embedding_dim = key_memory_state_dim` and `value_embedding_dim = value_memory_state_dim`

### Notice

Some interfaces of PyTorch may change across versions, such as:

```python
import torch
torch.nn.functional.one_hot
```

which may cause errors like:

```shell
AttributeError: module 'torch.nn.functional' has no attribute 'one_hot'
```

Besides that, there is a known bug `Segmentation fault: 11`:

```shell
Segmentation fault: 11

Stack trace:
[bt] (0) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2e6b160) [0x7f3e4b5b6160]
[bt] (1) /lib64/libc.so.6(+0x36340) [0x7f3ec3c89340]
[bt] (2) /usr/local/lib/python3.6/site-packages/torch/lib/libtorch.so(+0x40a5760) [0x7f3dc265c760]
[bt] (3) /usr/local/lib/python3.6/site-packages/torch/lib/libtorch.so(+0x40a35c5) [0x7f3dc265a5c5]
[bt] (4) /lib64/libstdc++.so.6(+0x5cb19) [0x7f3eb807db19]
[bt] (5) /lib64/libc.so.6(+0x39c29) [0x7f3ec3c8cc29]
[bt] (6) /lib64/libc.so.6(+0x39c77) [0x7f3ec3c8cc77]
[bt] (7) /lib64/libc.so.6(__libc_start_main+0xfc) [0x7f3ec3c7549c]
[bt] (8) python3() [0x41da20]
```

However, the above-mentioned bug does not affect training and evaluation.

P.S. If you think these problems are easy to solve, please do not hesitate to contact us :-).
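Regarding the `one_hot` incompatibility above: if you are pinned to an older PyTorch that lacks `torch.nn.functional.one_hot`, a minimal `scatter_`-based fallback can stand in for it. This helper is an illustrative sketch, not part of TKT:

```python
import torch


def one_hot(indices: torch.Tensor, num_classes: int) -> torch.Tensor:
    # Fallback for torch.nn.functional.one_hot on older PyTorch versions:
    # allocate a zero tensor with one extra trailing dimension, then scatter
    # a 1 into the position given by each index.
    shape = indices.shape + (num_classes,)
    out = torch.zeros(shape, dtype=torch.long, device=indices.device)
    return out.scatter_(-1, indices.unsqueeze(-1), 1)


print(one_hot(torch.tensor([2, 0]), num_classes=3))
# tensor([[0, 0, 1],
#         [1, 0, 0]])
```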
## Tutorial

### Installation

1. First, get the repo onto your computer via `git` or any way you like.
2. Suppose you created the project under your own `home` directory; then you can either
   1. use `pip install -e .` to install the package, or
   2. use `export PYTHONPATH=$PYTHONPATH:~/TKT`

### Data Format

In `TKT`, all sequences are stored in `json` format, such as:

```json
[[419, 1], [419, 1], [419, 1], [665, 0], [665, 0]]
```

Each item in the sequence represents one interaction. The first element of the item is the exercise id, and the second indicates whether the learner answered the exercise correctly: 0 for wrong, 1 for correct.

One line is one `json` record, corresponding to one learner's interaction sequence.

A demo loading program is presented as follows:

```python
import json
from tqdm import tqdm


def extract(data_src):
    responses = []
    step = 200  # split long sequences into chunks of at most 200 interactions
    with open(data_src) as f:
        for line in tqdm(f, "reading data from %s" % data_src):
            data = json.loads(line)
            for i in range(0, len(data), step):
                # skip chunks with fewer than 2 interactions
                if len(data[i: i + step]) < 2:
                    continue
                responses.append(data[i: i + step])
    return responses
```

The above program can be found in `TKT/TKT/shared/etl.py`.

To deal with datasets stored in the `tl` format:

```text
5
419,419,419,665,665
1,1,1,0,0
```

refer to the [EduData documentation](https://github.com/bigdata-ustc/EduData#format-converter).

### CLI

#### General Command Format

---

All commands invoking a model share the same canonical CLI form:

```shell
python Model.py $subcommand $parameters1 $parameters2 ...
```

There are several options for the subcommand; use `--help` to see the available options and their parameters:

```shell
python Model.py --help
python Model.py $subcommand --help
```

The CLI tools are built on [longling ConfigurationParser](https://longling.readthedocs.io/zh/latest/submodule/lib/index.html#module-longling.lib.parser).

#### Demo

---

As an example, suppose you created the project under your own `home` directory and created a `data` directory to store the data (like `train` and `test`) and the models. Assume you are going to test the models on the [ktbd](http://base.ustc.edu.cn/data/ktbd/) dataset, and the project tree looks as follows:

```text
└── TKT/
    ├── data/
    │   └── ktbd/
    │       ├── junyi/          <-- dataset
    │       │   ├── train.json
    │       │   └── test.json
    │       ├── ...
    │       └── synthetic/
    ├── ...
    └── TKT/
```

In each dataset directory, `train.json` is the training set and `test.json` is the test set. We want the models to be placed under the corresponding dataset directory, where a `model` directory is created to store all models. Thus, we use the following commands to train a model:

```shell
# basic
python3 DKT.py train \
    $HOME/TKT/data/ktbd/junyi/train.json $HOME/TKT/data/ktbd/junyi/test.json \
    --hyper_params "nettype=DKT;ku_num=int(835);hidden_num=int(900);dropout=float(0.5)" \
    --ctx="cuda:0" \
    --model_dir $HOME/TKT/data/ktbd/junyi/model/DKT

# advanced path configuration
python3 DKT.py train \
    \$data_dir/train.json \$data_dir/test.json \
    --hyper_params "nettype=DKT;ku_num=int(835);hidden_num=int(900);dropout=float(0.5)" \
    --ctx="cuda:0" \
    --model_name DKT \
    --root=$HOME/TKT \
    --root_data_dir=\$root/data/ktbd/\$dataset \
    --data_dir=\$root_data_dir \
    --dataset=junyi
```

And we get something like this:

```text
junyi/
├── model/
│   └── DKT/
│       ├── configuration.json
│       ├── DKT-0001.parmas
│       ├── DKT-0002.parmas
│       ├── ...
│       ├── DKT-0020.parmas
│       ├── result.json
│       └── result.log
├── test.json
└── train.json
```

The two commands above are equivalent. For how to use the advanced path configuration, refer to the [longling docs](https://longling.readthedocs.io/zh/latest/submodule/ML/index.html#configuration).
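The `--hyper_params` and `--loss_params` strings above follow a `name=type(value)` convention with `;` separators. As a rough illustration of how such a string maps to typed Python values (a hypothetical sketch, not longling's actual parser):

```python
def parse_params(spec: str) -> dict:
    """Parse a longling-style parameter string, e.g.
    "ku_num=int(835);hidden_num=int(900);dropout=float(0.5)",
    into a dict of typed values. Illustrative only, not longling's actual parser.
    """
    casts = {"int": int, "float": float, "str": str, "bool": lambda s: s == "True"}
    params = {}
    for item in spec.split(";"):
        key, _, raw = item.partition("=")
        type_name, sep, inner = raw.partition("(")
        if sep and raw.endswith(")") and type_name in casts:
            params[key] = casts[type_name](inner[:-1])  # e.g. int("835")
        else:
            params[key] = raw  # bare values such as nettype=DKT stay strings
    return params


print(parse_params("nettype=DKT;ku_num=int(835);dropout=float(0.5)"))
# {'nettype': 'DKT', 'ku_num': 835, 'dropout': 0.5}
```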
---

### DKT

```shell
# DKT
python3 DKT.py train \
    \$data_dir/train.json \$data_dir/test.json \
    --hyper_params "nettype=DKT;ku_num=int(835);hidden_num=int(900);dropout=float(0.5)" \
    --ctx="cuda:0" \
    --model_name DKT \
    --root=$HOME/TKT \
    --root_data_dir=\$root/data/ktbd/\$dataset \
    --data_dir=\$root_data_dir \
    --dataset=junyi

# DKT+
python3 DKT.py train \
    \$data_dir/train.json \$data_dir/test.json \
    --hyper_params "nettype=DKT;ku_num=int(835);hidden_num=int(900);dropout=float(0.5)" \
    --loss_params "lr=float(0.1);lw1=float(0.003);lw2=float(3.0)" \
    --ctx="cuda:0" \
    --model_name DKT+ \
    --root=$HOME/TKT \
    --root_data_dir=\$root/data/ktbd/\$dataset \
    --data_dir=\$root_data_dir \
    --dataset=junyi
```

### EmbedDKT

```shell
# EmbedDKT
python3 DKT.py train \
    \$data_dir/train.json \$data_dir/test.json \
    --hyper_params "nettype=EmbedDKT;ku_num=int(835);hidden_num=int(900);latent_dim=int(600);dropout=float(0.5)" \
    --ctx="cuda:0" \
    --model_name EmbedDKT \
    --root=$HOME/TKT \
    --root_data_dir=\$root/data/ktbd/\$dataset \
    --data_dir=\$root_data_dir \
    --dataset=junyi

# EmbedDKT+
python3 DKT.py train \
    \$data_dir/train.json \$data_dir/test.json \
    --hyper_params "nettype=EmbedDKT;ku_num=int(835);hidden_num=int(900);latent_dim=int(600);dropout=float(0.5)" \
    --loss_params "lr=float(0.1);lw1=float(0.003);lw2=float(3.0)" \
    --ctx="cuda:0" \
    --model_name EmbedDKT+ \
    --root=$HOME/TKT \
    --root_data_dir=\$root/data/ktbd/\$dataset \
    --data_dir=\$root_data_dir \
    --dataset=junyi
```

## Appendix

### Model

Many knowledge tracing models have been implemented in different frameworks. The following are the URLs of those implemented in Python (starred entries are the authors' versions):

* DKT [[tensorflow]](https://github.com/mhagiwara/deep-knowledge-tracing)
* DKT+ [[tensorflow*]](https://github.com/ckyeungac/deep-knowledge-tracing-plus)
* DKVMN [[mxnet*]](https://github.com/jennyzhang0215/DKVMN)
* KTM [[libfm]](https://github.com/jilljenn/ktm)
* EKT [[pytorch*]](https://github.com/bigdata-ustc/ekt)

### Dataset

Several datasets are suitable for this task; refer to the [BaseData ktbd doc](https://github.com/bigdata-ustc/EduData/blob/master/docs/ktbd.md) for details on these datasets.
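After downloading one of these datasets (e.g. via `edudata download ktbd`), a quick sanity check is to load a split with the `extract` helper shown in the Data Format section. The import path below is inferred from the `TKT/TKT/shared/etl.py` location mentioned there and may differ in your checkout:

```python
# Hypothetical import path, inferred from TKT/TKT/shared/etl.py; adjust if needed
from TKT.shared.etl import extract

# Each element is one learner's chunk of (exercise id, correctness) pairs,
# at most 200 interactions long
train_seqs = extract("data/ktbd/junyi/train.json")
print(len(train_seqs), train_seqs[0][:3])
```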