# SLM

**Repository Path**: greitzmann/SLM

## Basic Information

- **Project Name**: SLM
- **Description**: The implementation of http://aclweb.org/anthology/D18-1531
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-01-05
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

### Segmental Language Models

**Introduction**

A PyTorch Implementation of [Unsupervised Neural Word Segmentation for Chinese via Segmental Language Modeling](https://aclweb.org/anthology/D18-1531)

**Implemented features**

Models:

- Unsupervised Learning with Segmental Language Models
- Supervised Learning with Segmental Language Models

**Usage**

Chinese Corpus:

- *segmented.txt*: segmented data set for supervised training
- *unsegmented.txt*: unsegmented data set. You can use both this data set and *test.txt* for unsupervised training
- *test.txt*: unsegmented data set for evaluation
- *test_gold.txt*: gold segmented test data set
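
The README does not spell out the file format. The sketch below assumes the common convention of one sentence per line with words separated by whitespace, and shows how a gold file such as *test_gold.txt* would relate to its unsegmented counterpart *test.txt*. The helper name `unsegment` is hypothetical and not part of this repository.

```python
def unsegment(line: str) -> str:
    """Remove all whitespace so only the raw character sequence remains.

    Assumes words in the segmented line are separated by whitespace.
    """
    return "".join(line.split())


# Example: a segmented sentence and the unsegmented string derived from it.
segmented = "我 喜欢 北京"
print(unsegment(segmented))
```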

**Train**

For example, the following command trains an unsupervised SLM on the PKU dataset with a maximal segment length of 4 on GPU 0.

```shell
bash run.sh train unsupervised pku 4 0
```

See run.sh and the argparse configuration in codes/run.py for the remaining arguments and further details.
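
As described in the paper, unsupervised training maximizes the marginal likelihood of a character sequence, summed over every segmentation whose segments are at most k characters long. The sketch below illustrates that sum with a forward dynamic program in plain Python. It is not the code in model.py: the function name `log_marginal` and the `segment_logprob(start, end)` callback are assumptions made for illustration.

```python
import math
from typing import Callable


def log_marginal(n: int, max_len: int,
                 segment_logprob: Callable[[int, int], float]) -> float:
    """Log of the summed probability of all segmentations of n characters.

    segment_logprob(start, end) is a hypothetical scorer returning the log
    probability of the segment covering characters [start, end).
    """
    # alpha[t] = log of the total probability of all segmentations of the first t characters.
    alpha = [-math.inf] * (n + 1)
    alpha[0] = 0.0
    for end in range(1, n + 1):
        terms = [
            alpha[end - length] + segment_logprob(end - length, end)
            for length in range(1, min(max_len, end) + 1)
        ]
        top = max(terms)
        if top == -math.inf:
            continue  # no valid segmentation ends here
        # Numerically stable log-sum-exp over the candidate last segments.
        alpha[end] = top + math.log(sum(math.exp(t - top) for t in terms))
    return alpha[n]


# Toy scorer: every segment has probability 0.5, regardless of content.
uniform = lambda start, end: math.log(0.5)
print(math.exp(log_marginal(3, 2, uniform)))
```

With three characters and k = 2 there are three valid segmentations (1+1+1, 1+2, 2+1), so the toy scorer gives 0.125 + 0.25 + 0.25 = 0.625. A larger k admits more segmentations per position, which is one reason training time grows with the maximal segment length.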

**Predict**

```shell
bash run.sh predict unsupervised pku 4 0
```
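
Prediction amounts to finding the highest-scoring segmentation under the maximal segment length. The sketch below does this with Viterbi-style dynamic programming. It is illustrative only: `best_segmentation`, `make_toy_scorer` and the tiny `LEXICON` are hypothetical names, and the toy scorer stands in for the neural segment probabilities the actual model would provide.

```python
import math
from typing import Callable, List


def best_segmentation(sentence: str, max_len: int,
                      segment_logprob: Callable[[int, int], float]) -> List[str]:
    """Return the highest-scoring segmentation with segments of at most max_len characters."""
    n = len(sentence)
    best = [-math.inf] * (n + 1)  # best[t]: best score for the first t characters
    back = [0] * (n + 1)          # back[t]: start index of the last segment ending at t
    best[0] = 0.0
    for end in range(1, n + 1):
        for length in range(1, min(max_len, end) + 1):
            start = end - length
            score = best[start] + segment_logprob(start, end)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the sentence to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(sentence[start:end])
        end = start
    return words[::-1]


# Toy scorer (an assumption for illustration): known words are cheap,
# unknown single characters cost more, unknown longer segments cost most.
LEXICON = {"我", "喜欢", "北京"}


def make_toy_scorer(sentence: str) -> Callable[[int, int], float]:
    def score(start: int, end: int) -> float:
        segment = sentence[start:end]
        if segment in LEXICON:
            return -1.0
        return -5.0 if len(segment) == 1 else -20.0
    return score


sentence = "我喜欢北京"
print(best_segmentation(sentence, 2, make_toy_scorer(sentence)))  # ['我', '喜欢', '北京']
```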

**Evaluation**

```shell
bash run.sh eval unsupervised pku 4
```
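
Word segmentation is conventionally scored with word-level precision, recall and F1, where a predicted word counts as correct only if both of its boundaries match the gold segmentation. The sketch below shows that computation, assuming whitespace-separated words and one sentence per line. It is not the repository's evaluation script, and the function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold_lines: Iterable[str],
                     pred_lines: Iterable[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall and F1 over paired gold/predicted lines."""
    correct = gold_total = pred_total = 0
    for gold, pred in zip(gold_lines, pred_lines):
        gold_spans = to_spans(gold.split())
        pred_spans = to_spans(pred.split())
        correct += len(gold_spans & pred_spans)  # words with both boundaries right
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Gold has 3 words, the prediction has 4, and 2 of them match exactly.
print(segmentation_prf(["我 喜欢 北京"], ["我 喜 欢 北京"]))
```

In the example the prediction splits "喜欢" into two characters, so precision is 2/4, recall is 2/3 and F1 is 4/7.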

**Speed**

Training usually converges in about 30 to 50 minutes, depending on the maximal segment length (2 to 4).

**Unsupervised results of the SLM model (Maximal Segment Length = k)**

| Dataset | PKU           | MSR           | AS            | CityU         |
| ------- | ------------- | ------------- | ------------- | ------------- |
| k = 2   | 0.797 (0.802) | 0.776 (0.785) | 0.794 (0.794) | 0.786 (0.782) |
| k = 3   | 0.803 (0.798) | 0.784 (0.794) | 0.800 (0.803) | 0.803 (0.805) |
| k = 4   | 0.797 (0.792) | 0.782 (0.790) | 0.798 (0.804) | 0.798 (0.797) |

Note that this is a re-implementation of the SLM. Because of differences in detailed settings, such as the data loader, dropout rate and learning rate, its results differ slightly from those reported in the paper.

**Using the library**

The python library is organized around 4 objects:

- InputDataset (dataloader.py): prepares the data stream for training and evaluation
- CWSTokenizer (tokenization.py): works with InputDataset for data pre-processing
- SegmentalLM (model.py): builds the model and provides the train/test API for SLM
- SLMConfig (model.py): manages configurations for SLM

The run.py file contains the main function, which parses arguments, reads data, initializes the model and runs the training loop.

**Citation**

If you use this code, please cite the following [paper](https://aclweb.org/anthology/D18-1531):

```bibtex
@inproceedings{sun2018unsupervised,
  title={Unsupervised Neural Word Segmentation for Chinese via Segmental Language Modeling},
  author={Sun, Zhiqing and Deng, Zhi-Hong},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  pages={4915--4920},
  year={2018}
}
```