# 关键短语感知的多新闻标题生成算法-四川大学

**Repository Path**: dicalab/KeyMultiHeadline

## Basic Information

- **Project Name**: 关键短语感知的多新闻标题生成算法-四川大学
- **Description**: 基于新闻中不同的关键词短语，生成多样化的新闻标题
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 2
- **Forks**: 0
- **Created**: 2021-07-05
- **Last Updated**: 2023-01-14

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Diverse, Controllable, and Keyphrase-Aware: A Corpus and Method for News Multi-Headline Generation

This repo contains the code and data of the following paper:
> **Diverse, Controllable, and Keyphrase-Aware: A Corpus and Method for News Multi-Headline Generation**, Dayiheng Liu, Yeyun Gong, Yu Yan, Jie Fu, Bo Shao, Daxin Jiang, Jiancheng Lv, Nan Duan, EMNLP2020 [[paper]](https://www.aclweb.org/anthology/2020.emnlp-main.505.pdf)


# Prerequisites
- Python 3.6
- numba 0.49.1
- tensorflow 1.10.0
- numpy 1.16.2
- nltk 3.3+
- cuda 9.0


# Datasets
Download the dataset file at [here](https://drive.google.com/file/d/17xEdwdXwLar1w7JkRqnsXh1n6kFokodN/view?usp=sharing).
```
tar -xzvf keyaware_news_emnlp20.tar
```
The data file directory is as follows
```
dataset/
|-- dev_keyaware_news_KQTAC.txt
|-- test_keyaware_news_KQTAC.txt
|-- test_keyaware_news_KQTAC_5slot.txt
|-- test_keyaware_news_KQTAC_multi.txt
`-- train_keyaware_news_KQTAC.txt
```
The files `train_keyaware_news_KQTAC.txt`, `dev_keyaware_news_KQTAC.txt`, and `test_keyaware_news_KQTAC.txt` contain 5-tuple <keyphrase, query, title, article, click_times>.

The file `test_keyaware_news_KQTAC_multi.txt` provides 5 tuples with different predicted keyphrases for each test example. For each article, we obtained 5 keyphrases by the SEQ2SEQ model as described in our paper.

Similarly, the file `test_keyaware_news_KQTAC_5slot.txt` provides 5 tuples with different predicted keyphrases for each test example. For each article, we obtained 5 keyphrases by the SLOT model as described in our paper.

# Baselines

## Headline Generation
Our headline generation baselines are based on BERT-base-uncased model, which can be downloaded at [here](https://drive.google.com/file/d/13K_OUOJvwTAFvaPs9faub49zfd28aWKq/view?usp=sharing).

run ``run_base.sh`` for BASE model training and testing.

run ``run.sh`` for our model training and testing.

The detailed hyper-parameters can be found in `run.sh`  and `config.py`.

The model checkpoints and log file will be saved at `OUTPUT_DATA_DIR` and `LOG_FILE` in `run.sh`, respectively.

Note that we also provide some variants of the keyphrase-aware headline generation model and keyphrase-agnostic baselines, which can be found in `model_pools/`. If you want to use other baselines, please replace the `MODEL=${2:-encoder_filter_query_plus_decoder_mem}` in `run.sh` to other models (the model names can be found in `model_pools/__init__.py`).

## Keyphrase Generation

To training the SEQ2SEQ model for keyphrase generation, please replace the content of the `title` with `key` for each sample in the `train_keyaware_news_KQTAC.txt`, `dev_keyaware_news_KQTAC.txt`, and `test_keyaware_news_KQTAC.txt`. After that, run ``run_base.sh`` to use the BASE model for keyphrase generation. If you want to generate diverse keyphrases, please set `--use_diverse_beam_search` and tune `--decode_gamma` to control the diverse penalty.

To training the SLOT model for keyphrase generation, we adopt the implementation of the answer span prediction provided by Huggingface, please refer to the code [here](https://github.com/huggingface/transformers/tree/master/examples/question-answering).


# Citation
```
@inproceedings{liu2020keynews,
    title = "Diverse, Controllable, and Keyphrase-Aware: A Corpus and Method for News Multi-Headline Generation",
    author="Liu, Dayiheng and Gong, Yeyun and Yan, Yu and Fu, Jie and Shao, Bo and Jiang, Daxin and Lv, Jiancheng and Duan, Nan",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    year = "2020"
}
```