# HGAT

**Repository Path**: milo7hao/HGAT

## Basic Information

- **Project Name**: HGAT
- **Description**: Heterogeneous graph attention network for semi-supervised short text classification (EMNLP 2019)
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-09-06
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

An implementation of the EMNLP 2019 paper "[Heterogeneous Graph Attention Networks for Semi-supervised Short Text Classification](http://shichuan.org/doc/74.pdf)".

Thank you for your interest in our work! :smile:

# Requirements

- Anaconda3 (python 3.6)
- Pytorch 1.3.1
- gensim 3.6.0

# Easy Run

```
cd ./model/code/
python train.py
```

You may change the dataset by modifying the variable "dataset = 'example'" at the top of "train.py", or by passing arguments (see train.py).

Our datasets can be downloaded from [Google Drive](https://drive.google.com/open?id=1pz1IMdJqkKidD7eEc3T_2-VkrUhkUKd4).

PS: I accidentally deleted some files and have tried to restore them; I hope they run correctly.

# Prepare for your own dataset

The following files are required:

    ./model/data/YourData/
        ---- YourData.cites                // the adjacencies
        ---- YourData.content.text         // the features of texts
        ---- YourData.content.entity       // the features of entities
        ---- YourData.content.topic        // the features of topics
        ---- train.map                     // the indices of the training nodes
        ---- vali.map                      // the indices of the validation nodes
        ---- test.map                      // the indices of the testing nodes

The format is as follows:

- **YourData.cites**
  Each line contains an edge: "idx1\tidx2\n". eg: "98 13"
- **YourData.content.text**
  Each line contains a node: "idx\t[features]\t[category]\n", where [features] is a list of floats with '\t' as the delimiter.
  eg: "59 1.0 0.5 0.751 0.0 0.659 0.0 computers"
  For multi-label classification, [category] must be a one-hot vector with a space as the delimiter, eg: "59 1.0 0.5 0.751 0.0 0.659 0.0 0 1 1 0 1 0".
- **YourData.content.entity**
  Similar to .text, but with [category] set to "entity". eg: "13 0.0 0.0 1.0 0.0 0.0 entity"
- **YourData.content.topic**
  Similar to .text, but with [category] set to "topic". eg: "64 0.10 1.21 8.09 0.10 topic"
- **\*.map**
  Each line contains an index: "idx\n". eg: "98"

You can see the example in ./model/data/example/*

----

A simple data preprocessing code is provided. Running it successfully requires a token from a [tagme](https://sobigdata.d4science.org/web/tagme/tagme-help "TagMe") account (my personal token is provided in tagme.py, but it may become invalid in the future), [Wikipedia](https://dumps.wikimedia.org/ "WikiPedia")'s entity descriptions, and a word2vec model containing entity embeddings. You can prepare them yourself or obtain our files from [Google Drive](https://drive.google.com/open?id=1v9GD5ezHGbekoLDw5aAzh6-C-QUS-j93) and unzip them to ./data/ .

Then, prepare a data file like ./data/example/example.txt, whose format is: "[idx]\t[category]\t[content]\n".

Finally, modify the variable "dataset = 'example'" at the top of each of the following scripts and run:

```
python tagMe.py
python build_network.py
python build_features.py
python build_data.py
```

# Use HGAT as GNN

If you only want to use the HGAT model as a graph neural network, prepare the following files in the format described above:

    ./model/data/YourData/
        ---- YourData.cites                // the adjacencies
        ---- YourData.content.*            // the features of *, namely node_type1, node_type2, ...
        ---- train.map                     // the indices of the training nodes
        ---- vali.map                      // the indices of the validation nodes
        ---- test.map                      // the indices of the testing nodes

Then change the following variables in "load_data()" in ./model/code/utils.py:

```
type_list = [node_type1, node_type2, ...]
type_have_label = node_type
```

See the code for more details.
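The file formats described under "Prepare for your own dataset" can be checked before training with a few lines of Python. The following is a minimal sketch, not code from this repository: the function names (`parse_edge_line`, `parse_content_line`, `parse_multilabel`, `dangling_endpoints`) are hypothetical, and it assumes that fields are separated by a single tab, that the last tab-separated field is the [category], and that indices are compared as strings.

```python
from typing import Iterable, List, Set, Tuple


def parse_edge_line(line: str) -> Tuple[str, str]:
    """Parse one '.cites' line: two node indices separated by a tab."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        raise ValueError(f"expected 2 tab-separated fields, got {len(fields)}: {line!r}")
    return fields[0], fields[1]


def parse_content_line(line: str) -> Tuple[str, List[float], str]:
    """Parse one '.content.*' line into (idx, features, category).

    Assumption: the first field is the index, the last field is the
    category, and everything in between is a float feature.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected idx, at least one feature, and a category: {line!r}")
    idx, *feature_fields, category = fields
    features = [float(value) for value in feature_fields]
    return idx, features, category


def parse_multilabel(category: str) -> List[int]:
    """Parse a space-delimited one-hot [category] field, e.g. '0 1 1 0 1 0'."""
    return [int(flag) for flag in category.split()]


def dangling_endpoints(edges: Iterable[Tuple[str, str]], known_ids: Set[str]) -> Set[str]:
    """Return edge endpoints that do not appear in any '.content.*' file."""
    missing: Set[str] = set()
    for source, target in edges:
        for node in (source, target):
            if node not in known_ids:
                missing.add(node)
    return missing
```

A typical use would be to read every `YourData.content.*` file with `parse_content_line`, collect the indices into a set, and then pass the edges parsed from `YourData.cites` to `dangling_endpoints`; an empty result means every edge refers to a node that has features. The same set can be used to check the indices listed in the `*.map` files.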
# Citation

If you make use of the HGAT model in your research, please cite the following in your manuscript:

```
@inproceedings{linmei2019heterogeneous,
  title={Heterogeneous graph attention networks for semi-supervised short text classification},
  author={Linmei, Hu and Yang, Tianchi and Shi, Chuan and Ji, Houye and Li, Xiaoli},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
  pages={4823--4832},
  year={2019}
}
```