# cw2vec

## Introduction ##

Paper: [cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information](http://www.statnlp.org/wp-content/uploads/papers/2018/cw2vec/cw2vec.pdf)

Detailed paper summary: [cw2vec: Theory and Implementation (in Chinese)](https://bamtercelboo.github.io/2018/05/11/cw2vec/)

## Requirements ##

- cmake version 3.10.0-rc5
- GNU Make 4.1
- gcc version 5.4.0

## Run Demo ##

- I have uploaded the `word2vec` executable binary to `cw2vec/word2vec/bin` and rewritten `run.sh` for a simple test; you can run `run.sh` directly.
- To recompile, follow *[Building cw2vec using cmake](https://github.com/bamtercelboo/cw2vec#building-cw2vec-using-cmake)*, then run the other models as shown in the *[Example use cases](https://github.com/bamtercelboo/cw2vec#example-use-cases)*.

## Building cw2vec using cmake ##

```
git clone git@github.com:bamtercelboo/cw2vec.git
cd cw2vec && cd word2vec && cd build
cmake ..
make
cd ../bin
```

This will create the `word2vec` binary and all relevant libraries.

## Example use cases ##

The repo implements not only cw2vec (named **substoke**) but also the **skipgram** and **cbow** models of word2vec; in addition, fastText skipgram is implemented (named **subword**). Replace `train.txt` and `feature.txt` with your own training documents.

skipgram:

```
./word2vec skipgram -input train.txt -output skipgram_out -lr 0.025 -dim 100 -ws 5 -epoch 5 -minCount 10 -neg 5 -loss ns -thread 8 -t 1e-4 -lrUpdateRate 100
```

cbow:

```
./word2vec cbow -input train.txt -output cbow_out -lr 0.05 -dim 100 -ws 5 -epoch 5 -minCount 10 -neg 5 -loss ns -thread 8 -t 1e-4 -lrUpdateRate 100
```

subword:

```
./word2vec subword -input train.txt -output subword_out -lr 0.025 -dim 100 -ws 5 -epoch 5 -minCount 10 -neg 5 -loss ns -minn 3 -maxn 6 -thread 8 -t 1e-4 -lrUpdateRate 100
```

substoke:

```
./word2vec substoke -input train.txt -infeature feature.txt -output substoke_out -lr 0.025 -dim 100 -ws 5 -epoch 5 -minCount 10 -neg 5 -loss ns -minn 3 -maxn 18 -thread 8 -t 1e-4 -lrUpdateRate 100
```

## Get Chinese stroke feature ##

The substoke model needs Chinese stroke features (`-infeature`). I have written a script that acquires stroke information for Chinese characters from [handian](http://www.zdic.net/): [extract_zh_char_stoke](https://github.com/bamtercelboo/corpus_process_script/tree/master/extract_zh_char_stoke); see its README for details.

I have also uploaded a stroke-feature file for simplified Chinese, covering a total of 20,901 Chinese characters: [sin_chinese_feature.txt](https://github.com/bamtercelboo/cw2vec/blob/master/Simplified_Chinese_Feature/sin_chinese_feature.txt) in the `Simplified_Chinese_Feature` folder. Alternatively, you can use the script above to generate it yourself.

**The feature file (`feature.txt`) looks like this**:

```
中 丨フ一丨
国 丨フ一一丨一丶一
庆 丶一ノ一ノ丶
假 ノ丨フ一丨一一フ一フ丶
期 一丨丨一一一ノ丶ノフ一一
香 ノ一丨ノ丶丨フ一一
江 丶丶一一丨一
将 丶一丨ノフ丶一丨丶
涌 丶丶一フ丶丨フ一一丨
入 ノ丶
人 ノ丶
潮 丶丶一一丨丨フ一一一丨ノフ一一
......
```

A feature file for testing is provided at `sample/substoke_feature.txt`.
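For intuition, here is a minimal Python sketch of how substoke-style stroke n-grams can be derived from such a feature file. It is an illustration only, not the repo's C++ implementation: the stroke-to-category mapping (horizontal 1, vertical 2, left-falling 3, right-falling 4, turning 5) follows the paper, the fallback to category 5 for unlisted stroke characters is an assumption, and `char_strokes` stands in for a parsed `feature.txt`.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only -- not the repo's implementation.

# Stroke categories from the cw2vec paper; any stroke character not listed
# here is assumed to fall into the turning category (5).
STROKE_TO_ID = {"一": "1", "丨": "2", "ノ": "3", "丶": "4", "フ": "5"}

def stroke_ngrams(char_strokes, word, minn=3, maxn=18):
    """Map a word to its stroke n-grams.

    char_strokes: dict mapping a character to its stroke string,
    i.e. the parsed lines of feature.txt.
    """
    # Concatenate the stroke-category IDs of all characters in the word.
    ids = "".join(STROKE_TO_ID.get(s, "5")
                  for ch in word
                  for s in char_strokes.get(ch, ""))
    # Slide windows of every length from minn to maxn over the ID string.
    return [ids[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(ids) - n + 1)]

# Example using the feature.txt entries shown above:
char_strokes = {"中": "丨フ一丨", "国": "丨フ一一丨一丶一"}
print(stroke_ngrams(char_strokes, "中国", minn=3, maxn=4))
# "中国" -> stroke-ID string "251225121141" -> n-grams "251", "512", ...
```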
## Substoke model output embeddings ##

- In the paper, the context word embeddings are used directly as the final word vectors. However, following the idea of fastText, I also take the n-gram feature vectors of the stroke information into account: the average of a word's stroke n-gram vectors is used as a substitute for its word vector.
- The substoke model therefore produces two outputs:
  - the output ending with `vec` contains the context word vectors;
  - the output ending with `avg` contains the averaged stroke n-gram vectors.

## Word similarity evaluation ##

#### 1. Evaluation script ####

I have written a Chinese word similarity evaluation script: [Chinese-Word-Similarity-and-Word-Analogy](https://github.com/bamtercelboo/Chinese_Word_Similarity_and_Word_Analogy); see its README for details.

#### 2. Parameter settings ####

The parameters are set as follows:

| Parameter | Value |
| --- | --- |
| dim | 100 |
| window size | 5 |
| negative | 5 |
| epoch | 5 |
| minCount | 10 |
| lr | skipgram (0.025), cbow (0.05), substoke (0.025) |
| n-gram | minn = 3, maxn = 18 |

#### 3. Result ####

The experimental results are shown below:

![](https://i.imgur.com/u0O6RoE.jpg)
![](https://i.imgur.com/p4gjsaD.jpg)

## Full documentation ##

Invoke a command without arguments to list available arguments and their default values:

```
./word2vec
usage: word2vec

The commands supported by word2vec are:

  skipgram ------ train word embedding by use skipgram model
  cbow     ------ train word embedding by use cbow model
  subword  ------ train word embedding by use subword(fasttext skipgram) model
  substoke ------ train chinses character embedding by use substoke(cw2vec) model
```

```
./word2vec substoke -h
Train Embedding By Using [substoke] model
Here is the help information!

Usage:

  The following arguments are mandatory:
    -input              training file path
    -infeature          substoke feature file path
    -output             output file path

  The following arguments are optional:
    -verbose            verbosity level [2]

  The following arguments for the dictionary are optional:
    -minCount           minimal number of word occurences default:[10]
    -bucket             number of buckets default:[2000000]
    -minn               min length of char ngram default:[3]
    -maxn               max length of char ngram default:[6]
    -t                  sampling threshold default:[0.001]

  The following arguments for training are optional:
    -lr                 learning rate default:[0.05]
    -lrUpdateRate       change the rate of updates for the learning rate default:[100]
    -dim                size of word vectors default:[100]
    -ws                 size of the context window default:[5]
    -epoch              number of epochs default:[5]
    -neg                number of negatives sampled default:[5]
    -loss               loss function {ns} default:[ns]
    -thread             number of threads default:[1]
    -pretrainedVectors  pretrained word vectors for supervised learning default:[]
    -saveOutput         whether output params should be saved default:[false]
```

## References ##

[1] Cao, Shaosheng, et al. ["cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information."](http://www.statnlp.org/wp-content/uploads/papers/2018/cw2vec/cw2vec.pdf) (2018).

[2] Bojanowski, Piotr, et al. ["Enriching Word Vectors with Subword Information."](https://arxiv.org/pdf/1607.04606.pdf) arXiv preprint arXiv:1607.04606 (2016).

[3] [fastText (GitHub)](https://github.com/facebookresearch/fastText)

[4] [cw2vec: Theory and Implementation (in Chinese)](https://bamtercelboo.github.io/2018/05/11/cw2vec/)

## Question ##

- If you have any questions, you can open an issue or email bamtercelboo@{gmail.com, 163.com}.
- If you have any good suggestions, feel free to open a PR or email me.
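As a quick way to sanity-check a trained output before running the full evaluation script above, here is a minimal Python sketch that loads a vectors file and computes a cosine similarity. It assumes the output follows the standard word2vec/fastText text format (a `vocab_size dim` header line, then one word and its vector per line); the file name `substoke_out.vec` and the word pair are placeholders taken from the examples above, and both words must actually appear in your trained vocabulary.

```python
# -*- coding: utf-8 -*-
# Minimal sanity check, assuming the standard word2vec text format:
# first line "vocab_size dim", then "word v1 v2 ... vdim" per line.
import math

def load_vectors(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "vocab_size dim" header line
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Placeholder file name and words -- adjust them to your own training run.
vecs = load_vectors("substoke_out.vec")
print(cosine(vecs["中国"], vecs["香江"]))
```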