1 Star 1 Fork 0

郭少强/deeplearning-note

加入 Gitee
与超过 1200万 开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)
免费加入
文件
该仓库未声明开源许可证文件(LICENSE),使用请关注具体项目描述及其代码上游依赖。
克隆/下载
BERT.ipynb 59.90 KB
一键复制 编辑 原始数据 按行查看 历史
Mario Cho 提交于 3年前 . Update BERT.ipynb

BERT

URL:

##:MeCab

!apt install aptitude swig
!apt install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
Reading package lists... Done
Building dependency tree       
Reading state information... Done
aptitude is already the newest version (0.8.10-6ubuntu1).
swig is already the newest version (3.0.12-1).
mecab is already installed at the requested version (0.996-5)
libmecab-dev is already installed at the requested version (0.996-5)
mecab-ipadic-utf8 is already installed at the requested version (2.7.0-20070801+main-1)
git is already installed at the requested version (1:2.17.1-1ubuntu0.7)
make is already installed at the requested version (4.1-9.1ubuntu1)
curl is already installed at the requested version (7.58.0-2ubuntu3.8)
xz-utils is already installed at the requested version (5.2.2-1.3)
file is already installed at the requested version (1:5.32-2ubuntu0.3)
mecab is already installed at the requested version (0.996-5)
libmecab-dev is already installed at the requested version (0.996-5)
mecab-ipadic-utf8 is already installed at the requested version (2.7.0-20070801+main-1)
git is already installed at the requested version (1:2.17.1-1ubuntu0.7)
make is already installed at the requested version (4.1-9.1ubuntu1)
curl is already installed at the requested version (7.58.0-2ubuntu3.8)
xz-utils is already installed at the requested version (5.2.2-1.3)
file is already installed at the requested version (1:5.32-2ubuntu0.3)
No packages will be installed, upgraded, or removed.
0 packages upgraded, 0 newly installed, 0 to remove and 29 not upgraded.
Need to get 0 B of archives. After unpacking 0 B will be used.
                            
!pip install mecab-python3
Requirement already satisfied: mecab-python3 in /usr/local/lib/python3.6/dist-packages (0.996.5)
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
!echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -a
fatal: destination path 'mecab-ipadic-neologd' already exists and is not an empty directory.
[install-mecab-ipadic-NEologd] : Start..
[install-mecab-ipadic-NEologd] : Check the existance of libraries
[install-mecab-ipadic-NEologd] :     find => ok
[install-mecab-ipadic-NEologd] :     sort => ok
[install-mecab-ipadic-NEologd] :     head => ok
[install-mecab-ipadic-NEologd] :     cut => ok
[install-mecab-ipadic-NEologd] :     egrep => ok
[install-mecab-ipadic-NEologd] :     mecab => ok
[install-mecab-ipadic-NEologd] :     mecab-config => ok
[install-mecab-ipadic-NEologd] :     make => ok
[install-mecab-ipadic-NEologd] :     curl => ok
[install-mecab-ipadic-NEologd] :     sed => ok
[install-mecab-ipadic-NEologd] :     cat => ok
[install-mecab-ipadic-NEologd] :     diff => ok
[install-mecab-ipadic-NEologd] :     tar => ok
[install-mecab-ipadic-NEologd] :     unxz => ok
[install-mecab-ipadic-NEologd] :     xargs => ok
[install-mecab-ipadic-NEologd] :     grep => ok
[install-mecab-ipadic-NEologd] :     iconv => ok
[install-mecab-ipadic-NEologd] :     patch => ok
[install-mecab-ipadic-NEologd] :     which => ok
[install-mecab-ipadic-NEologd] :     file => ok
[install-mecab-ipadic-NEologd] :     openssl => ok
[install-mecab-ipadic-NEologd] :     awk => ok

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd is already up-to-date

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd will be install to /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd

[install-mecab-ipadic-NEologd] : Make mecab-ipadic-NEologd
[make-mecab-ipadic-NEologd] : Start..
[make-mecab-ipadic-NEologd] : Check local seed directory
[make-mecab-ipadic-NEologd] : Check local seed file
[make-mecab-ipadic-NEologd] : Check local build directory
[make-mecab-ipadic-NEologd] : Download original mecab-ipadic file
[make-mecab-ipadic-NEologd] : Original mecab-ipadic file is already there.
[make-mecab-ipadic-NEologd] : Decompress original mecab-ipadic file
[make-mecab-ipadic-NEologd] : Delete old mecab-ipadic-2.7.0-20070801-neologd-20200430 directory
mecab-ipadic-2.7.0-20070801/
mecab-ipadic-2.7.0-20070801/README
mecab-ipadic-2.7.0-20070801/AUTHORS
mecab-ipadic-2.7.0-20070801/COPYING
mecab-ipadic-2.7.0-20070801/ChangeLog
mecab-ipadic-2.7.0-20070801/INSTALL
mecab-ipadic-2.7.0-20070801/Makefile.am
mecab-ipadic-2.7.0-20070801/Makefile.in
mecab-ipadic-2.7.0-20070801/NEWS
mecab-ipadic-2.7.0-20070801/aclocal.m4
mecab-ipadic-2.7.0-20070801/config.guess
mecab-ipadic-2.7.0-20070801/config.sub
mecab-ipadic-2.7.0-20070801/configure
mecab-ipadic-2.7.0-20070801/configure.in
mecab-ipadic-2.7.0-20070801/install-sh
mecab-ipadic-2.7.0-20070801/missing
mecab-ipadic-2.7.0-20070801/mkinstalldirs
mecab-ipadic-2.7.0-20070801/Adj.csv
mecab-ipadic-2.7.0-20070801/Adnominal.csv
mecab-ipadic-2.7.0-20070801/Adverb.csv
mecab-ipadic-2.7.0-20070801/Auxil.csv
mecab-ipadic-2.7.0-20070801/Conjunction.csv
mecab-ipadic-2.7.0-20070801/Filler.csv
mecab-ipadic-2.7.0-20070801/Interjection.csv
mecab-ipadic-2.7.0-20070801/Noun.adjv.csv
mecab-ipadic-2.7.0-20070801/Noun.adverbal.csv
mecab-ipadic-2.7.0-20070801/Noun.csv
mecab-ipadic-2.7.0-20070801/Noun.demonst.csv
mecab-ipadic-2.7.0-20070801/Noun.nai.csv
mecab-ipadic-2.7.0-20070801/Noun.name.csv
mecab-ipadic-2.7.0-20070801/Noun.number.csv
mecab-ipadic-2.7.0-20070801/Noun.org.csv
mecab-ipadic-2.7.0-20070801/Noun.others.csv
mecab-ipadic-2.7.0-20070801/Noun.place.csv
mecab-ipadic-2.7.0-20070801/Noun.proper.csv
mecab-ipadic-2.7.0-20070801/Noun.verbal.csv
mecab-ipadic-2.7.0-20070801/Others.csv
mecab-ipadic-2.7.0-20070801/Postp-col.csv
mecab-ipadic-2.7.0-20070801/Postp.csv
mecab-ipadic-2.7.0-20070801/Prefix.csv
mecab-ipadic-2.7.0-20070801/Suffix.csv
mecab-ipadic-2.7.0-20070801/Symbol.csv
mecab-ipadic-2.7.0-20070801/Verb.csv
mecab-ipadic-2.7.0-20070801/char.def
mecab-ipadic-2.7.0-20070801/feature.def
mecab-ipadic-2.7.0-20070801/left-id.def
mecab-ipadic-2.7.0-20070801/matrix.def
mecab-ipadic-2.7.0-20070801/pos-id.def
mecab-ipadic-2.7.0-20070801/rewrite.def
mecab-ipadic-2.7.0-20070801/right-id.def
mecab-ipadic-2.7.0-20070801/unk.def
mecab-ipadic-2.7.0-20070801/dicrc
mecab-ipadic-2.7.0-20070801/RESULT
[make-mecab-ipadic-NEologd] : Configure custom system dictionary on /content/mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801-neologd-20200430
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking whether make sets $(MAKE)... yes
checking for working aclocal-1.4... missing
checking for working autoconf... missing
checking for working automake-1.4... missing
checking for working autoheader... missing
checking for working makeinfo... missing
checking for a BSD-compatible install... /usr/bin/install -c
checking for mecab-config... /usr/bin/mecab-config
configure: creating ./config.status
config.status: creating Makefile
[make-mecab-ipadic-NEologd] : Encode the character encoding of system dictionary resources from EUC_JP to UTF-8
./../../libexec/iconv_euc_to_utf8.sh ./Adverb.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Interjection.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.org.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.place.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Auxil.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Symbol.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Prefix.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Verb.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.proper.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.name.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Conjunction.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Others.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Filler.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.nai.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Suffix.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.verbal.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Postp.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.demonst.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Adnominal.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Postp-col.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Adj.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.number.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.adjv.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.others.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.adverbal.csv 
rm ./Adverb.csv 
rm ./Interjection.csv 
rm ./Noun.org.csv 
rm ./Noun.place.csv 
rm ./Auxil.csv 
rm ./Symbol.csv 
rm ./Prefix.csv 
rm ./Verb.csv 
rm ./Noun.proper.csv 
rm ./Noun.name.csv 
rm ./Conjunction.csv 
rm ./Others.csv 
rm ./Filler.csv 
rm ./Noun.csv 
rm ./Noun.nai.csv 
rm ./Suffix.csv 
rm ./Noun.verbal.csv 
rm ./Postp.csv 
rm ./Noun.demonst.csv 
rm ./Adnominal.csv 
rm ./Postp-col.csv 
rm ./Adj.csv 
rm ./Noun.number.csv 
rm ./Noun.adjv.csv 
rm ./Noun.others.csv 
rm ./Noun.adverbal.csv 
./../../libexec/iconv_euc_to_utf8.sh ./pos-id.def 
./../../libexec/iconv_euc_to_utf8.sh ./feature.def 
./../../libexec/iconv_euc_to_utf8.sh ./right-id.def 
./../../libexec/iconv_euc_to_utf8.sh ./char.def 
./../../libexec/iconv_euc_to_utf8.sh ./left-id.def 
./../../libexec/iconv_euc_to_utf8.sh ./matrix.def 
./../../libexec/iconv_euc_to_utf8.sh ./rewrite.def 
./../../libexec/iconv_euc_to_utf8.sh ./unk.def 
rm ./pos-id.def 
rm ./feature.def 
rm ./right-id.def 
rm ./char.def 
rm ./left-id.def 
rm ./matrix.def 
rm ./rewrite.def 
rm ./unk.def 
mv ./left-id.def.utf8 ./left-id.def 
mv ./Others.csv.utf8 ./Others.csv 
mv ./Postp.csv.utf8 ./Postp.csv 
mv ./char.def.utf8 ./char.def 
mv ./Filler.csv.utf8 ./Filler.csv 
mv ./Noun.csv.utf8 ./Noun.csv 
mv ./Symbol.csv.utf8 ./Symbol.csv 
mv ./Prefix.csv.utf8 ./Prefix.csv 
mv ./Adj.csv.utf8 ./Adj.csv 
mv ./Noun.name.csv.utf8 ./Noun.name.csv 
mv ./Interjection.csv.utf8 ./Interjection.csv 
mv ./matrix.def.utf8 ./matrix.def 
mv ./pos-id.def.utf8 ./pos-id.def 
mv ./Noun.proper.csv.utf8 ./Noun.proper.csv 
mv ./Noun.org.csv.utf8 ./Noun.org.csv 
mv ./Auxil.csv.utf8 ./Auxil.csv 
mv ./Noun.adjv.csv.utf8 ./Noun.adjv.csv 
mv ./Noun.place.csv.utf8 ./Noun.place.csv 
mv ./Noun.nai.csv.utf8 ./Noun.nai.csv 
mv ./Adverb.csv.utf8 ./Adverb.csv 
mv ./Adnominal.csv.utf8 ./Adnominal.csv 
mv ./Verb.csv.utf8 ./Verb.csv 
mv ./Noun.demonst.csv.utf8 ./Noun.demonst.csv 
mv ./Noun.verbal.csv.utf8 ./Noun.verbal.csv 
mv ./Noun.number.csv.utf8 ./Noun.number.csv 
mv ./Postp-col.csv.utf8 ./Postp-col.csv 
mv ./rewrite.def.utf8 ./rewrite.def 
mv ./Noun.others.csv.utf8 ./Noun.others.csv 
mv ./right-id.def.utf8 ./right-id.def 
mv ./unk.def.utf8 ./unk.def 
mv ./Conjunction.csv.utf8 ./Conjunction.csv 
mv ./Noun.adverbal.csv.utf8 ./Noun.adverbal.csv 
mv ./feature.def.utf8 ./feature.def 
mv ./Suffix.csv.utf8 ./Suffix.csv 
[make-mecab-ipadic-NEologd] : Fix yomigana field of IPA dictionary
patching file Noun.csv
patching file Noun.place.csv
patching file Verb.csv
patching file Noun.verbal.csv
patching file Noun.name.csv
patching file Noun.adverbal.csv
patching file Noun.csv
patching file Noun.name.csv
patching file Noun.org.csv
patching file Noun.others.csv
patching file Noun.place.csv
patching file Noun.proper.csv
patching file Noun.verbal.csv
patching file Prefix.csv
patching file Suffix.csv
patching file Noun.proper.csv
patching file Noun.csv
patching file Noun.name.csv
patching file Noun.org.csv
patching file Noun.place.csv
patching file Noun.proper.csv
patching file Noun.verbal.csv
patching file Noun.name.csv
patching file Noun.org.csv
patching file Noun.place.csv
patching file Noun.proper.csv
patching file Suffix.csv
patching file Noun.demonst.csv
patching file Noun.csv
patching file Noun.name.csv
[make-mecab-ipadic-NEologd] : Copy user dictionary resource
[make-mecab-ipadic-NEologd] : Install adverb entries using /content/mecab-ipadic-neologd/libexec/../seed/neologd-adverb-dict-seed.20150623.csv.xz
[make-mecab-ipadic-NEologd] : Install interjection entries using /content/mecab-ipadic-neologd/libexec/../seed/neologd-interjection-dict-seed.20170216.csv.xz
[make-mecab-ipadic-NEologd] : Install noun orthographic variant entries using /content/mecab-ipadic-neologd/libexec/../seed/neologd-common-noun-ortho-variant-dict-seed.20170228.csv.xz
[make-mecab-ipadic-NEologd] : Install noun orthographic variant entries using /content/mecab-ipadic-neologd/libexec/../seed/neologd-proper-noun-ortho-variant-dict-seed.20161110.csv.xz
[make-mecab-ipadic-NEologd] : Install entries of orthographic variant of a noun used as verb form using /content/mecab-ipadic-neologd/libexec/../seed/neologd-noun-sahen-conn-ortho-variant-dict-seed.20160323.csv.xz
[make-mecab-ipadic-NEologd] : Install frequent adjective orthographic variant entries using /content/mecab-ipadic-neologd/libexec/../seed/neologd-adjective-std-dict-seed.20151126.csv.xz
[make-mecab-ipadic-NEologd] : Install infrequent adjective orthographic variant entries using /content/mecab-ipadic-neologd/libexec/../seed/neologd-adjective-exp-dict-seed.20151126.csv.xz
[make-mecab-ipadic-NEologd] : Install adjective verb orthographic variant entries using /content/mecab-ipadic-neologd/libexec/../seed/neologd-adjective-verb-dict-seed.20160324.csv.xz
[make-mecab-ipadic-NEologd] : Install infrequent datetime representation entries using /content/mecab-ipadic-neologd/libexec/../seed/neologd-date-time-infreq-dict-seed.20190415.csv.xz
[make-mecab-ipadic-NEologd] : Install infrequent quantity representation entries using /content/mecab-ipadic-neologd/libexec/../seed/neologd-quantity-infreq-dict-seed.20190415.csv.xz
[make-mecab-ipadic-NEologd] : Install entries of ill formed words using /content/mecab-ipadic-neologd/libexec/../seed/neologd-ill-formed-words-dict-seed.20170127.csv.xz
[make-mecab-ipadic-NEologd] : Re-Index system dictionary
reading ./unk.def ... 40
emitting double-array: 100% |###########################################| 
./model.def is not found. skipped.
reading ./neologd-common-noun-ortho-variant-dict-seed.20170228.csv ... 152869
reading ./Adverb.csv ... 3032
reading ./Interjection.csv ... 252
reading ./Noun.org.csv ... 17149
reading ./neologd-adjective-std-dict-seed.20151126.csv ... 507812
reading ./Noun.place.csv ... 73194
reading ./Auxil.csv ... 199
reading ./Symbol.csv ... 208
reading ./neologd-proper-noun-ortho-variant-dict-seed.20161110.csv ... 138379
reading ./neologd-quantity-infreq-dict-seed.20190415.csv ... 229216
reading ./Prefix.csv ... 224
reading ./neologd-date-time-infreq-dict-seed.20190415.csv ... 16866
reading ./Verb.csv ... 130750
reading ./Noun.proper.csv ... 27493
reading ./Noun.name.csv ... 34215
reading ./Conjunction.csv ... 171
reading ./neologd-adverb-dict-seed.20150623.csv ... 139792
reading ./Others.csv ... 2
reading ./Filler.csv ... 19
reading ./Noun.csv ... 60734
reading ./neologd-ill-formed-words-dict-seed.20170127.csv ... 60616
reading ./Noun.nai.csv ... 42
reading ./Suffix.csv ... 1448
reading ./neologd-adjective-verb-dict-seed.20160324.csv ... 20268
reading ./Noun.verbal.csv ... 12150
reading ./neologd-interjection-dict-seed.20170216.csv ... 4701
reading ./Postp.csv ... 146
reading ./Noun.demonst.csv ... 120
reading ./Adnominal.csv ... 135
reading ./Postp-col.csv ... 91
reading ./Adj.csv ... 27210
reading ./mecab-user-dict-seed.20200430.csv ... 3190186
reading ./neologd-adjective-exp-dict-seed.20151126.csv ... 1051146
reading ./Noun.number.csv ... 42
reading ./Noun.adjv.csv ... 3328
reading ./Noun.others.csv ... 153
reading ./Noun.adverbal.csv ... 808
reading ./neologd-noun-sahen-conn-ortho-variant-dict-seed.20160323.csv ... 26058
emitting double-array: 100% |###########################################| 
reading ./matrix.def ... 1316x1316
emitting matrix      : 100% |###########################################| 

done!
[make-mecab-ipadic-NEologd] : Make custom system dictionary on /content/mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801-neologd-20200430
make: Nothing to be done for 'all'.
[make-mecab-ipadic-NEologd] : Finish..
[install-mecab-ipadic-NEologd] : Get results of tokenize test
[test-mecab-ipadic-NEologd] : Start..
[test-mecab-ipadic-NEologd] : Replace timestamp from 'git clone' date to 'git commit' date
[test-mecab-ipadic-NEologd] : Get buzz phrases
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1573  100  1573    0     0   1237      0  0:00:01  0:00:01 --:--:--  1237
[test-mecab-ipadic-NEologd] : Get difference between default system dictionary and mecab-ipadic-NEologd
[test-mecab-ipadic-NEologd] : Tokenize phrase using default system dictionary
[test-mecab-ipadic-NEologd] : Tokenize phrase using mecab-ipadic-NEologd
[test-mecab-ipadic-NEologd] : Get result of diff
[test-mecab-ipadic-NEologd] : Please check difference between default system dictionary and mecab-ipadic-NEologd

default system dictionary	  |	mecab-ipadic-NEologd

[test-mecab-ipadic-NEologd] : Finish..

[install-mecab-ipadic-NEologd] : Please check the list of differences in the upper part.

[install-mecab-ipadic-NEologd] : Do you want to install mecab-ipadic-NEologd? Type yes or no.
[install-mecab-ipadic-NEologd] : OK. Let's install mecab-ipadic-NEologd.
[install-mecab-ipadic-NEologd] : Start..
[install-mecab-ipadic-NEologd] : /usr/lib/x86_64-linux-gnu/mecab/dic is current user's directory
[install-mecab-ipadic-NEologd] : Make install to /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd
make[1]: Entering directory '/content/mecab-ipadic-neologd/build/mecab-ipadic-2.7.0-20070801-neologd-20200430'
make[1]: Nothing to be done for 'install-exec-am'.
/bin/bash ./mkinstalldirs /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd
 /usr/bin/install -c -m 644 ./matrix.bin /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/matrix.bin
 /usr/bin/install -c -m 644 ./char.bin /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/char.bin
 /usr/bin/install -c -m 644 ./sys.dic /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/sys.dic
 /usr/bin/install -c -m 644 ./unk.dic /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/unk.dic
 /usr/bin/install -c -m 644 ./left-id.def /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/left-id.def
 /usr/bin/install -c -m 644 ./right-id.def /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/right-id.def
 /usr/bin/install -c -m 644 ./rewrite.def /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/rewrite.def
 /usr/bin/install -c -m 644 ./pos-id.def /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/pos-id.def
 /usr/bin/install -c -m 644 ./dicrc /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/dicrc
make[1]: Leaving directory '/content/mecab-ipadic-neologd/build/mecab-ipadic-2.7.0-20070801-neologd-20200430'

[install-mecab-ipadic-NEologd] : Install completed.
[install-mecab-ipadic-NEologd] : When you use MeCab, you can set '/usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd' as a value of '-d' option of MeCab.
[install-mecab-ipadic-NEologd] : Usage of mecab-ipadic-NEologd is here.
Usage:
    $ mecab -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd ...

[install-mecab-ipadic-NEologd] : Finish..
[install-mecab-ipadic-NEologd] : Finish..
import subprocess

cmd='echo `mecab-config --dicdir`"/mecab-ipadic-neologd"'
path_neologd = (subprocess.Popen(cmd, stdout=subprocess.PIPE,
                           shell=True).communicate()[0]).decode('utf-8')

2:MeCab

import MeCab

m=MeCab.Tagger("-Ochasen")

text = "예제。"

text_segmented = m.parse(text)
print(text_segmented)
EOS

m=MeCab.Tagger("-Owakati")
text_segmented = m.parse(text)
print(text_segmented)
 

m=MeCab.Tagger("-Oyomi")
text_segmented = m.parse(text)
print(text_segmented)

m=MeCab.Tagger("-Ochasen -d "+str(path_neologd))  # NEologd

text = ""

text_segmented = m.parse(text)
print(text_segmented)
EOS

m=MeCab.Tagger("-Owakati -d "+str(path_neologd))  # NEologdへのパスを追加
text_segmented = m.parse(text)
print(text_segmented)

3:BERT

!pip install transformers==2.9.0
Requirement already satisfied: transformers==2.9.0 in /usr/local/lib/python3.6/dist-packages (2.9.0)
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from transformers==2.9.0) (2.23.0)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.6/dist-packages (from transformers==2.9.0) (4.41.1)
Requirement already satisfied: sacremoses in /usr/local/lib/python3.6/dist-packages (from transformers==2.9.0) (0.0.43)
Requirement already satisfied: sentencepiece in /usr/local/lib/python3.6/dist-packages (from transformers==2.9.0) (0.1.86)
Requirement already satisfied: filelock in /usr/local/lib/python3.6/dist-packages (from transformers==2.9.0) (3.0.12)
Requirement already satisfied: tokenizers==0.7.0 in /usr/local/lib/python3.6/dist-packages (from transformers==2.9.0) (0.7.0)
Requirement already satisfied: dataclasses; python_version < "3.7" in /usr/local/lib/python3.6/dist-packages (from transformers==2.9.0) (0.7)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.6/dist-packages (from transformers==2.9.0) (2019.12.20)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from transformers==2.9.0) (1.18.4)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->transformers==2.9.0) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->transformers==2.9.0) (2020.4.5.1)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->transformers==2.9.0) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->transformers==2.9.0) (2.9)
Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers==2.9.0) (0.14.1)
Requirement already satisfied: click in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers==2.9.0) (7.1.2)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers==2.9.0) (1.12.0)
import torch
from transformers.modeling_bert import BertModel
from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
# 分かち書きをするtokenizerです
tokenizer = BertJapaneseTokenizer.from_pretrained('bert-base-japanese-whole-word-masking')
# BERT
model = BertModel.from_pretrained('bert-base-japanese-whole-word-masking')
print(model)
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(32000, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (2): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (3): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (4): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (5): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (6): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (7): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (8): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (9): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (10): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (11): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)
from transformers import BertConfig

# 
config_japanese = BertConfig.from_pretrained('bert-base-japanese-whole-word-masking')
print(config_japanese)
BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 32000
}

BERT

text1 = ""
text2 = ""
text3 = ""
# 
input_ids1 = tokenizer.encode(text1, return_tensors='pt')  # ptはPyTorchの略

print(tokenizer.convert_ids_to_tokens(input_ids1[0].tolist()))  # 文章
print(input_ids1)  # id
['[CLS]', ]
tensor([[    2,   811,    11, 13700,     7,    58,    10,     8,     3]])
# 
input_ids2 = tokenizer.encode(text2, return_tensors='pt')  # ptはPyTorch

print(tokenizer.convert_ids_to_tokens(input_ids2[0].tolist()))  # 
print(input_ids2)  # id
['[CLS]', '[SEP]']
tensor([[    2,  5521,  3118,  4027,    12, 13700,    14,  4897, 28457,     8,
             3]])
# id
input_ids3 = tokenizer.encode(text3, return_tensors='pt')  # ptはPyTorchの略

print(tokenizer.convert_ids_to_tokens(input_ids3[0].tolist()))  # 文章
print(input_ids3)  # id
['[CLS]', '[SEP]']
tensor([[   2,  811,   11, 7279,   26,   20,   10,    8,    3]])
# BERT
result1 = model(input_ids1)

print(result1[0].shape)
print(result1[1].shape)

# reult sequence_output, pooled_output, (hidden_states), (attentions)。
# 。
torch.Size([1, 9, 768])
torch.Size([1, 768])
# ERT
result2 = model(input_ids2)
result3 = model(input_ids3)

word_vec1 = result1[0][0][3][:]  # 1
word_vec2 = result2[0][0][5][:]  # 2
word_vec3 = result3[0][0][3][:]  # 3ㅅ
# 
cos = torch.nn.CosineSimilarity(dim=0)
cos_sim_12 = cos(word_vec1, word_vec2)
cos_sim_13 = cos(word_vec1, word_vec3)

print(cos_sim_12)
print(cos_sim_13)
tensor(0.6647, grad_fn=<DivBackward0>)
tensor(0.7841, grad_fn=<DivBackward0>)
Loading...
马建仓 AI 助手
尝试更多
代码解读
代码找茬
代码优化
1
https://gitee.com/guo_shaoqiang/deeplearning-note.git
git@gitee.com:guo_shaoqiang/deeplearning-note.git
guo_shaoqiang
deeplearning-note
deeplearning-note
master

搜索帮助