# final-text-classification

**Repository Path**: liam1030/final-text-classification

## Basic Information

- **Project Name**: final-text-classification
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2018-12-22
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# CSDA Final Project - Text Classification

## Week1

1. Finished the data-handling pipeline: preprocessing, exporting predicted labels, etc.
2. Briefly explored the data distribution and trained word2vec (w2v) with gensim.
3. Trained several deep models: FastText, TextCNN, attention-based Bi-LSTM, and HAN (without pretrained word2vec).
4. Next steps: understand the principles and code of each model; build the ensemble pipeline (average the logits of the different models); finish word2vec training and retrain the models with the pretrained w2v embeddings; try BERT, and try applying BERT's core ideas to the other models (replacing the Transformer inside BERT with other models).

### Validation-set results of each model

**TextCNN results**: after 750 epochs, the validation- and test-set results are shown below. "micro" aggregates the statistics over all classes together, while "macro" computes the metric per class and then averages the per-class results.

```sh
Epoch 749 Validation Loss:1.998 F1 Score:0.819 F1_micro:0.809 F1_macro:0.829
micro precison: 0.8326359484252741 ;recall: 0.8361344186498143 ;f1_score: 0.8343765164300705
macro precison: 0.8906248608398655 ;recall: 0.949999841666693 ;f1_score: 0.919349695656711
macro precison: 0.7213113571620725 ;recall: 0.8148146639232103 ;f1_score: 0.7652122767810275
macro precison: 0.8717946482577824 ;recall: 0.6938774094127735 ;f1_score: 0.7727221617050868
macro precison: 0.9374994140628662 ;recall: 0.882352422145634 ;f1_score: 0.9090853627458678
macro precison: 0.941175916955343 ;recall: 0.941175916955343 ;f1_score: 0.9411709169819052
macro precison: 0.6666655555574075 ;recall: 0.571427755103207 ;f1_score: 0.6153786982663629
macro precison: 0.6249996093752441 ;recall: 0.7692301775152481 ;f1_score: 0.6896497503329974
macro precison: 0.8333319444467593 ;recall: 0.7142846938790088 ;f1_score: 0.7692246154184614
macro precison: 0.9999985714306123 ;recall: 0.9999985714306123 ;f1_score: 0.9999935714556122
macro precison: 0.9999980000040001 ;recall: 0.9999980000040001 ;f1_score: 0.999993000029
macro precison: 0.9999950000249999 ;recall: 0.9999950000249999 ;f1_score: 0.9999900000499999
Test Loss:1.936 F1 Score:0.844 F1_micro:0.834 F1_macro:0.853
Test accuracy : 85.156250 %
```
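To make the micro/macro distinction above concrete, here is a small illustrative sketch using scikit-learn; the labels are made up and this is not the project's evaluation code.

```py
# Illustrative only: how micro vs. macro F1 differ on a tiny, made-up multi-class example.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

# micro: pool TP/FP/FN over all classes before computing F1
print(f1_score(y_true, y_pred, average="micro"))
# macro: compute F1 per class, then take the unweighted mean
print(f1_score(y_true, y_pred, average="macro"))
```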
**FastText results**: after 600 epochs, the validation- and test-set results are shown below.

```sh
Epoch 599 Validation Loss:1.787 Validation Accuracy: 0.773
Going to save checkpoint.
number_examples for validation: 239
test_loss 1.759413429919411 test_acc 0.8277310924369747
```

**Adversarial Training Methods for Supervised Text Classification results**: after 50 epochs, the validation- and test-set results are shown below.

```sh
Epoch 50 start !
Train Epoch time: 20.113 s
validation accuracy: 0.806
Training finished, time consumed : 1305.248138666153 s
Start evaluating:
Test accuracy : 84.375000 %
```

**Hierarchical Attention Networks for Text Classification results**: after 20 epochs, the validation- and test-set results are shown below.

```sh
Epoch 20 start !
Validation accuracy: [0.75727747, 0.17591016]
epoch finished, time consumed : 7.117572546005249 s
Training finished, time consumed : 129.84537863731384 s
Start evaluating:
Test accuracy : 82.031250 %
```

**attn_bi_lstm results**: after 100 epochs, the validation- and test-set results are shown below.

```sh
Epoch 100 start !
Train Epoch time: 40.141 s
validation accuracy: 0.845
Training finished, time consumed : 4088.4550220966339 s
Start evaluating:
Test accuracy : 80.625000 %
```

**independently_rnn_tc results**: after 100 epochs, the validation- and test-set results are shown below.

```sh
Epoch 100 start !
Validation accuracy: [0.8128342, 0.22283895]
Epoch time: 5.67938494682312 s
Training finished, time consumed : 581.5408520698547 s
Start evaluating:
Test accuracy : 82.031250 %
```

**Attention Is All You Need results**: after 50 epochs, the validation- and test-set results are shown below.

```sh
Epoch 50 start !
Train Epoch time: 8.564 s
validation accuracy: 0.644
Training finished, time consumed : 433.11831068992615 s
Start evaluating:
Test accuracy : 70.625000 %
```

### Week1 work details

**Word segmentation and exporting the training-format txt**:

1. Segment the sentences with jieba, then export the segmented sequences together with their labels to a text file; see [pre-processing](https://gitee.com/liam1030/final-text-classification/blob/master/pre-processing_mydata.ipynb) for details.

```py
import jieba
import pandas as pd

# base_path is defined earlier in the notebook
train_data = pd.read_csv(base_path + 'training.csv', encoding="utf-8", header=None)
train_data.columns = ['label', 'sentence']

# jieba word segmentation: join the tokens of each sentence with spaces
words = [None] * len(train_data)
i = 0
for row in train_data['sentence']:
    seg_list = jieba.cut(row)
    words[i] = " ".join(seg_list)
    # print(i, words[i])
    i += 1
train_data['words'] = words
```
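The exported txt is later parsed by splitting each line on `__label__` (see `create_vocabulary` and `load_data_multilabel` below). A minimal sketch of that export step, assuming the same `train_data`/`base_path` as above, might look like the following; the output file name and exact line layout are assumptions, not copied from the notebook.

```py
# Hypothetical sketch: write "segmented words __label__<label>" lines,
# matching the format that create_vocabulary/load_data_multilabel split on.
out_path = base_path + 'train_jieba.txt'  # assumed file name
with open(out_path, 'w', encoding='utf-8') as f:
    for _, r in train_data.iterrows():
        f.write(r['words'].strip() + ' __label__' + str(r['label']) + '\n')
```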
**Build dictionaries mapping words/labels to indices (and back); split the train/valid/test sets and output X, Y**:

1. Build the vocabulary_word2index, vocabulary_index2word, vocabulary_label2index and vocabulary_index2label dictionaries.

```py
import os
import codecs
import pickle
from collections import Counter

# _PAD/_UNK and PAD_ID/UNK_ID are module-level constants defined elsewhere in the project.
def create_vocabulary(training_data_path, vocab_size, name_scope='cnn'):
    cache_vocabulary_label_pik = 'cache' + "_" + name_scope  # path to save cache
    if not os.path.isdir(cache_vocabulary_label_pik):  # create folder if it does not exist
        os.makedirs(cache_vocabulary_label_pik)

    # if the cache exists, load it; otherwise create it
    cache_path = cache_vocabulary_label_pik + "/" + 'vocab_label.pik'
    print("cache_path:", cache_path, "file_exists:", os.path.exists(cache_path))
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as data_f:
            return pickle.load(data_f)
    else:
        vocabulary_word2index = {}
        vocabulary_index2word = {}
        vocabulary_word2index[_PAD] = PAD_ID
        vocabulary_index2word[PAD_ID] = _PAD
        vocabulary_word2index[_UNK] = UNK_ID
        vocabulary_index2word[UNK_ID] = _UNK

        vocabulary_label2index = {}
        vocabulary_index2label = {}

        # 1. load raw data
        file_object = codecs.open(training_data_path, mode='r', encoding='utf-8')
        lines = file_object.readlines()
        # 2. loop over each line and update the counters
        c_inputs = Counter()
        c_labels = Counter()
        for line in lines:
            raw_list = line.strip().split("__label__")
            input_list = raw_list[0].strip().split(" ")
            input_list = [x.strip().replace(" ", "") for x in input_list if x != '']
            label_list = [l.strip().replace(" ", "") for l in raw_list[1:] if l != '']
            c_inputs.update(input_list)
            c_labels.update(label_list)
        # keep the most frequent words
        vocab_list = c_inputs.most_common(vocab_size)
        label_list = c_labels.most_common()
        # put those words and labels into the dictionaries
        for i, tuplee in enumerate(vocab_list):
            word, _ = tuplee
            vocabulary_word2index[word] = i + 2
            vocabulary_index2word[i + 2] = word
        for i, tuplee in enumerate(label_list):
            label, _ = tuplee
            label = str(label)
            vocabulary_label2index[label] = i
            vocabulary_index2label[i] = label

        # save to the file system if the vocabulary cache does not exist
        if not os.path.exists(cache_path):
            with open(cache_path, 'ab') as data_f:
                pickle.dump((vocabulary_word2index, vocabulary_index2word,
                             vocabulary_label2index, vocabulary_index2label), data_f)
    return vocabulary_word2index, vocabulary_index2word, vocabulary_label2index, vocabulary_index2label
```

2. Split the data set and generate X, Y.

```py
import random
import codecs

# UNK_ID, pad_sequences and transform_multilabel_as_multihot are defined/imported
# elsewhere in the project.
def load_data_multilabel(traning_data_path, vocab_word2index, vocab_label2index,
                         sentence_len, training_portion=0.9):
    file_object = codecs.open(traning_data_path, mode='r', encoding='utf-8')
    lines = file_object.readlines()
    random.seed(123456)
    random.shuffle(lines)
    label_size = len(vocab_label2index)
    X = []
    Y = []
    sentence = []
    for i, line in enumerate(lines):
        raw_list = line.strip().split("__label__")
        input_list = raw_list[0].strip().split(" ")
        input_list = [x.strip().replace(" ", "") for x in input_list if x != '']
        x = [vocab_word2index.get(x, UNK_ID) for x in input_list]
        label_list = raw_list[1:]
        label_list = [l.strip().replace(" ", "") for l in label_list if l != '']
        label_list = [vocab_label2index[label] for label in label_list]
        y = transform_multilabel_as_multihot(label_list, label_size)
        if i < 10: print(raw_list[1], label_list, line)
        X.append(x)
        Y.append(y)
        sentence.append(line)
        # if i < 10: print(i, "line:", line)
    X = pad_sequences(X, maxlen=sentence_len, value=0.)  # padding to max length
    number_examples = len(lines)
    training_number = int(training_portion * number_examples)
    train = (X[0:training_number], Y[0:training_number], sentence[0:training_number])
    valid_number = min(1000, (number_examples - training_number) // 2)
    test_number = valid_number
    test = (X[training_number:training_number + valid_number],
            Y[training_number:training_number + valid_number],
            sentence[training_number:training_number + valid_number])
    valid = (X[training_number + valid_number:],
             Y[training_number + valid_number:],
             sentence[training_number + valid_number:])
    return train, test, valid
```

**Training word2vec**:

```py
import pandas as pd
from tqdm import tqdm
from gensim.models.word2vec import Word2Vec

# train word vectors on the segmented train + test corpus
def train_w2v_model(min_freq=5, size=128):
    sentences = []
    corpus = pd.concat((train_data['words'], test_data['words']))
    for e in tqdm(corpus):
        sentences.append([i for i in e.strip().split() if i])
    print('corpus size:', len(corpus))
    print('total sentences:', len(sentences))

    model = Word2Vec(sentences, size=size, window=5, min_count=min_freq)
    model.itos = {}
    model.stoi = {}
    model.embedding = {}

    print('saving model...')
    for k in tqdm(model.wv.vocab.keys()):
        model.itos[model.wv.vocab[k].index] = k
        model.stoi[k] = model.wv.vocab[k].index
        model.embedding[model.wv.vocab[k].index] = model.wv[k]

    model.save('data/word2vec-models/word2vec_tc')
    return model

model = train_w2v_model(size=128)
model.wv.save_word2vec_format('data/word2vec-models/word2vec_tc.bin', binary=True)
# train_df[:3]
print('OK')
```
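A planned next step is to feed the pretrained word2vec into the models. A minimal sketch of how that could look, assuming the `vocabulary_word2index` dictionary from `create_vocabulary` and the gensim model saved above (the variable names and the random initialization for out-of-vocabulary words are assumptions, not the project's actual code):

```py
# Hypothetical sketch: build an embedding matrix aligned with vocabulary_word2index,
# suitable for initializing a model's embedding layer.
import numpy as np
from gensim.models.word2vec import Word2Vec

w2v = Word2Vec.load('data/word2vec-models/word2vec_tc')
embed_size = w2v.vector_size
embedding_matrix = np.random.uniform(-0.1, 0.1, (len(vocabulary_word2index), embed_size))
for word, idx in vocabulary_word2index.items():
    if word in w2v.wv:  # words missing from word2vec keep their random initialization
        embedding_matrix[idx] = w2v.wv[word]
# e.g. in TensorFlow 1.x: tf.get_variable("embedding",
#                                         initializer=embedding_matrix.astype(np.float32))
```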