# Text-Classify

**Repository Path**: ll533/text-classify

## Basic Information

- **Project Name**: Text-Classify
- **Description**: Classification of a one-million-article (100w) news dataset, based on SVM and Naive Bayes
- **Primary Language**: Python
- **License**: MulanPSL-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-02-06
- **Last Updated**: 2021-02-06

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Text-Classify

#### Introduction

Classification of a one-million-article (100w) news dataset, based on SVM and Naive Bayes.

Report date: 26 November 2020

#### Objectives

- Learn data preprocessing methods and preprocess the training data.
- Learn text modelling methods and model the documents in the corpus.
- Understand how classification algorithms work and train a text classifier with supervised machine learning.
- Use the trained classifier to classify unseen text.
- Learn how to evaluate classifier performance.

#### Requirements

- At least 10 text categories.
- At least 500,000 training documents, averaging 50,000 per category; at least 500,000 test documents, averaging 50,000 per category.
- Group experiment, with at most 3 members per group.

#### Experiment

##### 3.1 Crawling the text data

The report names Sina News (新浪新闻网) as the crawled site; the crawler code in the Code section requests pages from www.chinanews.com. The crawler collected 10 news categories: auto, mil, sh, cj, cul, gj, gn, tw, ty, yl. Experiments showed that these categories were not distinct enough from one another, which hurt prediction quality, so I brought in three categories from the THUCNews dataset (educate, fangchan, technology) to replace cul, gn and gj. Because adjacent news articles tend to be similar, the data was split into a training set and a test set by alternating records, with 50,000 articles per category in each set.

The crawler is built mainly on the requests and BeautifulSoup libraries and works as follows:

1. Build the URLs of all scroll-news pages from 2011 to 2020 and save the URL of every article listed on each page.
2. Go through all saved URLs and, using the category information contained in each URL, write them into 10 separate txt files, one per category.
3. Go through the 10 category URL files, fetch the article behind each link and store it in the database table for that category. This step is slow, and some URLs are dead or return pages with broken encodings.
4. Once everything is in the database, export the articles to local txt files. During the export, odd-numbered articles are saved as the test set and even-numbered articles as the training set.

##### 3.2 Data preprocessing

First, stop words are removed from every article. jieba is used for word segmentation, and stop words are filtered according to the file stop_words_ch.txt.

The `Bunch` class from `sklearn.datasets.base` (exposed as `sklearn.utils.Bunch` in current scikit-learn releases) makes it easy to define the data structure of the dataset:

`bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])`

Here `target_name` lists the categories in the dataset, `contents` holds the text of each article, `filenames` holds the path of the txt file each article was read from, and `label` holds the label of each article. The txt files are read to fill the bunch, which is then saved to disk with pickle as trainData.dat. The test set testData.dat is produced the same way.

##### 3.3 Naive Bayes

Naive Bayes is based on Bayes' theorem:

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$$

Applied to text classification, suppose the set of categories is $C = \{C_1, C_2, \dots, C_{10}\}$. The probability that document $D$ belongs to category $C_i$ is then

$$P(C_i \mid D) = \frac{P(D \mid C_i)\,P(C_i)}{P(D)} = \frac{P(C_i)}{P(D)} \cdot P(D \mid C_i)$$

Because the categories are balanced, $P(C_i) = 1/10$ for every category, and $P(D)$ is the same for all categories, so comparing $P(C_1 \mid D), \dots, P(C_{10} \mid D)$ only requires computing $P(D \mid C_i)$.

Suppose the feature set $X$ of document $D$ has $n$ features, $X = \{x_1, x_2, \dots, x_n\}$. Under the naive assumption that features are conditionally independent given the category, $P(D \mid C_i)$ is the product

$$P(D \mid C_i) = \prod_{j=1}^{n} P(x_j \mid C_i)$$

which in practice is evaluated as a sum of logarithms to avoid numerical underflow. Let

$$P(C_k \mid D) = \max\{P(C_1 \mid D), P(C_2 \mid D), \dots, P(C_{10} \mid D)\}$$

Then document $D$ is assigned to category $C_k$.

First, every document is converted into a feature vector (feature extraction). The best-known feature extraction method in text classification is the vector space model (VSM), which turns each sample into a vector. This takes two steps: fixing the feature set, which is simply the dictionary, and extracting the features. This experiment uses TF*IDF as the feature weight. TF (term frequency) is how often a term occurs in document $d$. IDF (inverse document frequency) captures the idea that the fewer documents contain a term $t$, that is, the smaller $n$ is, the larger the IDF and the better $t$ discriminates between categories. The IDF of a term is the total number of documents $N$ divided by the number of documents $n$ containing the term, followed by a base-10 logarithm:

$$\mathrm{idf}(t) = \log_{10}\frac{N}{n}$$

The training bunch is loaded from trainData.dat and the corresponding code is run to convert every document into a feature vector; the result is saved with pickle to train_tfidf.dat. Two `TfidfVectorizer` parameters matter here:

- `max_df`: when building the vocabulary, ignore terms whose document frequency is strictly higher than the given threshold. This reduces dimensionality.
- `sublinear_tf`: apply sublinear TF scaling, that is, replace tf with 1 + log(tf).

The test set testData.dat is processed the same way to obtain test_tfidf.dat.

Next, `MultinomialNB` implements Naive Bayes for multinomially distributed data and is one of the two classic Naive Bayes variants used in text classification (where data is typically represented as word count vectors, although tf-idf vectors also work well in practice). The distribution is parametrized by a vector $\theta_y = (\theta_{y1}, \dots, \theta_{yn})$ for each class $y$, where $n$ is the number of features (for text classification, the size of the vocabulary) and $\theta_{yi}$ is the probability $P(x_i \mid y)$ of feature $i$ appearing in a sample of class $y$. For each class, the sum of the log-probabilities is computed, and the class with the largest sum is the prediction. Running the corresponding statements produces MultinomialNBModel.dat.

Finally, the Naive Bayes model is loaded and evaluated. The trained classifier is loaded from MultinomialNBModel.dat and the test feature vectors from test_tfidf.dat. Running `predicted = clf.predict(testSet.tfidfmetrix)`, where `clf` is the classifier, yields an array in which each element is the predicted category of one test article. A 10*10 zero matrix is initialised, and for each prediction where category i was predicted as category j, the element at index (i, j) is incremented; this tallies the prediction results. The sklearn `metrics` package then gives the overall precision, recall and F1 score, and its `classification_report` function gives precision, recall and F1 for each category.

##### 3.4 SVM

For the SVM, data preprocessing, reading the txt files, loading the training set into memory, converting documents into feature vectors, and loading the model to inspect the test results are all the same as for Naive Bayes, so they are not repeated here.

The SVM model is trained and the result saved. The two most important parameters of `svm.SVC` are `C` and `kernel`. `C` is the penalty parameter of the error term, usually chosen as a power of 10, $10^n$ with $n$ from -5 upwards. A large `C` penalises the slack variables heavily and pushes them towards 0, so misclassification is punished more and the model tends to classify the whole training set correctly; training accuracy is then high but generalisation is weak. A small `C` reduces the penalty for misclassification, tolerates more errors and generalises better. `kernel` is the kernel function used by the SVM. In my experiments, C=1.0 with kernel=linear gave the best classification results.

#### Results

- Naive Bayes: (figure omitted)
- Support vector machine: (figure omitted)

#### Problems encountered and solutions

During crawling, some historical pages could not be fetched and some URLs were malformed, so the crawler needs error handling to stay robust. The crawler was also not very efficient, so collecting the dataset took a long time. When training the SVM, I first used `SVC` from sklearn's `svm` module, which was extremely slow: one round of training plus prediction took a full day and night. After tracking down the cause I switched to `LinearSVC`, which was dramatically faster.

#### Code

##### Web crawler

1. Collect all news links.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.chinanews.com"
URL_FILE = r"F:\data\2020url.txt"


def getUrls(url):
    """Append the article links found on one scroll-news page to URL_FILE."""
    req = requests.get(url).text
    if not req:
        return
    bf = BeautifulSoup(req, 'html.parser')
    div_bf = bf.find('div', attrs={'class': 'content_list'})
    if div_bf is None:
        return
    div_a = div_bf.find_all('div', attrs={'class': 'dd_bt'})
    if len(div_a) == 0:
        return
    with open(URL_FILE, 'a', encoding='UTF-8') as urltxt:
        for div in div_a:
            link = div.find('a').get("href")
            print(link)
            # Keep only links of the category being collected (here: mil).
            if str(link).find("/mil/") != -1:
                if not str(link).startswith(BASE_URL):
                    link = BASE_URL + link
                urltxt.write(link + '\n')


years = ['2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020']
months = ["%02d" % m for m in range(1, 12)]   # "01" .. "11"
days = ["%02d" % d for d in range(1, 32)]     # "01" .. "31"

# range(1) restricts the run to years[0]; widen the range to cover more years.
for year in range(1):
    for month in range(11):
        print(years[year] + " " + months[month])
        for day in range(31):
            try:
                url = (BASE_URL + "/scroll-news/" + years[year] + "/"
                       + months[month] + days[day] + "/news.shtml")
                getUrls(url)
            except Exception:
                # Dead pages and invalid dates (e.g. 02/31) are skipped.
                continue
```

2. Visit each news link and write the article to the database.

```python
# coding: utf-8
import re

import pymysql
import pymysql.cursors
import requests
from bs4 import BeautifulSoup

HEADERS = {
    'Connection': 'close',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
INSERT_SQL = "insert into auto_news(title,href,time,chapter) values(%s,%s,%s,%s)"


def parse_article(bf, box_class, time_class):
    """Extract (title, time, body) for one of the two page layouts on the site."""
    h1 = bf.find('div', attrs={'class': box_class}).find('h1')
    title = re.sub(r'\s+', '', h1.get_text())            # article title
    timediv = bf.find('div', attrs={'class': time_class})
    time1 = timediv.get_text().replace(" ", "")[0:16]    # publication time
    p = bf.find('div', attrs={'class': 'left_zw'}).find_all('p', text=True)
    chapter = ''.join('\n' + ptext.text for ptext in p)  # article body
    return title, time1, chapter


def saveToMysql():
    with open('/root/datamining/autourls', 'r', encoding='utf-8') as file_it_urls:
        links = [line.strip() for line in file_it_urls]
    conn = pymysql.connect(host="127.0.0.1", user="root", password="root",
                           database="datamining", charset='utf8mb4',
                           cursorclass=pymysql.cursors.DictCursor)
    cur = conn.cursor()
    # Optimisation: count the rows stored by earlier runs so they can be skipped.
    cur.execute("select count(*) from auto_news")
    already = cur.fetchone()['count(*)']
    requests.adapters.DEFAULT_RETRIES = 5
    i = 0
    count = 0
    for url in links:
        count += 1
        # Skip video pages.
        if url.find("/shipin/") != -1:
            continue
        # Resume support: skip the leading URLs handled by an earlier run.
        if already > 0 and count <= already:
            print(url + " is already in the database ****")
            continue
        i += 1
        if url.startswith("http://finance."):
            continue
        req = requests.get(url, headers=HEADERS)
        if req.status_code != 200:
            continue
        req.encoding = 'GBK'
        bf = BeautifulSoup(req.text, 'html.parser')
        # Optimisation: skip blank pages.
        if bf.title is None:
            continue
        # Skip 404 pages.
        if bf.title.text.find("404") != -1:
            continue
        print("Processing url #" + str(i) + " " + url)
        div1 = bf.find('div', attrs={'class': 'left_bt'})
        div2 = bf.find('div', attrs={'class': 'content'})
        # Skip pages that match neither known layout.
        if div1 is None and div2 is None:
            continue
        try:
            try:
                title, time1, chapter = parse_article(bf, 'left_bt', 'left_time')
            except Exception:
                # Fall back to the second page layout.
                title, time1, chapter = parse_article(bf, 'content', 'left-t')
            cur.execute(INSERT_SQL, (title, url, time1, chapter))
            conn.commit()
        except Exception as e:
            print("Error while processing url #" + str(i) + ":")
            print(e)
        finally:
            req.close()
    conn.close()


def main():
    saveToMysql()


if __name__ == '__main__':
    main()
```

3. Save the news from the database to local txt files.

```python
# -*- coding: utf-8 -*-
import os
import re

import jieba.posseg
import MySQLdb
import MySQLdb.cursors


def savefile(dirpath, filepath, content):
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)
    savepath = os.path.join(dirpath, filepath)
    print(savepath)
    with open(savepath, "w", encoding="utf-8") as fp:
        fp.write(content)


def function(datalist, path):
    with open('stop_words_ch.txt', 'r', encoding='utf-8') as f:
        stopwords = {word.strip() for word in f}
    # DictCursor is required because rows are accessed by column name below.
    db = MySQLdb.connect(host="localhost", user="root", passwd="root",
                         db="datamining", charset='utf8',
                         cursorclass=MySQLdb.cursors.DictCursor)
    cursor = db.cursor()
    i = 200001
    for item in datalist:
        sql = "select content from " + item
        cursor.execute(sql)
        results = cursor.fetchall()
        for row in results:
            content = []
            # Strip punctuation, Latin letters, digits and date characters.
            temptest = re.sub("[@·《》、.%,。?“”():(\u3000)(\xa0)!… ;▼]|[a-zA-Z0-9]|['月''日''年']",
                              "", row['content'])
            words = jieba.posseg.cut(temptest)
            for w in words:
                # Keep nouns that are not stop words.
                if w.word not in stopwords and w.flag == 'n':
                    content.append(w.word)
            forpath = path + "/" + item.replace("_even", "").replace("_odd", "")
            subpath = str(i) + ".txt"
            savefile(forpath, subpath, " ".join(content))
            i = i + 1
    db.close()
```

##### Naive Bayes classifier

```python
# -*- coding: utf-8 -*-
import os
import datetime
import pickle

import matplotlib.pyplot as plt  # plotting library
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils import Bunch  # formerly sklearn.datasets.base.Bunch


class TF_IDF:
    data_path = "F:\\data\\segmentation\\train\\"
    test_path = "F:\\data\\segmentation\\test\\"
    result_path = "F:\\data\\BYSRESULT\\"
    stop_words_path = "F:\\data\\stop_words_ch.txt"
    # class_path = "../test_res/"
    stop_words = []
    class_code = {"auto_news": 0, "cj_news": 1, "educate_news": 2, "fangchan_news": 3,
                  "mil_news": 4, "stock_news": 5, "technology_news": 6, "tw_news": 7,
                  "ty_news": 8, "yl_news": 9}

    def __init__(self):
        self.ConfusionMatrix = np.zeros([len(TF_IDF.class_code), len(TF_IDF.class_code)])
        TF_IDF.stop_words = self.read_file(TF_IDF.stop_words_path).strip().split("\n")

    def printList(self, mylist):
        for item in mylist:
            print(item, end=" ")
        print()

    # Write a text file
    def save_file(self, path, content):
        with open(path, 'w', encoding='utf-8') as f:
            f.write(content)

    # Read a text file
    def read_file(self, path):
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()

    # Read a pickled bunch (or model) file
    def readbunchobj(self, path):
        with open(path, "rb") as file_obj:
            return pickle.load(file_obj)

    def writebunchobj(self, path, bunchobj):
        with open(path, "wb") as file_obj:
            pickle.dump(bunchobj, file_obj)

    def _load_bunch(self, path, out_name, tag):
        """Read every segmented txt file under path into a Bunch and pickle it."""
        begintime = datetime.datetime.now()
        # Skip hidden entries such as .DS_Store.
        fileDocs = [item for item in os.listdir(path) if not item.startswith(".")]
        print(fileDocs)
        bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])
        bunch.target_name.extend(fileDocs)
        for mydir in fileDocs:
            class_path = os.path.join(path, mydir)       # category sub-directory
            for file_path in os.listdir(class_path):     # files of this category
                if not file_path.endswith("txt"):
                    continue
                fullname = os.path.join(class_path, file_path)
                print("Processing file: " + fullname)
                bunch.label.append(mydir)
                bunch.contents.append(self.read_file(fullname))
                bunch.filenames.append(mydir + "\\" + file_path)
        self.writebunchobj(TF_IDF.result_path + out_name, bunch)
        span = datetime.datetime.now() - begintime
        print(tag + " bunch: len(contents), len(label):", len(bunch.contents), len(bunch.label))
        print(tag + " data saved, seconds elapsed:", span.seconds)

    # Load the segmented training files
    def loadSegmentation(self, path):
        self._load_bunch(path, "trainData.dat", "training")

    def loadTestData(self, path):
        self._load_bunch(path, "testData.dat", "test")

    def calculateTFIDF(self, train_tfidf_path, bunch_path, tfidf_path):
        begintime = datetime.datetime.now()
        bunch = self.readbunchobj(bunch_path)
        tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label,
                           filenames=bunch.filenames, tfidfmetrix=[], vocabulary={})
        if train_tfidf_path is not None:
            # Test set: reuse the training vocabulary so the columns line up.
            # NOTE: fit_transform re-estimates the IDF weights on the test set;
            # persisting the fitted vectorizer and calling transform() would
            # keep them identical to training.
            trainbunch = self.readbunchobj(train_tfidf_path)
            tfidfspace.vocabulary = trainbunch.vocabulary
            vectorizer = TfidfVectorizer(stop_words=TF_IDF.stop_words, sublinear_tf=True,
                                         max_df=0.5, vocabulary=trainbunch.vocabulary)
            tfidfspace.tfidfmetrix = vectorizer.fit_transform(bunch.contents)
        else:
            vectorizer = TfidfVectorizer(stop_words=TF_IDF.stop_words, sublinear_tf=True,
                                         max_df=0.5)
            tfidfspace.tfidfmetrix = vectorizer.fit_transform(bunch.contents)
            tfidfspace.vocabulary = vectorizer.vocabulary_
            self.writebunchobj(TF_IDF.result_path + "myvocabulary.dat", vectorizer.vocabulary_)
            # ch2 = SelectKBest(chi2, k=130000)
            # train_X = ch2.fit_transform()
        self.writebunchobj(tfidf_path, tfidfspace)
        span = datetime.datetime.now() - begintime
        if train_tfidf_path is not None:
            print("Test tf-idf matrix built, seconds elapsed:", span.seconds)
        else:
            print("Training tf-idf matrix built, seconds elapsed:", span.seconds)

    def metrics_result(self, actual, predict):
        print('precision: {0:.3f}'.format(metrics.precision_score(actual, predict, average='weighted')))
        print('recall: {0:0.3f}'.format(metrics.recall_score(actual, predict, average='weighted')))
        print('f1-score: {0:.3f}'.format(metrics.f1_score(actual, predict, average='weighted')))
        print(classification_report(actual, predict))

    def trainProcess(self, trainSet):
        print("Naive Bayes training started:")
        begintime = datetime.datetime.now()
        # Train on the tf-idf matrix and the labels. alpha is the additive
        # smoothing parameter; a value this small applies almost no smoothing.
        clf = MultinomialNB(alpha=0.00001).fit(trainSet.tfidfmetrix, trainSet.label)
        with open(TF_IDF.result_path + "MultinomialNBModel.dat", "wb") as file_obj:
            pickle.dump(clf, file_obj)
        span = datetime.datetime.now() - begintime
        print("Naive Bayes training time in seconds:", span.seconds)

    def predictProcess(self, modelPath, testSet):
        begintime = datetime.datetime.now()
        clf = self.readbunchobj(modelPath)
        errorFile = "MultinomialNBerror.txt"
        # Predict the categories
        predicted = clf.predict(testSet.tfidfmetrix)
        span = datetime.datetime.now() - begintime
        print("Naive Bayes prediction time:", span)
        outPutList = []
        for flabel, file_name, expct_cate in zip(testSet.label, testSet.filenames, predicted):
            self.ConfusionMatrix[TF_IDF.class_code[flabel]][TF_IDF.class_code[expct_cate]] += 1
            if flabel != expct_cate:
                outPutList.append(file_name + ": actual: " + flabel + " --> predicted: " + expct_cate)
        self.save_file(TF_IDF.result_path + errorFile, "\n".join(outPutList))
        print("Prediction finished")
        self.showPredictResult()
        self.metrics_result(testSet.label, predicted)
        print("Confusion matrix:")
        for opt, val in [('display.max_rows', 100), ('display.max_columns', 1000),
                         ('display.max_colwidth', 1000), ('display.width', 1000)]:
            pd.set_option(opt, val)
        names = ["auto", "cj", "educate", "fangchan", "mil", "stock", "technology", "tw", "ty", "yl"]
        cm = confusion_matrix(testSet.label, predicted)
        print(pd.DataFrame(cm, columns=names, index=names))
        plt.figure(figsize=(12, 8), dpi=100)
        np.set_printoptions(precision=2)
        # Row-normalise so each cell shows the share of the actual class.
        cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        ind_array = np.arange(10)
        x, y = np.meshgrid(ind_array, ind_array)
        for x_val, y_val in zip(x.flatten(), y.flatten()):
            c = cm_normalized[y_val][x_val]
            plt.text(x_val, y_val, "%0.2f" % (c,), color='red', fontsize=13,
                     va='center', ha='center')
        plt.imshow(cm, interpolation='nearest', cmap=plt.cm.binary)
        plt.title("confusion_matrix")
        plt.colorbar()
        xlocations = np.array(range(10))
        plt.xticks(xlocations, list(self.class_code), rotation=90)
        plt.yticks(xlocations, list(self.class_code))
        plt.ylabel('Actual label')
        plt.xlabel('Predict label')
        # Offset the minor ticks so the grid falls between cells
        tick_marks = np.array(range(10)) + 0.5
        plt.gca().set_xticks(tick_marks, minor=True)
        plt.gca().set_yticks(tick_marks, minor=True)
        plt.gca().xaxis.set_ticks_position('none')
        plt.gca().yaxis.set_ticks_position('none')
        plt.grid(True, which='minor', linestyle='-')
        plt.gcf().subplots_adjust(bottom=0.15)
        # Save and show the confusion matrix
        plt.savefig("bayes_confusion_matrix.png", format='png')
        plt.show()

    def showPredictResult(self):
        code_to_name = {code: name for name, code in TF_IDF.class_code.items()}
        for i in range(len(self.ConfusionMatrix)):
            keyWord = code_to_name[i]
            print("===================================================")
            print(keyWord + " total predictions:", np.sum([self.ConfusionMatrix[i]]))
            print(keyWord + " correct predictions:", self.ConfusionMatrix[i][i])
            for j in range(len(self.ConfusionMatrix[0])):
                if j == i:
                    continue
                predictKey = code_to_name[j]
                print(keyWord + " predicted as " + predictKey + ":", self.ConfusionMatrix[i][j])
        print("===================================================")


if __name__ == '__main__':
    tfidf = TF_IDF()
    # Load the training and test sets (run once, then keep commented out)
    print("Loading training and test sets ===>")
    # tfidf.loadSegmentation(TF_IDF.data_path)
    # tfidf.loadTestData(TF_IDF.test_path)

    # Compute the TF-IDF matrices (run once, then keep commented out)
    # print("\nComputing TF-IDF matrices ===>")
    # train_tfidf_path = TF_IDF.result_path + "train_tfidf.dat"
    # train_bunch_path = TF_IDF.result_path + "trainData.dat"
    # tfidf.calculateTFIDF(train_tfidf_path=None, bunch_path=train_bunch_path,
    #                      tfidf_path=train_tfidf_path)
    # test_tfidf_path = TF_IDF.result_path + "test_tfidf.dat"
    # test_bunch_path = TF_IDF.result_path + "testData.dat"
    # tfidf.calculateTFIDF(train_tfidf_path=train_tfidf_path, bunch_path=test_bunch_path,
    #                      tfidf_path=test_tfidf_path)

    # Train, predict and show the results
    print("\nTrain - predict - show results ===>")
    train_tfidf_path = TF_IDF.result_path + "train_tfidf.dat"
    trainSet = tfidf.readbunchobj(train_tfidf_path)
    tfidf.trainProcess(trainSet)
    test_tfidf_path = TF_IDF.result_path + "test_tfidf.dat"
    testSet = tfidf.readbunchobj(test_tfidf_path)
    MultinomialNBPath = TF_IDF.result_path + "MultinomialNBModel.dat"
    tfidf.predictProcess(MultinomialNBPath, testSet)
```

##### SVM classifier

```python
# -*- coding: utf-8 -*-
import os
import datetime
import pickle

import matplotlib.pyplot as plt  # plotting library
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.svm import LinearSVC
from sklearn.utils import Bunch  # formerly sklearn.datasets.base.Bunch


class TF_IDF:
    data_path = "F:\\data\\segmentation\\train\\"
    test_path = "F:\\data\\segmentation\\test\\"
    result_path = "F:\\data\\SVMRESULT\\"
    stop_words_path = "F:\\data\\stop_words_ch.txt"
    stop_words = []
    class_code = {"auto_news": 0, "cj_news": 1, "educate_news": 2, "fangchan_news": 3,
                  "mil_news": 4, "stock_news": 5, "technology_news": 6, "tw_news": 7,
                  "ty_news": 8, "yl_news": 9}

    def __init__(self):
        self.ConfusionMatrix = np.zeros([len(TF_IDF.class_code), len(TF_IDF.class_code)])
        TF_IDF.stop_words = self.read_file(TF_IDF.stop_words_path).strip().split("\n")

    def printList(self, mylist):
        for item in mylist:
            print(item, end=" ")
        print()

    # Write a text file
    def save_file(self, path, content):
        with open(path, 'w', encoding='utf-8') as f:
            f.write(content)

    # Read a text file
    def read_file(self, path):
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()

    # Read a pickled bunch (or model) file
    def readbunchobj(self, path):
        with open(path, "rb") as file_obj:
            return pickle.load(file_obj)

    def writebunchobj(self, path, bunchobj):
        with open(path, "wb") as file_obj:
            pickle.dump(bunchobj, file_obj)

    def _load_bunch(self, path, out_name, tag):
        """Read every segmented txt file under path into a Bunch and pickle it."""
        begintime = datetime.datetime.now()
        fileDocs = [item for item in os.listdir(path) if not item.startswith(".")]
        print(fileDocs)
        # Define the bunch structure
        bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])
        bunch.target_name.extend(fileDocs)
        for mydir in fileDocs:
            class_path = os.path.join(path, mydir)       # category sub-directory
            for file_path in os.listdir(class_path):     # files of this category
                if not file_path.endswith("txt"):
                    continue
                fullname = os.path.join(class_path, file_path)
                print("Processing file: " + fullname)
                bunch.label.append(mydir)
                bunch.contents.append(self.read_file(fullname))
                bunch.filenames.append(mydir + "\\" + file_path)
        self.writebunchobj(TF_IDF.result_path + out_name, bunch)
        span = datetime.datetime.now() - begintime
        print(tag + " bunch: len(contents), len(label):", len(bunch.contents), len(bunch.label))
        print(tag + " data saved, seconds elapsed:", span.seconds)

    # Load the segmented training set into a bunch
    def loadTrainData(self, path):
        self._load_bunch(path, "trainData.dat", "training")

    # Load the segmented test set into a bunch
    def loadTestData(self, path):
        self._load_bunch(path, "testData.dat", "test")

    # Training set: build the vocabulary, then the tf-idf matrix
    def calulateTrainTFIDF(self, train_bunch_path, train_tfidf_path):
        begintime = datetime.datetime.now()
        bunch = self.readbunchobj(train_bunch_path)
        tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label,
                           filenames=bunch.filenames, tfidfmetrix=[], vocabulary={})
        # TF-IDF weighting: the more often a term occurs in a document and the
        # fewer other documents contain it, the better it characterises that
        # document, so the larger its weight.
        vectorizer = TfidfVectorizer(stop_words=TF_IDF.stop_words, sublinear_tf=True,
                                     max_df=0.5, max_features=100000)
        # Convert the tokenised documents into a TF-IDF weight matrix
        tfidfspace.tfidfmetrix = vectorizer.fit_transform(bunch.contents)
        tfidfspace.vocabulary = vectorizer.vocabulary_
        print("Vocabulary size:", len(tfidfspace.vocabulary))
        # Feature names in column order (the original call wrote an empty file)
        vocab = vectorizer.vocabulary_
        self.save_file("F:\\data\\VOCABULARY\\get_feature_names.txt",
                       "\n".join(sorted(vocab, key=vocab.get)))
        # Save the vocabulary so the test tf-idf matrix can reuse it
        self.writebunchobj(TF_IDF.result_path + "myvocabulary.dat", vectorizer.vocabulary_)
        self.save_file("F:\\data\\VOCABULARY\\test.txt", str(vectorizer.vocabulary_))
        self.writebunchobj(train_tfidf_path, tfidfspace)
        span = datetime.datetime.now() - begintime
        print("Training tf-idf matrix built, seconds elapsed:", span.seconds)

    # Test set: reuse the training vocabulary, then build the tf-idf matrix
    def calulateTestTFIDF(self, test_bunch_path, test_tfidf_path, train_tfidf_path):
        begintime = datetime.datetime.now()
        bunch = self.readbunchobj(test_bunch_path)
        tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label,
                           filenames=bunch.filenames, tfidfmetrix=[], vocabulary={})
        trainbunch = self.readbunchobj(train_tfidf_path)
        tfidfspace.vocabulary = trainbunch.vocabulary
        # max_features limits the dimensionality; the vocabulary fixes the columns.
        # NOTE: fit_transform re-estimates the IDF weights on the test set.
        vectorizer = TfidfVectorizer(stop_words=TF_IDF.stop_words, sublinear_tf=True,
                                     max_df=0.5, max_features=100000,
                                     vocabulary=trainbunch.vocabulary)
        tfidfspace.tfidfmetrix = vectorizer.fit_transform(bunch.contents)
        self.writebunchobj(test_tfidf_path, tfidfspace)
        span = datetime.datetime.now() - begintime
        print("Test tf-idf matrix built, seconds elapsed:", span.seconds)

    def SVMProcess(self, trainSet):
        print("SVM training started:")
        begintime = datetime.datetime.now()
        # Use LinearSVC rather than svm.SVC, which converges far too slowly here
        clf = LinearSVC(C=1, tol=1e-5)
        # Train on the tf-idf matrix and the true labels
        clf.fit(trainSet.tfidfmetrix, trainSet.label)
        # Save the trained model
        model_path = TF_IDF.result_path + "svmModel.dat"
        with open(model_path, "wb") as file_obj:
            pickle.dump(clf, file_obj)
        span = datetime.datetime.now() - begintime
        print("SVM training time:", span)
        test_tfidf_path = TF_IDF.result_path + "test_tfidf_svm.dat"
        testSet = self.readbunchobj(test_tfidf_path)
        self.predictProcess(model_path, testSet)

    def predictProcess(self, modelPath, testSet):
        # Load the trained model
        begintime = datetime.datetime.now()
        clf = self.readbunchobj(modelPath)
        errorFile = "SVMerror.txt"
        # Predict the categories
        predicted = clf.predict(testSet.tfidfmetrix)
        span = datetime.datetime.now() - begintime
        print("SVM prediction time:", span)
        outPutList = []
        for flabel, file_name, expct_cate in zip(testSet.label, testSet.filenames, predicted):
            self.ConfusionMatrix[TF_IDF.class_code[flabel]][TF_IDF.class_code[expct_cate]] += 1
            if flabel != expct_cate:
                outPutList.append(file_name + ": actual: " + flabel + " --> predicted: " + expct_cate)
        self.save_file(TF_IDF.result_path + errorFile, "\n".join(outPutList))
        print("Prediction finished")
        # self.showPredictResult()
        self.metrics_result(testSet.label, predicted)
        print("Confusion matrix:")
        # pandas is used only to print the matrix with row and column names
        for opt, val in [('display.max_rows', 100), ('display.max_columns', 1000),
                         ('display.max_colwidth', 1000), ('display.width', 1000)]:
            pd.set_option(opt, val)
        names = ["auto", "cj", "educate", "fangchan", "mil", "stock", "technology", "tw", "ty", "yl"]
        cm = confusion_matrix(testSet.label, predicted)
        print(pd.DataFrame(cm, columns=names, index=names))
        plt.figure(figsize=(12, 8), dpi=100)
        np.set_printoptions(precision=2)
        # Row-normalise so each cell shows the share of the actual class.
        cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        ind_array = np.arange(10)
        x, y = np.meshgrid(ind_array, ind_array)
        for x_val, y_val in zip(x.flatten(), y.flatten()):
            c = cm_normalized[y_val][x_val]
            plt.text(x_val, y_val, "%0.2f" % (c,), color='red', fontsize=13,
                     va='center', ha='center')
        plt.imshow(cm, interpolation='nearest', cmap=plt.cm.binary)
        plt.title("confusion_matrix")
        plt.colorbar()
        xlocations = np.array(range(10))
        plt.xticks(xlocations, list(self.class_code), rotation=90)
        plt.yticks(xlocations, list(self.class_code))
        plt.ylabel('Actual label')
        plt.xlabel('Predict label')
        tick_marks = np.array(range(10)) + 0.5
        plt.gca().set_xticks(tick_marks, minor=True)
        plt.gca().set_yticks(tick_marks, minor=True)
        plt.gca().xaxis.set_ticks_position('none')
        plt.gca().yaxis.set_ticks_position('none')
        plt.grid(True, which='minor', linestyle='-')
        plt.gcf().subplots_adjust(bottom=0.15)
        # Save and show the confusion matrix
        plt.savefig("confusion_matrix.png", format='png')
        plt.show()

    def metrics_result(self, actual, predict):
        print('precision: {0:.3f}'.format(metrics.precision_score(actual, predict, average='weighted')))
        print('recall: {0:0.3f}'.format(metrics.recall_score(actual, predict, average='weighted')))
        print('f1-score: {0:.3f}'.format(metrics.f1_score(actual, predict, average='weighted')))
        print(classification_report(actual, predict))


if __name__ == '__main__':
    tfidf = TF_IDF()
    # Load the training and test sets (run once, then keep commented out)
    print("Loading training and test sets ===>")
    # tfidf.loadTrainData(TF_IDF.data_path)
    # tfidf.loadTestData(TF_IDF.test_path)

    # Compute the TF-IDF matrices (run once, then keep commented out)
    # print("\nComputing TF-IDF matrices ===>")
    # train_tfidf_path = TF_IDF.result_path + "train_tfidf_svm.dat"
    # train_bunch_path = TF_IDF.result_path + "trainData.dat"
    # tfidf.calulateTrainTFIDF(train_tfidf_path=train_tfidf_path,
    #                          train_bunch_path=train_bunch_path)
    # test_tfidf_path = TF_IDF.result_path + "test_tfidf_svm.dat"
    # test_bunch_path = TF_IDF.result_path + "testData.dat"
    # tfidf.calulateTestTFIDF(train_tfidf_path=train_tfidf_path,
    #                         test_bunch_path=test_bunch_path,
    #                         test_tfidf_path=test_tfidf_path)

    # Train, predict and show the results
    print("\nTrain - predict - show results ===>")
    train_tfidf_path = TF_IDF.result_path + "train_tfidf_svm.dat"
    trainSet = tfidf.readbunchobj(train_tfidf_path)
    tfidf.SVMProcess(trainSet)
```
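
The Naive Bayes decision rule from section 3.3 (pick the category with the largest sum of log-probabilities) can be illustrated with a small pure-Python sketch. It is not the project's implementation, which relies on sklearn's `MultinomialNB` over tf-idf vectors; the function names `train_multinomial_nb` and `predict`, the toy corpus, and the choice of Laplace smoothing with `alpha=1.0` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenised docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_theta = {}
    for c in class_docs:
        # Additive smoothing keeps unseen (token, class) pairs from zeroing the product.
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_theta[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_theta


def predict(doc, log_prior, log_theta):
    """Return the class maximising log P(c) + sum_j log P(x_j | c)."""
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_theta[c][tok] for tok in doc if tok in log_theta[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy corpus, already segmented into tokens.
train_docs = [
    ["engine", "car", "brake"],
    ["car", "tyre", "engine"],
    ["match", "goal", "team"],
    ["team", "coach", "goal"],
]
train_labels = ["auto", "auto", "ty", "ty"]

log_prior, log_theta = train_multinomial_nb(train_docs, train_labels)
prediction = predict(["car", "engine", "goal"], log_prior, log_theta)
print(prediction)
```

Working in log space turns the product over features into a sum, which avoids floating-point underflow when a document has thousands of terms; sklearn's `MultinomialNB` follows the same idea.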