# 软件工程测试1

**Repository Path**: liangxinixn/software-engineering-test-1

## Basic Information

- **Project Name**: 软件工程测试1
- **Description**: First lab session of the software engineering course; practicing the git toolchain
- **Primary Language**: Java
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-03-17
- **Last Updated**: 2022-05-31

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Software Engineering Notes

#### Using git

> On first use, configure your email address and user name. With an SSH key configured, you can push without entering a password.

1. Stage changes

   ```
   git add .
   ```

2. Commit to the local repository

   ```
   git commit . -m "commit message"
   ```

3. Push to the remote repository

   ```
   git push origin master
   ```

#### git practice exercises

- The second problem uses dynamic programming.
- The third problem checks whether m is prime: divide m by every integer from 2 to m-1; if every division leaves a remainder, m is prime.

#### First use of nltk for text processing

> Text processing uses this library.

1. A function to load the data

   ```
   def load_data(filePath):
       with open(filePath, encoding="utf-8") as f:
           content = f.read()
       return content
   ```

2. Split the text into sentences

   ```
   from nltk.tokenize import sent_tokenize

   if __name__ == "__main__":
       content = load_data("./one.txt")
       # print(content[:4000])
       # split the text into sentences
       print(sent_tokenize(content))
   ```

> OK, this library is rather impressive: it splits sentences according to meaning. Still, I'll process the rest with plain Python functions.

#### Counting word frequencies

1. Read the file, convert everything to lowercase, and strip the newlines and spaces. This is the result:

   ```
   if __name__ == "__main__":
       content = load_data("./one.txt")
       content = content.lower()
       content = content.replace('\n', '')
       content = content.replace(' ', '')
       print(content[:400])
   ```

   ![](https://gitee.com/liangxinixn/blog002/raw/master/image01/20210323202656.png)

2. I used a regular expression, and it was fast. The resulting list is very long; I expected this step to be time-consuming. Now every lowercase character sits in the list.

   ```
   import re
   result = re.findall(r'[\w]', content)
   print(content[:400])
   print(result[:400])
   print(len(result))
   ```

   ![](https://gitee.com/liangxinixn/blog002/raw/master/image01/20210323203144.png)
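An aside on the pattern: `[\w]` matches a single word character at a time, which is why the list above holds individual letters rather than words; `\w+` would capture whole words. A minimal sketch of the difference:

```python
import re

text = "cat and dog"

# [\w] matches one word character at a time
chars = re.findall(r'[\w]', text)
print(chars)  # → ['c', 'a', 't', 'a', 'n', 'd', 'd', 'o', 'g']

# \w+ matches maximal runs of word characters, i.e. whole words
words = re.findall(r'\w+', text)
print(words)  # → ['cat', 'and', 'dog']
```

Since the spaces were already stripped from `content`, the `[\w]` version effectively produces a letter-frequency count rather than a word-frequency count.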
3. This pass counts each element's frequency. Python's built-in methods are very convenient.

   ```
   # quick check: count how many times one element appears in the list
   print(result.count('a'))

   # a dict to hold the counts
   graph = {}

   # no need to write the elements out one by one: deduplicate the list, which can also be sorted
   # for i in ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p']
   element = list(set(result))
   print(element)
   # forgot that \w also matches digits; the pattern above should be [a-z]
   print(len(element))
   # sorting here had some issues, but it is not really necessary
   # element = element.sort(reverse=0)
   for i in element:
       graph[i] = result.count(i)
   print(graph)
   ```

4. Sorting the dict is a bit awkward: dict entries have no inherent order, so sort the values in a list and match the keys back.

   ```
   # put the dict's values in a list, sort it, then look up which key owns each value
   sort_list = list(graph.values())
   print(sort_list)
   # sorted from largest to smallest
   print(sorted(sort_list, reverse=True))
   newSort_list = sorted(sort_list, reverse=True)
   # dict insertion order carries no meaning here
   newestSortList = []
   for j in newSort_list:
       for i in element:
           if graph[i] == j:
               newestSortList.append(i)
   # dicts keep no sort order, so the keys are collected here; I checked the result and it matches
   print(newestSortList)
   ```

5. Bar chart

   ```
   # plotting
   import matplotlib.pyplot as plt
   plt.bar(newestSortList, newSort_list)
   plt.show()
   ```

   ![](https://gitee.com/liangxinixn/blog002/raw/master/image01/20210323212533.png)

6. Pie chart

   ```
   plt.pie(newSort_list, labels=newestSortList, autopct='%3.2f%%')
   plt.show()
   ```

   ![](https://gitee.com/liangxinixn/blog002/raw/master/image01/20210323213029.png)

   > Task done; the file is at 04/test.py.

7. Oh, I forgot to run it on the final input, the one with 600 English articles. Before running, convert the file to UTF-8 with Notepad++.

   - Memory usage:

     ![](https://gitee.com/liangxinixn/blog002/raw/master/image01/20210323213751.png)

   - Result:

     ![](https://gitee.com/liangxinixn/blog002/raw/master/image01/20210323213816.png)

#### Printing the 100 most common words and their counts

> These are the counts without removing stop words.

1. Following this blog post: it turns out there is a handy package for counting the elements of a list, and it can rank them too. Python's third-party ecosystem really is rich and pleasant to use.

   > Blog post: https://blog.csdn.net/onestab/article/details/78307765
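The package in question is presumably `collections.Counter` (actually part of the standard library rather than a third-party package), which the code in the next step uses. It counts and ranks in one step, replacing the manual value-sorting dance above; a toy sketch:

```python
from collections import Counter

words = ["the", "owl", "the", "cat", "the", "owl"]
c = Counter(words)

# most_common() returns (element, count) pairs, highest count first
top = c.most_common(2)
print(top)  # → [('the', 3), ('owl', 2)]
```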
2. I used nltk's tokenizer, which is powerful, and wrapped each step in a function.

   ```
   import re
   import nltk
   from collections import Counter

   # load the data
   def load_data(filePath):
       with open(filePath, encoding="utf-8") as f:
           content = f.read()
       return content

   # first-pass cleanup
   def deal_data(cont):
       content = cont.lower()
       content = content.replace('\n', '')
       content = content.replace(' ', '')
       return content

   # tokenize
   def devide_data(cont):
       text = nltk.word_tokenize(cont)
       return text

   # count word frequencies
   def count_data(cont_list):
       c = Counter()
       for x in cont_list:
           if len(x) > 1 and x != ';':
               c[x] += 1
       print(c.most_common(100))

   if __name__ == "__main__":
       content = load_data("./one.txt")
       content_list = devide_data(content)
       count_data(content_list)
   ```

- Result

```
[('the', 10930), ('to', 6467), ('and', 5785), ('of', 5259), ('Harry', 3990), ('said', 3901), ('he', 3728), ('was', 3651), ("'s", 3306), ('his', 3141), ('in', 2998), ('you', 2850), ('it', 2662), ('had', 2348), ('--', 2329), ('that', 2253), ('at', 2092), ('on', 1895), ('her', 1739), ('as', 1709), ('him', 1601), ('with', 1553), ("n't", 1488), ('not', 1479), ('Hermione', 1276), ('Ron', 1274), ('for', 1255), ('she', 1186), ('they', 1162), ('He', 1069), ('from', 1058), ('be', 1036), ('were', 1009), ('have', 1007), ('up', 1001), ('them', 982), ('out', 963), ('all', 956), ('but', 922), ('do', 863), ('we', 803), ('what', 796), ('been', 775), ('back', 774), ('The', 773), ('is', 730), ('did', 727), ('into', 691), ('could', 678), ('this', 670), ('who', 645), ('so', 639), ('an', 627), ('Dumbledore', 608), ('would', 602), ('Sirius', 601), ('me', 588), ('their', 587), ('about', 586), ('just', 584), ('over', 571), ('now', 570), ('there', 568), ('down', 557), ('Professor', 554), ('Umbridge', 553), ('are', 537), ('looked', 537), ('know', 528), ('your', 510), ('more', 508), ("'re", 505), ('if', 500), ('like', 497), ('by', 496), ('got', 493), ('when', 493), ('very', 492), ('then', 488), ('around', 486), ('again', 484), ('though', 476), ('one', 471), ("'ve", 461), ('Weasley', 445), ('voice', 431), ("'You", 431), ('Hagrid', 428),
('looking', 424), ('see', 413), ('off', 413), ('think', 409), ('time', 399), ('right', 394), ('face', 387), ('no', 382), ("'ll", 381), ('going', 379), ('still', 379), ('door', 377)]
```

3. The cleanup function should actually be applied, so the text is converted to lowercase. This is the result of the lowercase run:

   ```
   [('the', 11724), ('to', 6511), ('and', 5997), ('of', 5294), ('he', 4810), ('harry', 3996), ('said', 3908), ('was', 3668), ("'s", 3320), ('his', 3251), ('in', 3072), ('you', 3006), ('it', 2910), ('that', 2378), ('had', 2366), ('--', 2329), ('at', 2143), ('on', 1930), ('as', 1802), ('her', 1778), ('him', 1607), ('with', 1584), ('not', 1518), ("n't", 1501), ('they', 1476), ('she', 1473), ('for', 1318), ('hermione', 1278), ('ron', 1275), ('but', 1126), ('from', 1075), ('be', 1051), ('have', 1028), ('were', 1012), ('up', 1011), ('all', 1004), ('them', 986), ('out', 980), ('do', 907), ('we', 904), ('what', 886), ('been', 783), ('back', 779), ('there', 779), ('did', 749), ('is', 748), ('this', 744), ('into', 692), ('so', 685), ('could', 683), ('who', 677), ('an', 654), ('dumbledore', 611), ('now', 610), ('would', 607), ('just', 606), ('sirius', 605), ('their', 600), ('me', 595), ('about', 590), ('when', 581), ('over', 579), ('if', 571), ('down', 561), ('then', 560), ('are', 556), ('professor', 555), ('umbridge', 555), ('looked', 537), ('know', 534), ('your', 525), ('more', 517), ('by', 515), ("'re", 508), ('like', 507), ('got', 507), ('one', 501), ('very', 495), ('again', 494), ('though', 492), ('around', 488), ("'you", 483), ("'ve", 476), ('weasley', 449), ('looking', 433), ('voice', 431), ('hagrid', 428), ('see', 421), ('think', 417), ('off', 416), ('no', 413), ('time', 404), ('right', 398), ('still', 395), ('face', 388), ('going', 385), ("'ll", 383), ('or', 380), ('door', 377), ('head', 372)]
   ```

#### Counting with stop words removed
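The idea, sketched first with a tiny hand-made stop list (the real run below uses the much larger list from `nltk.corpus.stopwords`):

```python
from collections import Counter

# a tiny hand-made stop list, just for illustration
stop_words = {"the", "to", "and", "of", "he", "was", "in", "it"}

tokens = ["the", "owl", "flew", "to", "the", "castle", "and", "the", "owl", "hooted"]
# keep tokens that are longer than one character and not stop words
kept = [t for t in tokens if len(t) > 1 and t not in stop_words]
print(Counter(kept).most_common(3))  # → [('owl', 2), ('flew', 1), ('castle', 1)]
```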
1. I used the stop-word corpus that ships with nltk.

   ```
   from nltk.corpus import stopwords

   def count_data(cont_list):
       c = Counter()
       theStopWord = stopwords.words('english')
       print(theStopWord)
       for x in cont_list:
           if len(x) > 1 and (x not in theStopWord):
               c[x] += 1
       print(c.most_common(100))
   ```

2. The output still has a few blemishes: some words are not covered by the stop-word list. Compared with before, most are gone; the remainder can be added by hand or matched with a regular expression.

   ![](https://gitee.com/liangxinixn/blog002/raw/master/image01/20210323222837.png)

   > The final file is at 05/test.py.

#### Morphological normalization

1. Recognizing different forms of the same word; the nltk library covers this too. Here are two stemmer classes:

   ```
   import nltk
   from nltk.stem import PorterStemmer

   stemmerporter = PorterStemmer()
   print(stemmerporter.stem('working'))
   print(stemmerporter.stem('happiness'))
   ```

   > Output: work happi

   ```
   import nltk
   from nltk.stem import LancasterStemmer

   stemmerlan = LancasterStemmer()
   print(stemmerlan.stem('working'))
   print(stemmerlan.stem('happiness'))
   ```

   > Output: work happy

   > Reference: https://github.com/PacktPublishing/Mastering-Natural-Language-Processing-with-Python/tree/master/Chapter%203