# 软件工程测试1

**Repository Path**: liangxinixn/software-engineering-test-1

## Basic Information

- **Project Name**: 软件工程测试1
- **Description**: First lab session of the software engineering course; practicing the git toolchain
- **Primary Language**: Java
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-03-17
- **Last Updated**: 2022-05-31

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Software Engineering Notes

#### Using git

> On first use, configure your email address and user name. With an SSH key configured, you can push without entering a password.

1. Stage changes

   ```
   git add .
   ```

2. Commit to the local repository

   ```
   git commit . -m "commit message"
   ```

3. Push to the remote repository

   ```
   git push origin master
   ```

#### git practice exercises

- The second problem uses dynamic programming.
- The third problem checks whether m is prime: divide m by every integer from 2 to m-1; if every division leaves a remainder, m is prime.

#### First use of nltk for text processing

> Text processing uses this library.

1. A function to load the data

   ```
   def load_data(filePath):
       with open(filePath, encoding="utf-8") as f:
           content = f.read()
       return content
   ```

2. Split the text into sentences

   ```
   from nltk.tokenize import sent_tokenize

   if __name__ == "__main__":
       content = load_data("./one.txt")
       # print(content[:4000])
       # split the text into sentences
       print(sent_tokenize(content))
   ```

> OK, this library is rather impressive: it splits sentences according to meaning. Still, I'll process the rest with plain Python functions.

#### Counting word frequencies

1. Read the file, convert everything to lowercase, and strip the newlines and spaces. This is the result:

   ```
   if __name__ == "__main__":
       content = load_data("./one.txt")
       content = content.lower()
       content = content.replace('\n', '')
       content = content.replace(' ', '')
       print(content[:400])
   ```

   ![](https://gitee.com/liangxinixn/blog002/raw/master/image01/20210323202656.png)

2. I used a regular expression, and it was fast. The resulting list is very long; I expected this step to be time-consuming. Now every lowercase character sits in the list.

   ```
   import re
   result = re.findall(r'[\w]', content)
   print(content[:400])
   print(result[:400])
   print(len(result))
   ```

   ![](https://gitee.com/liangxinixn/blog002/raw/master/image01/20210323203144.png)
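An aside on the pattern: `[\w]` matches a single word character at a time, which is why the list above holds individual letters rather than words; `\w+` would capture whole words. A minimal sketch of the difference:

```python
import re

text = "cat and dog"

# [\w] matches one word character at a time
chars = re.findall(r'[\w]', text)
print(chars)  # → ['c', 'a', 't', 'a', 'n', 'd', 'd', 'o', 'g']

# \w+ matches maximal runs of word characters, i.e. whole words
words = re.findall(r'\w+', text)
print(words)  # → ['cat', 'and', 'dog']
```

Since the spaces were already stripped from `content`, the `[\w]` version effectively produces a letter-frequency count rather than a word-frequency count.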
3. This pass counts each element's frequency. Python's built-in methods are very convenient.

   ```
   # quick check: count how many times one element appears in the list
   print(result.count('a'))

   # a dict to hold the counts
   graph = {}

   # no need to write the elements out one by one: deduplicate the list, which can also be sorted
   # for i in ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p']
   element = list(set(result))
   print(element)
   # forgot that \w also matches digits; the pattern above should be [a-z]
   print(len(element))
   # sorting here had some issues, but it is not really necessary
   # element = element.sort(reverse=0)
   for i in element:
       graph[i] = result.count(i)
   print(graph)
   ```

4. Sorting the dict is a bit awkward: dict entries have no inherent order, so sort the values in a list and match the keys back.

   ```
   # put the dict's values in a list, sort it, then look up which key owns each value
   sort_list = list(graph.values())
   print(sort_list)
   # sorted from largest to smallest
   print(sorted(sort_list, reverse=True))
   newSort_list = sorted(sort_list, reverse=True)
   # dict insertion order carries no meaning here
   newestSortList = []
   for j in newSort_list:
       for i in element:
           if graph[i] == j:
               newestSortList.append(i)
   # dicts keep no sort order, so the keys are collected here; I checked the result and it matches
   print(newestSortList)
   ```

5. Bar chart

   ```
   # plotting
   import matplotlib.pyplot as plt
   plt.bar(newestSortList, newSort_list)
   plt.show()
   ```

   ![](https://gitee.com/liangxinixn/blog002/raw/master/image01/20210323212533.png)

6. Pie chart

   ```
   plt.pie(newSort_list, labels=newestSortList, autopct='%3.2f%%')
   plt.show()
   ```

   ![](https://gitee.com/liangxinixn/blog002/raw/master/image01/20210323213029.png)

   > Task done; the file is at 04/test.py.

7. Oh, I forgot to run it on the final input, the one with 600 English articles. Before running, convert the file to UTF-8 with Notepad++.

   - Memory usage:

     ![](https://gitee.com/liangxinixn/blog002/raw/master/image01/20210323213751.png)

   - Result:

     ![](https://gitee.com/liangxinixn/blog002/raw/master/image01/20210323213816.png)

#### Printing the 100 most common words and their counts

> These are the counts without removing stop words.

1. Following this blog post: it turns out there is a handy package for counting the elements of a list, and it can rank them too. Python's third-party ecosystem really is rich and pleasant to use.

   > Blog post: https://blog.csdn.net/onestab/article/details/78307765
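The package in question is presumably `collections.Counter` (actually part of the standard library rather than a third-party package), which the code in the next step uses. It counts and ranks in one step, replacing the manual value-sorting dance above; a toy sketch:

```python
from collections import Counter

words = ["the", "owl", "the", "cat", "the", "owl"]
c = Counter(words)

# most_common() returns (element, count) pairs, highest count first
top = c.most_common(2)
print(top)  # → [('the', 3), ('owl', 2)]
```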
2. I used nltk's tokenizer, which is powerful, and wrapped each step in a function.

   ```
   import re
   import nltk
   from collections import Counter

   # load the data
   def load_data(filePath):
       with open(filePath, encoding="utf-8") as f:
           content = f.read()
       return content

   # first-pass cleanup
   def deal_data(cont):
       content = cont.lower()
       content = content.replace('\n', '')
       content = content.replace(' ', '')
       return content

   # tokenize
   def devide_data(cont):
       text = nltk.word_tokenize(cont)
       return text

   # count word frequencies
   def count_data(cont_list):
       c = Counter()
       for x in cont_list:
           if len(x) > 1 and x != ';':
               c[x] += 1
       print(c.most_common(100))

   if __name__ == "__main__":
       content = load_data("./one.txt")
       content_list = devide_data(content)
       count_data(content_list)
   ```

- Result

```
[('the', 10930), ('to', 6467), ('and', 5785), ('of', 5259), ('Harry', 3990), ('said', 3901), ('he', 3728), ('was', 3651), ("'s", 3306), ('his', 3141), ('in', 2998), ('you', 2850), ('it', 2662), ('had', 2348), ('--', 2329), ('that', 2253), ('at', 2092), ('on', 1895), ('her', 1739), ('as', 1709), ('him', 1601), ('with', 1553), ("n't", 1488), ('not', 1479), ('Hermione', 1276), ('Ron', 1274), ('for', 1255), ('she', 1186), ('they', 1162), ('He', 1069), ('from', 1058), ('be', 1036), ('were', 1009), ('have', 1007), ('up', 1001), ('them', 982), ('out', 963), ('all', 956), ('but', 922), ('do', 863), ('we', 803), ('what', 796), ('been', 775), ('back', 774), ('The', 773), ('is', 730), ('did', 727), ('into', 691), ('could', 678), ('this', 670), ('who', 645), ('so', 639), ('an', 627), ('Dumbledore', 608), ('would', 602), ('Sirius', 601), ('me', 588), ('their', 587), ('about', 586), ('just', 584), ('over', 571), ('now', 570), ('there', 568), ('down', 557), ('Professor', 554), ('Umbridge', 553), ('are', 537), ('looked', 537), ('know', 528), ('your', 510), ('more', 508), ("'re", 505), ('if', 500), ('like', 497), ('by', 496), ('got', 493), ('when', 493), ('very', 492), ('then', 488), ('around', 486), ('again', 484), ('though', 476), ('one', 471), ("'ve", 461), ('Weasley', 445), ('voice', 431), ("'You", 431), ('Hagrid', 428),
('looking', 424), ('see', 413), ('off', 413), ('think', 409), ('time', 399), ('right', 394), ('face', 387), ('no', 382), ("'ll", 381), ('going', 379), ('still', 379), ('door', 377)]
```

3. The cleanup function should actually be applied, so the text is converted to lowercase. This is the result of the lowercase run:

   ```
   [('the', 11724), ('to', 6511), ('and', 5997), ('of', 5294), ('he', 4810), ('harry', 3996), ('said', 3908), ('was', 3668), ("'s", 3320), ('his', 3251), ('in', 3072), ('you', 3006), ('it', 2910), ('that', 2378), ('had', 2366), ('--', 2329), ('at', 2143), ('on', 1930), ('as', 1802), ('her', 1778), ('him', 1607), ('with', 1584), ('not', 1518), ("n't", 1501), ('they', 1476), ('she', 1473), ('for', 1318), ('hermione', 1278), ('ron', 1275), ('but', 1126), ('from', 1075), ('be', 1051), ('have', 1028), ('were', 1012), ('up', 1011), ('all', 1004), ('them', 986), ('out', 980), ('do', 907), ('we', 904), ('what', 886), ('been', 783), ('back', 779), ('there', 779), ('did', 749), ('is', 748), ('this', 744), ('into', 692), ('so', 685), ('could', 683), ('who', 677), ('an', 654), ('dumbledore', 611), ('now', 610), ('would', 607), ('just', 606), ('sirius', 605), ('their', 600), ('me', 595), ('about', 590), ('when', 581), ('over', 579), ('if', 571), ('down', 561), ('then', 560), ('are', 556), ('professor', 555), ('umbridge', 555), ('looked', 537), ('know', 534), ('your', 525), ('more', 517), ('by', 515), ("'re", 508), ('like', 507), ('got', 507), ('one', 501), ('very', 495), ('again', 494), ('though', 492), ('around', 488), ("'you", 483), ("'ve", 476), ('weasley', 449), ('looking', 433), ('voice', 431), ('hagrid', 428), ('see', 421), ('think', 417), ('off', 416), ('no', 413), ('time', 404), ('right', 398), ('still', 395), ('face', 388), ('going', 385), ("'ll", 383), ('or', 380), ('door', 377), ('head', 372)]
   ```

#### Counting with stop words removed
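The idea, sketched first with a tiny hand-made stop list (the real run below uses the much larger list from `nltk.corpus.stopwords`):

```python
from collections import Counter

# a tiny hand-made stop list, just for illustration
stop_words = {"the", "to", "and", "of", "he", "was", "in", "it"}

tokens = ["the", "owl", "flew", "to", "the", "castle", "and", "the", "owl", "hooted"]
# keep tokens that are longer than one character and not stop words
kept = [t for t in tokens if len(t) > 1 and t not in stop_words]
print(Counter(kept).most_common(3))  # → [('owl', 2), ('flew', 1), ('castle', 1)]
```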
1. I used the stop-word corpus that ships with nltk.

   ```
   from nltk.corpus import stopwords

   def count_data(cont_list):
       c = Counter()
       theStopWord = stopwords.words('english')
       print(theStopWord)
       for x in cont_list:
           if len(x) > 1 and (x not in theStopWord):
               c[x] += 1
       print(c.most_common(100))
   ```

2. The output still has a few blemishes: some words are not covered by the stop-word list. Compared with before, most are gone; the remainder can be added by hand or matched with a regular expression.

   ![](https://gitee.com/liangxinixn/blog002/raw/master/image01/20210323222837.png)

   > The final file is at 05/test.py.

#### Morphological normalization

1. Recognizing different forms of the same word; the nltk library covers this too. Here are two stemmer classes:

   ```
   import nltk
   from nltk.stem import PorterStemmer

   stemmerporter = PorterStemmer()
   print(stemmerporter.stem('working'))
   print(stemmerporter.stem('happiness'))
   ```

   > Output: work happi

   ```
   import nltk
   from nltk.stem import LancasterStemmer

   stemmerlan = LancasterStemmer()
   print(stemmerlan.stem('working'))
   print(stemmerlan.stem('happiness'))
   ```

   > Output: work happy

   > Reference: https://github.com/PacktPublishing/Mastering-Natural-Language-Processing-with-Python/tree/master/Chapter%203