# web2020-lab1

**Repository Path**: leoncoci/web2020-lab1

## Basic Information

- **Project Name**: web2020-lab1
- **Description**: USTC web 2020 lab1
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2020-11-22
- **Last Updated**: 2021-08-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# enron-search-engine

## Environment
- Windows 10
- Python 3.7

## How to run
```bash
# Install dependencies
pip install nltk
pip install numpy
pip install scipy
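# The inverted index and tf-idf matrix are already stored in output/, so there is no need to run perprocess.py or build_index.py.
# Note: to re-tokenize and regenerate the inverted index, perprocess.py needs some modifications first.
# Start searching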
python ./search.py
```

## Key functions

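`build_index()`: the main function. It traverses all files to build the inverted index and the tf-idf matrix, and stores both in a suitable format.

`preProcessFiles(filesPath)`: preprocesses the email documents, removing header information such as the sender and tokenizing the text.

`id_filepath(txtID, fPath)`: assigns an ID to each document and maps that ID to the document path.

`to_filepath(txtID=None)`: reads the id-filepath mapping file and returns a dictionary that maps each document ID to its document path.

`semantic_search(Q, Array, Matrix)`: the main function for semantic search.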
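As a rough illustration of how a tf-idf based semantic search can rank documents, here is a minimal sketch using cosine similarity. It is not the code in `src/search.py`: the names `rank_documents`, `vocab`, `tfidf` and `top_k` are hypothetical, the matrix is assumed to have one row per word and one column per document, and the demo values are made up.

```python
import numpy as np

def rank_documents(query_terms, vocab, tfidf, top_k=3):
    """Rank documents by cosine similarity between the query and each column.

    vocab -- list of words; vocab[i] labels row i of tfidf (assumed layout)
    tfidf -- array of shape (n_words, n_docs)
    """
    row_of = {word: i for i, word in enumerate(vocab)}
    # Build a term-count vector for the query, ignoring unknown words.
    q = np.zeros(len(vocab), dtype=np.float32)
    for term in query_terms:
        if term in row_of:
            q[row_of[term]] += 1.0
    if not q.any():
        return []
    doc_norms = np.linalg.norm(tfidf, axis=0)
    doc_norms[doc_norms == 0] = 1.0  # avoid division by zero for empty documents
    scores = (q @ tfidf) / (doc_norms * np.linalg.norm(q))
    order = np.argsort(-scores)[:top_k]
    return [(int(d), float(scores[d])) for d in order if scores[d] > 0]

# Made-up demo data: 3 words x 3 documents.
vocab = ["energy", "meeting", "price"]
tfidf = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.5]], dtype=np.float32)
results = rank_documents(["energy"], vocab, tfidf)
print([doc_id for doc_id, _ in results])  # document 0 first, then document 2
```

Documents that share no words with the query score zero and are dropped from the result.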

`bool_search(indexdict)`: the main function for Boolean search.
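To make the idea of an inverted index and Boolean search concrete, the following is a small sketch under stated assumptions. The helper names (`build_inverted_index`, `bool_and`, `bool_or`, `bool_not`) and the toy documents are hypothetical and do not reflect how `build_index.py` or `search.py` actually store or query the index.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps doc_id -> list of tokens; returns word -> sorted list of doc ids."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def bool_and(index, terms):
    """Documents containing every term."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

def bool_or(index, terms):
    """Documents containing at least one term."""
    return sorted(set().union(*(index.get(t, ()) for t in terms)))

def bool_not(index, term, all_ids):
    """Documents that do not contain the term."""
    return sorted(set(all_ids) - set(index.get(term, ())))

# Toy corpus of already tokenized documents (made up for illustration).
docs = {0: ["gas", "price"], 1: ["gas", "meeting"], 2: ["meeting"]}
index = build_inverted_index(docs)
print(bool_and(index, ["gas", "meeting"]))  # [1]
print(bool_or(index, ["gas", "meeting"]))   # [0, 1, 2]
print(bool_not(index, "gas", docs))         # [2]
```

Sets keep the sketch short; a real index over a large corpus might instead merge sorted posting lists to save memory.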


## Additional files

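`src/search.py`: lets the user choose between Boolean search, semantic search, and exiting the search process.

`src/build_index.py`: builds the inverted index and the tf-idf matrix.

`src/match_file.py`: collects the functions that map document IDs to document paths.

`src/perprocess.py`: source code for file preprocessing.

`output/id_to_file_path`: the file that maps document IDs to paths.

`output/word_array`: the file that maps tf-idf matrix row indices to words.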

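## Special cases

You may see this error: MemoryError: Unable to allocate 1.93 GiB for an array with shape (1000, 517401) and data type float32

This is not a bug in the program. It is caused by a different dtype setting: the array in the message is float32, whereas ours uses float16.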
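The size in the error message can be checked with a quick calculation. The sketch below uses a hypothetical helper, `array_gib`, with hard-coded item sizes (2 bytes for float16, 4 bytes for float32), to show why float16 needs half the memory for the same shape.

```python
# Bytes per element for the dtypes discussed above.
BYTES_PER_ITEM = {"float16": 2, "float32": 4}

def array_gib(shape, dtype):
    """Return the memory needed for a dense array of this shape, in GiB."""
    n_items = 1
    for dim in shape:
        n_items *= dim
    return n_items * BYTES_PER_ITEM[dtype] / 2**30

shape = (1000, 517401)
print(f"float32: {array_gib(shape, 'float32'):.2f} GiB")  # float32: 1.93 GiB
print(f"float16: {array_gib(shape, 'float16'):.2f} GiB")  # float16: 0.96 GiB
```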