# Stop Word Lists in Free Open-source Software Packages

**Repository Path**: codenotes/stop-word-lists-in-free-open-source-software-packages

## Basic Information

- **Project Name**: Stop Word Lists in Free Open-source Software Packages
- **Description**: For personal study. Reposted from: https://www.aclweb.org/anthology/W18-2502/ Open-source software packages for language processing often include stop word lists. Users may apply them without being aware of their surprising omissions (e.g. "hasn't" but not "hadn't") and inclusions ("computer"), or of their incompatibility with a particular tokenizer. Motivated by issues raised about the scikit-learn stop list, we investigate the variation among and consistency within 52 popular English-language stop lists, and propose strategies for mitigating these problems.
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2020-12-24
- **Last Updated**: 2022-05-22

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Supplementary material for *Stop Word Lists in Free Open-source Software Packages*

NLP-OSS Workshop 2018

Attached are the primary data from https://github.com/igorbrigadir/stopwords, as well as the assorted scripts used to perform the analysis presented in the paper.

Datasets:

- `stopwords/` contains our data, taken from https://github.com/igorbrigadir/stopwords. (See Section 4)

Scripts:

- `preprocessing/` contains scripts for data preprocessing.
- `cluster-stop-lists.py` generates the hierarchically clustered heatmap of stop word lists. (See Section 5 & Figure 5)
- `*.ipynb` contain the analysis in Section 6.
- `upset-with-words.py` generates the UpSet plot for certain words. (See Section 6.2 & Figure 4)
- `incompleteness-analysis.py` explores the incompleteness of stop word lists. (See Section 6.3)
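
The description above mentions three kinds of problems: variation between lists, omissions within a list (e.g. "hasn't" but not "hadn't"), and incompatibility with a tokenizer. The following is a minimal sketch of how each could be checked. It is not the code from the scripts in this repository. The two stop lists are toy data invented for illustration, and the helper names (`jaccard`, `missing_from_family`, `unmatched_by_tokenizer`, `simple_tokenize`) are hypothetical.

```python
import re

# Toy stop lists for illustration only; NOT copied from any real package.
TOY_LISTS = {
    "list_a": {"the", "a", "and", "hasn't", "is", "of"},
    "list_b": {"the", "a", "and", "hasn't", "hadn't", "is", "computer"},
}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| between two word sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def missing_from_family(stop_list, family):
    """Return family members the list omits, but only when the list
    already contains at least one member of that family."""
    stop_list, family = set(stop_list), set(family)
    if not (family & stop_list):
        return []
    return sorted(family - stop_list)


def simple_tokenize(text):
    """Hypothetical tokenizer: lowercase, keep runs of 2+ word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def unmatched_by_tokenizer(stop_list, tokenize):
    """Return stop words the tokenizer never emits as a single token,
    so they could never be filtered out after tokenization."""
    return sorted(w for w in stop_list if tokenize(w) != [w])


if __name__ == "__main__":
    a, b = TOY_LISTS["list_a"], TOY_LISTS["list_b"]
    print("Jaccard similarity:", jaccard(a, b))
    print("Missing contractions in list_a:",
          missing_from_family(a, {"hasn't", "hadn't", "haven't"}))
    print("Never matched by tokenizer:",
          unmatched_by_tokenizer(a, simple_tokenize))
```

With this tokenizer, single-character words and contractions containing an apostrophe are reported as unmatched, because the tokenizer splits or drops them before the stop list is applied. A pairwise similarity such as Jaccard is one possible input to a hierarchical clustering of lists; whether the repository's clustering script uses this measure is not stated here.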