# 中文爬虫-gensim分词修正误拆分 (Chinese crawler: fixing mis-split word segmentation with gensim)

**Repository Path**: dB3gK0r/gensim_phrases

## Basic Information

- **Project Name**: 中文爬虫-gensim分词修正误拆分
- **Description**: Code for a Chinese-language web crawler. The crawler needs constant maintenance, so it is only a selenium + chromedriver demo (feel free to take it and use it). The main point of the project is to use gensim phrase detection, based on word frequency, to repair words that jieba's Chinese segmentation wrongly splits apart (any other segmenter works as well).
- **Primary Language**: Python
- **License**: GPL-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2021-06-05
- **Last Updated**: 2022-12-14

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

1. Run seledemo.py with Chrome and chromedriver (or Firefox and geckodriver) configured, to crawl the raw abstracts.csv. This step is slow.
2. Run lv.py, which uses Levenshtein distance in place of drop_duplicates, to get abstracts_lv_clear.csv (copy-pasted near-duplicates removed).
3. Run cut_and_draw.py (called phrase_draw in the original English instructions) to get abstracts_phrase.csv.

#十年被审核功力 (roughly: "ten years of practice at getting through review")
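
Step 2 replaces exact-match deduplication with a Levenshtein-based one, so that abstracts differing by only a character or two are also treated as copies. The sketch below illustrates that idea in pure Python. It is not the code in lv.py: the function names, the normalized similarity score, and the 0.9 threshold are all assumptions made for illustration, and the real script may well use a Levenshtein library and different rules.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalized score: 1.0 means identical, 0.0 means unrelated."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def drop_near_duplicates(texts, threshold=0.9):
    """Keep a text only if it is below `threshold` similarity to every kept text.

    The threshold value is an assumption, not taken from the repository.
    """
    kept = []
    for text in texts:
        if all(similarity(text, other) < threshold for other in kept):
            kept.append(text)
    return kept


abstracts = [
    "本文提出一种基于词频的中文分词修正方法。",
    "本文提出一种基于词频的中文分词修正方法!",  # same text, one character changed
    "爬虫使用selenium抓取摘要。",
]
cleaned = drop_near_duplicates(abstracts)  # keeps the first and third entries
```

An exact-match filter would keep all three sample abstracts, because the second differs from the first by one punctuation mark. Note that this sketch compares each text against every kept text, so its cost grows quadratically with the number of abstracts.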