# shuowen_spider

**Repository Path**: oceanwave/shuowen_spider

## Basic Information

- **Project Name**: shuowen_spider
- **Description**: A crawler for the Shuowen website, http://www.shuowen.org
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2016-01-26
- **Last Updated**: 2022-06-27

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# shuowen_spider

Crawls http://www.shuowen.org/

#### Install and run

1. `cd shuowenjiezi.git/`
2. `scrapy crawl shuowen`

#### NOTE

1. Installing Scrapy

   On OSX, run `brew install openssl`, and then possibly `brew link openssl --force` if you are informed that links were not created. Install Scrapy using the following command:

       env CRYPTOGRAPHY_OSX_NO_LINK_FLAGS=1 LDFLAGS="$(brew --prefix openssl)/lib/libssl.a $(brew --prefix openssl)/lib/libcrypto.a" CFLAGS="-I$(brew --prefix openssl)/include" pip install scrapy

   You can, if you wish, use libressl in place of openssl in these commands.

2. Images whose filenames end in `-0.png` are the cropped versions (logo removed); the other images are the originals.

3. Table structure:

       CREATE TABLE `shuowenjiezi` (
         `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
         `chinese_character` varchar(45) CHARACTER SET utf8 DEFAULT NULL COMMENT '汉字',  -- the Chinese character
         `character_pic` varchar(128) CHARACTER SET utf32 NOT NULL COMMENT '小篆图片',  -- small seal script image
         `volume` varchar(45) CHARACTER SET utf8 NOT NULL COMMENT '第几卷',  -- volume number
         `radicals` varchar(45) CHARACTER SET utf8 NOT NULL COMMENT '偏旁部首',  -- radical
         `pinyin` varchar(45) CHARACTER SET utf8 NOT NULL COMMENT '拼音',  -- pinyin reading
         `fanqie_zhuyin` varchar(45) CHARACTER SET utf8 NOT NULL COMMENT '反切注音',  -- fanqie pronunciation gloss
         `original_text` longtext CHARACTER SET utf8 NOT NULL COMMENT '说文解字 原文',  -- original Shuowen Jiezi text
         `song_xx_notes` longtext CHARACTER SET utf8 NOT NULL COMMENT '宋代 徐鉉 徐鍇 注釋',  -- Song dynasty notes by Xu Xuan and Xu Kai
         `qing_d_notes` longtext CHARACTER SET utf8 NOT NULL COMMENT '清代 段玉裁《說文解字注》\n',  -- Qing dynasty commentary by Duan Yucai
         PRIMARY KEY (`id`),
         UNIQUE KEY `id_UNIQUE` (`id`),
         UNIQUE KEY `chinese_character_UNIQUE` (`chinese_character`)
       ) ENGINE=InnoDB AUTO_INCREMENT=7744 DEFAULT CHARSET=utf32 COLLATE=utf32_unicode_ci COMMENT='抓取源:http://www.shuowen.org'

#### Issues

1. About 2,100 characters have not yet been inserted into the database because of encoding problems.
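
The README does not say what the encoding problem is. One plausible cause, offered here only as an assumption, is that most text columns are declared `CHARACTER SET utf8`, which in MySQL is the 3-byte `utf8mb3` encoding and cannot store code points above U+FFFF. Rare characters in the supplementary planes (for example CJK Extension B) would then be rejected or truncated on insert. The sketch below shows one way to check which scraped characters fall into that group. The function names `needs_four_byte_utf8` and `split_by_storability` are hypothetical and are not part of this repository.

```python
def needs_four_byte_utf8(text: str) -> bool:
    """Return True if any code point in text lies outside the Basic Multilingual Plane.

    Such code points take 4 bytes in UTF-8, which MySQL's 3-byte `utf8`
    (utf8mb3) character set cannot store.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


def split_by_storability(characters):
    """Split an iterable of strings into (storable, rejected) lists.

    `storable` holds strings that fit in a 3-byte utf8 column;
    `rejected` holds strings that would need utf8mb4 (or utf32).
    """
    storable, rejected = [], []
    for ch in characters:
        if needs_four_byte_utf8(ch):
            rejected.append(ch)
        else:
            storable.append(ch)
    return storable, rejected
```

Running `split_by_storability` over the scraped `chinese_character` values and comparing the size of the rejected list with the roughly 2,100 missing rows would help confirm or rule out this explanation. If it holds, declaring the affected columns as `utf8mb4` (or leaving them on the table's `utf32` default) is the usual remedy, provided the database connection also uses a matching character set.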