# KejsoSpider

**Repository Path**: supergame/KejsoSpider

## Basic Information

- **Project Name**: KejsoSpider
- **Description**: A vertical, structured crawler that turns common web page data layouts into structured records.
- **Primary Language**: Java
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 2
- **Created**: 2017-12-28
- **Last Updated**: 2020-12-18

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

## KejsoSpider

Extracts structured data from common web page layouts.

### Design Notes

Pipeline: MySQL is the base pipeline (`MysqlPipeline`); `FilePipeline` and `SolrPipeline` can be added to the pipeline queue.
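
The pipeline arrangement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the `Pipeline` interface, its `process` method, `RecordingPipeline`, and `PipelineQueue` are hypothetical names, and the pipelines only record what they receive instead of writing to MySQL, a file, or Solr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical interface: each pipeline receives one extracted record.
interface Pipeline {
    void process(Map<String, String> record);
}

// Stand-in pipeline that logs its name instead of touching a real backend.
class RecordingPipeline implements Pipeline {
    private final String name;
    private final List<String> log;

    RecordingPipeline(String name, List<String> log) {
        this.name = name;
        this.log = log;
    }

    @Override
    public void process(Map<String, String> record) {
        log.add(name + ":" + record.get("title"));
    }
}

// The base pipeline always runs first; optional pipelines follow in the order added.
class PipelineQueue {
    private final List<Pipeline> pipelines = new ArrayList<>();

    PipelineQueue(Pipeline basePipeline) {
        pipelines.add(basePipeline);
    }

    PipelineQueue add(Pipeline extra) {
        pipelines.add(extra);
        return this;
    }

    // Hands the same record to every pipeline in queue order.
    void dispatch(Map<String, String> record) {
        for (Pipeline p : pipelines) {
            p.process(record);
        }
    }
}
```

Passing the base pipeline to the constructor makes it impossible to build a queue without it, which mirrors the README's statement that the MySQL pipeline is the baseline and the others are optional.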



### Common Page Structures and Crawl Modes


### Incremental Crawl Strategy



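### Configuration File Reference

##### Global Settings

TaskName  Task name

Thread    Default number of threads

ProxyEnable  Whether to enable the proxy

CycleTimes  Number of retry cycles for the crawler

SleepTime  Interval between fetches

MoreSleepTime  [True | False]  Whether to lengthen the fetch interval on retries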
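
One way `CycleTimes`, `SleepTime`, and `MoreSleepTime` could interact is sketched below. The `CrawlSettings` class and `intervalFor` method are hypothetical, the millisecond unit is assumed, and the linear growth rule is only an assumption: the README says the interval increases on retries but does not say by how much.

```java
// Hypothetical holder for three of the global settings.
class CrawlSettings {
    final int cycleTimes;         // CycleTimes: maximum number of retry cycles
    final long sleepTimeMillis;   // SleepTime: base fetch interval (unit assumed to be ms)
    final boolean moreSleepTime;  // MoreSleepTime: lengthen the interval on retries

    CrawlSettings(int cycleTimes, long sleepTimeMillis, boolean moreSleepTime) {
        this.cycleTimes = cycleTimes;
        this.sleepTimeMillis = sleepTimeMillis;
        this.moreSleepTime = moreSleepTime;
    }

    // Interval to wait before the given retry (0 means the first attempt).
    // Linear growth is an assumption; the actual rule is not documented.
    long intervalFor(int retry) {
        if (retry < 0 || retry > cycleTimes) {
            throw new IllegalArgumentException("retry out of range: " + retry);
        }
        return moreSleepTime ? sleepTimeMillis * (retry + 1) : sleepTimeMillis;
    }
}
```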
 

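##### Component Settings

**ListConfig: list page settings**

          ListUrl  URL template for the list page

          PageEnable  Enable pagination

          PageStart  First page number

          PageEnd    Last page number

If pagination is enabled, the list page holds a queue of URLs.

          ListValue  Locator for list items; only XPath is supported for now. (Extension: CSS selectors)

          SqlTable   Name of the database table to store into. (Extension: index store)

          TableFields  Fields of the table; an id primary key is included by default

          UniqueField  Field with a unique index; it identifies a record uniquely, and duplicate records are not stored again

          ListTag    Locators for dynamic page fields

               ——TagName  Target field

               ——TagValue Field locator

          ConstTag   Constant fields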
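
The URL queue that pagination produces can be sketched as below. This is an illustration only: the `ListPageUrls` class is hypothetical, the `{page}` placeholder is an assumed template syntax (the README does not show the real one), and treating `PageEnd` as inclusive is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class ListPageUrls {
    // Hypothetical placeholder; the project's real ListUrl template syntax may differ.
    static final String PAGE_PLACEHOLDER = "{page}";

    // Expands a ListUrl template into the queue of list page URLs.
    static List<String> expand(String listUrl, boolean pageEnable, int pageStart, int pageEnd) {
        List<String> urls = new ArrayList<>();
        if (!pageEnable) {
            // Without pagination the list page is a single URL.
            urls.add(listUrl);
            return urls;
        }
        // PageEnd is treated as inclusive here (assumption).
        for (int page = pageStart; page <= pageEnd; page++) {
            urls.add(listUrl.replace(PAGE_PLACEHOLDER, String.valueOf(page)));
        }
        return urls;
    }
}
```

For example, `ListPageUrls.expand("http://example.com/list?p={page}", true, 1, 3)` yields three URLs ending in `p=1`, `p=2`, and `p=3`.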


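**ContentConfig: content page settings**

          ContentTable  Table name

          TableFields  Fields of the table

          UniqueField  Field with a unique index

          NotNullField  Non-null fields (more than one allowed)

          PageUrlField  Field holding the content page URL

          ContentTag   Locators for dynamic page fields

                ——TagName  Target field

                ——TagValue Field locator

          ContentList  Map-style field mapping

                ——Field  List of target fields

                ——MarkField  Corresponding page marks

                ——Mark  Locator for the page mark

                ——Code  Locator for the page content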


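**Spiders: spider chain settings**

          Spider  Spider definition. The name attribute names the Spider so that other Spiders can reference it.

                —— conf-def  Configuration the spider depends on. The class attribute gives the configuration type, and the name attribute gives the specific configuration name.

                —— depend  The previous spider in the chain that the current spider depends on. The ref attribute references that spider's name; the field attribute usually names the URL field that spider provides.

                —— recover  Incremental strategy for the spider. enable="true" turns it on; the mode attribute selects one of the three available strategies. field and url are the corresponding fields of the upstream and downstream spiders.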
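
How `depend` references might be resolved into a run order is sketched below. This is not the project's implementation: `SpiderDef`, `SpiderChain`, and `order` are hypothetical names, and the sketch models only the `name` and `ref` attributes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory form of one Spider entry.
class SpiderDef {
    final String name;       // value of the name attribute
    final String dependRef;  // value of depend's ref attribute, or null for the first spider

    SpiderDef(String name, String dependRef) {
        this.name = name;
        this.dependRef = dependRef;
    }
}

class SpiderChain {
    // Orders spider names so that each spider comes after the one it depends on.
    static List<String> order(List<SpiderDef> defs) {
        Map<String, SpiderDef> byName = new HashMap<>();
        for (SpiderDef d : defs) {
            byName.put(d.name, d);
        }
        List<String> ordered = new ArrayList<>();
        Set<String> placed = new HashSet<>();
        for (SpiderDef d : defs) {
            place(d, byName, placed, new HashSet<>(), ordered);
        }
        return ordered;
    }

    // Places the upstream spider first, then the spider itself.
    private static void place(SpiderDef d, Map<String, SpiderDef> byName,
                              Set<String> placed, Set<String> visiting, List<String> ordered) {
        if (placed.contains(d.name)) {
            return;
        }
        if (!visiting.add(d.name)) {
            throw new IllegalStateException("circular depend: " + d.name);
        }
        if (d.dependRef != null) {
            SpiderDef upstream = byName.get(d.dependRef);
            if (upstream == null) {
                throw new IllegalStateException("unknown ref: " + d.dependRef);
            }
            place(upstream, byName, placed, visiting, ordered);
        }
        placed.add(d.name);
        ordered.add(d.name);
    }
}
```

Rejecting unknown or circular `ref` values up front is a design choice of this sketch; it surfaces a misconfigured chain before any page is fetched.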

    



### Usage

Usage: java -jar BuildSpiderChain.jar  configfile  jdbc-config 

Usage: java -jar RunSpiderChainForMagazineFromFile.jar sourceurl  configfile  jdbc-config