# scrapy-redis Distributed Crawler

**Repository Path**: qilinx/spiderredis_distributed

## Basic Information

- **Project Name**: scrapy-redis Distributed Crawler
- **Description**: A small distributed crawler for Baidu Tieba built on scrapy-redis
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 1
- **Created**: 2018-06-28
- **Last Updated**: 2021-01-23

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# scrapy-redis Distributed Crawler

#### Project Introduction

A small distributed crawler for Baidu Tieba built on scrapy-redis.

#### Installation

1. Install a Python environment: download an installer that matches your machine from https://www.python.org/downloads/
2. Install Scrapy with pip: `pip install scrapy`
3. Install scrapy-redis with pip: `pip install scrapy-redis`

#### Preparing Redis (only the Master node needs to run it)

- Official site: https://redis.io/download
- Desktop GUI client: https://redisdesktop.com/download

Redis commands (cd into the Redis installation directory first):

- Start the Redis server
- Connect: `redis-cli`
- Connect to another host: `redis-cli -h localhost`

Edit the redis.windows.conf file in the Redis installation directory so that the Slaver nodes can connect remotely:

1. Comment out `bind 127.0.0.1`
2. Disable protected mode (`protected-mode no`) or set a password
3. Restart the Redis server

The steps below were written against scrapy-redis 0.6.8.

#### Steps

Create the project:

```
scrapy startproject myRedisSpider
```

Define the fields to store (items.py):

```
import scrapy


class TiebaItem(scrapy.Item):
    name = scrapy.Field()
    summary = scrapy.Field()
    person_sum = scrapy.Field()
    text_sum = scrapy.Field()
    # data provenance
    from_url = scrapy.Field()
    from_name = scrapy.Field()
    time = scrapy.Field()
```

Create the spider. cd into the project directory (myRedisSpider) and run:

```
scrapy genspider tieba tieba.baidu.com
```

The spider file is created at myRedisSpider/spiders/tieba.py.

Write the spider:

```
# -*- coding: utf-8 -*-
import scrapy
from scrapy_redis.spiders import RedisSpider
from myRedisSpider.items import TiebaItem


class TiebaSpider(RedisSpider):
    name = 'tieba'
    allowed_domains = ['tieba.baidu.com']
    page = 1
    page_max = 30
    url = 'http://tieba.baidu.com/f/index/forumpark?cn=%E5%86%85%E5%9C%B0%E6%98%8E%E6%98%9F&ci=0&pcn=%E5%A8%B1%E4%B9%90%E6%98%8E%E6%98%9F&pci=0&ct=1&st=new&pn='
    redis_key = 'tieba:start_urls'

    def parse(self, response):
        tieba_list = response.xpath('//div[@class="ba_content"]')
        for tieba in tieba_list:
            # pull the fields we need out of the page
            name = tieba.xpath('./p[@class="ba_name"]/text()').extract_first()              # forum name
            summary = tieba.xpath('./p[@class="ba_desc"]/text()').extract_first()           # forum description
            person_sum = tieba.xpath('./p/span[@class="ba_m_num"]/text()').extract_first()  # total member count
            text_sum = tieba.xpath('./p/span[@class="ba_p_num"]/text()').extract_first()    # total post count

            item = TiebaItem()
            item['name'] = name
            item['summary'] = summary
            item['person_sum'] = person_sum
            item['text_sum'] = text_sum
            item['from_url'] = response.url  # source URL of the data
            yield item

        if self.page < self.page_max:
            self.page += 1
            yield response.follow(self.url + str(self.page), callback=self.parse)
```

Configure settings.py (append the following at the bottom; leave everything else unchanged):

```
ITEM_PIPELINES = {
    'myRedisSpider.pipelines.MyredisspiderPipeline': 300,
    # store items in Redis; this pipeline must run after the others,
    # i.e. have lower priority (a higher number)
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# scrapy_redis duplicate filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# scrapy_redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# allow pausing mid-crawl without clearing the queued requests
SCHEDULER_PERSIST = True

# default Scrapy queue mode: priority queue
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
# queue mode: first in, first out
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
# stack mode: first in, last out
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"

# Redis host
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
REDIS_ENCODING = "utf-8"
```

Populate the provenance fields in the pipeline (pipelines.py):

```
import datetime


class MyredisspiderPipeline(object):
    def process_item(self, item, spider):
        item['time'] = datetime.datetime.utcnow()  # when the item was scraped
        item['from_name'] = spider.name            # which spider produced it
        return item
```

#### Running

Start the spider on every Slaver node (`scrapy runspider <spider file>.py`). cd into the myRedisSpider\spiders directory and run:

```
scrapy runspider tieba.py
```

The spider starts and then idles, waiting for a start URL to appear under its `redis_key`.

Send the command from the Master node (`lpush <redis_key> <start URL>`). cd to the Redis installation directory, connect with `redis-cli`, and enter:

```
lpush tieba:start_urls http://tieba.baidu.com/f/index/forumpark?cn=%E5%86%85%E5%9C%B0%E6%98%8E%E6%98%9F&ci=0&pcn=%E5%A8%B1%E4%B9%90%E6%98%8E%E6%98%9F&pci=0&ct=1&st=new&pn=1
```
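As an alternative to redis-cli, the start URL can be seeded from a short Python script with the redis-py client. A minimal sketch, assuming `pip install redis` and a Redis instance on localhost at the default port:

```
import redis

# connect to the Master's Redis instance (assumed to be localhost:6379, db 0)
r = redis.Redis(host='localhost', port=6379, db=0)

# push the start URL; any idle spider listening on this key will pick it up
start_url = ('http://tieba.baidu.com/f/index/forumpark'
             '?cn=%E5%86%85%E5%9C%B0%E6%98%8E%E6%98%9F&ci=0'
             '&pcn=%E5%A8%B1%E4%B9%90%E6%98%8E%E6%98%9F&pci=0'
             '&ct=1&st=new&pn=1')
r.lpush('tieba:start_urls', start_url)
```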
#### Link-following spider (CrawlSpider)

Create it with:

```
scrapy genspider -t crawl tieba_crawl tieba.baidu.com
```

Spider code:

```
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
# from scrapy.spiders import CrawlSpider, Rule
from scrapy.spiders import Rule
# 1. swap CrawlSpider for RedisCrawlSpider
from scrapy_redis.spiders import RedisCrawlSpider
from myRedisSpider.items import TiebaItem


# converting an ordinary CrawlSpider into a distributed RedisCrawlSpider, steps 1-4
# class TiebaSpider(CrawlSpider):
# 2. inherit from RedisCrawlSpider instead of CrawlSpider
class TiebaSpider(RedisCrawlSpider):
    name = 'tieba_crawl'
    # do not use dynamic domains: failing to resolve the domain drops every
    # request with "Filtered offsite request to 'tieba.baidu.com'"
    allowed_domains = ['tieba.baidu.com']
    # 3. comment out start_urls
    # start_urls = ['http://tieba.baidu.com/f/index/forumpark?cn=%E5%86%85%E5%9C%B0%E6%98%8E%E6%98%9F&ci=0&pcn=%E5%A8%B1%E4%B9%90%E6%98%8E%E6%98%9F&pci=0&ct=1&st=new&pn=1']
    # 4. the key that launches every spider node (<spider name>:start_urls)
    redis_key = 'tiebacrawl:start_urls'

    rules = (
        Rule(LinkExtractor(allow=(r'pn=\d+',)), callback='parse_item', follow=True),  # st=new&
    )

    def parse_item(self, response):
        tieba_list = response.xpath('//div[@class="ba_content"]')
        for tieba in tieba_list:
            name = tieba.xpath('./p[@class="ba_name"]/text()').extract_first()
            summary = tieba.xpath('./p[@class="ba_desc"]/text()').extract_first()
            person_sum = tieba.xpath('./p/span[@class="ba_m_num"]/text()').extract_first()
            text_sum = tieba.xpath('./p/span[@class="ba_p_num"]/text()').extract_first()

            item = TiebaItem()
            item['name'] = name
            item['summary'] = summary
            item['person_sum'] = person_sum
            item['text_sum'] = text_sum
            # record where the data came from
            item['from_url'] = response.url
            item['from_name'] = self.name
            yield item
```

Run the spider:

```
scrapy runspider tieba_crawl.py
```

Send the command:

```
lpush tiebacrawl:start_urls http://tieba.baidu.com/f/index/forumpark?cn=%E5%86%85%E5%9C%B0%E6%98%8E%E6%98%9F&ci=0&pcn=%E5%A8%B1%E4%B9%90%E6%98%8E%E6%98%9F&pci=0&ct=1&st=new&pn=1
```
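Once the crawl is running, RedisPipeline serializes each yielded item to JSON and pushes it onto a Redis list, keyed `<spider name>:items` by default (configurable via the REDIS_ITEMS_KEY setting), so results can be read back from any machine. A minimal sketch using redis-py, assuming the defaults above:

```
import json

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# RedisPipeline's default item key is '<spider name>:items'
for raw in r.lrange('tieba_crawl:items', 0, -1):
    item = json.loads(raw)
    print(item['name'], item['person_sum'], item['text_sum'])
```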