scrapy: Scrapy crawler example

Scrapy crawler example - 20154B129 梁海斌

Requirement: crawl product information for the "手机" (mobile phone) category on JD.com (京东商城).

1. Development environment

Python 3.7, the Scrapy framework, pymysql for database access, XPath for page parsing

IDE: PyCharm

Database: MySQL 8.0

Data storage: JSON file and MySQL database

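2. Page analysis

Start page: https://search.jd.com/Search?keyword=手机&enc=utf-8&wq=手机&pvid=08ec44d1313844ac89af45f29e1798ee

URL pattern of the first-level (listing) pages, for page n:

https://search.jd.com/Search?keyword=手机&pvid=fc0f877798554d0482d8f84f204f3c2a&page=(2*n-1)&s=(n-1)*60+1&click=0

For example, page 2: https://search.jd.com/Search?keyword=手机&pvid=fc0f877798554d0482d8f84f204f3c2a&page=3&s=56&click=0

Page n as shown in the browser corresponds to page=(2*n)-1, and each such page displays 60 products.

The start offset of page n is nominally s=(n-1)*60+1. The captured page-2 URL above carries s=56 rather than 61, so the value the site actually sends can differ slightly from this formula.

The spider therefore uses a simpler scheme: it treats page as a plain counter n, where each page value returns 30 products. In other words, one browser page of 60 products spans two consecutive page values.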
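The simplified URL scheme can be sketched as a small helper. This is only an illustration and not part of the project: the function name `build_search_url` is hypothetical, the `pvid` parameter is left out on the assumption that the search also works without it, and `s` follows the 30-products-per-page convention described above.

```python
from urllib.parse import urlencode

BASE = "https://search.jd.com/Search"


def build_search_url(n, keyword="手机", items_per_page=30):
    """Return the listing URL for simplified page n (1-based)."""
    if n < 1:
        raise ValueError("page numbers start at 1")
    params = {
        "keyword": keyword,                    # percent-encoded as UTF-8 by urlencode
        "page": n,                             # plain page counter
        "s": (n - 1) * items_per_page + 1,     # start offset of this page
        "click": 0,
    }
    return BASE + "?" + urlencode(params)


if __name__ == "__main__":
    for page in (1, 2, 3):
        print(build_search_url(page))
```

With these assumptions, page 2 gets `page=2&s=31`, which matches the initial `page` and `n` values used by the spider in jd.py.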

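3. Page parsing process

The crawl starts from the start page, i.e. the search results for "手机", which is the first-level page. From it we extract the phone image, price, detail text, and store name, plus the URL of the second-level page (the product detail page).

From the second-level page we extract four fields: phone color, brand, product name, and weight. The remaining fields sit in irregular page structures and are hard to extract reliably, so they are skipped here.

All of the above fields are extracted with XPath through the Scrapy framework.

In short: collect part of the data on the listing page, then follow each product link to the detail page for the rest.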
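The listing-page step can be tried offline on a tiny, made-up fragment. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup, instead of Scrapy's selectors; the markup, the `example` hosts, and the function name `parse_listing` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified listing markup (real pages are larger
# and are HTML, not well-formed XML).
LISTING = """
<ul>
  <li class="gl-item">
    <div class="p-img"><a href="//item.example/1.html"><img data-lazy-img="//img.example/1.jpg"/></a></div>
    <div class="p-price"><strong><i>1999.00</i></strong></div>
    <div class="p-name"><a href="//item.example/1.html"><em> Phone A 8GB+128GB </em></a></div>
  </li>
</ul>
"""


def parse_listing(xml_text):
    """Extract one dict per product entry from the listing fragment."""
    root = ET.fromstring(xml_text)
    rows = []
    for li in root.findall(".//li[@class='gl-item']"):
        img = li.find(".//div[@class='p-img']//img")
        link = li.find(".//div[@class='p-name']/a")
        rows.append({
            "price": li.findtext(".//div[@class='p-price']//i"),
            # findtext returns None when nothing matches, so guard before strip()
            "detail": (li.findtext(".//div[@class='p-name']//em") or "").strip(),
            "img": img.get("data-lazy-img") if img is not None else None,
            "url": link.get("href") if link is not None else None,
        })
    return rows


if __name__ == "__main__":
    print(parse_listing(LISTING))
```

Each dict plays the role of the listing-page fields that are later combined with the detail-page fields.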

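4. Development workflow

Create the Scrapy project

(base) D:\workspace_crawler>scrapy startproject jingdong

Create the spider file

(base) D:\workspace_crawler\jingdong\jingdong\spiders>scrapy genspider jd https://www.jd.com/

After creating the project, if you develop in PyCharm, mark the parent directory as Sources Root; otherwise importing from items.py fails with an error.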

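settings.py configuration

Disable robots.txt compliance (with the line commented out, Scrapy falls back to its built-in default, which does not obey robots.txt)

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

Configure the pipelines (multiple pipelines)

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'jingdong.pipelines.JingdongPipeline': 100,  # pipeline that writes to the JSON file
   'jingdong.pipelines.MysqlPipeline': 101,  # pipeline that writes to the MySQL database
}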
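The numbers in `ITEM_PIPELINES` decide the order in which each item passes through the pipelines: lower values run first. The sketch below only mimics that ordering with a plain sort; `pipeline_order` is a hypothetical helper, not a Scrapy function.

```python
ITEM_PIPELINES = {
    "jingdong.pipelines.MysqlPipeline": 101,
    "jingdong.pipelines.JingdongPipeline": 100,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted by ascending priority value."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


if __name__ == "__main__":
    for path in pipeline_order(ITEM_PIPELINES):
        print(path)
```

With the values used in this project, every item reaches the JSON pipeline before the MySQL pipeline, regardless of the order in which the dictionary entries are written.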

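MySQL database configuration

# MySQL database settings
# Host address
DB_HOST = 'localhost'
# Port number (must be an integer)
DB_PORT = 3306
# User
DB_USER = 'root'
# Password
DB_PASSWORD = '123456'
# Database name
DB_NAME = 'crawl'
# Character set: write 'utf8', not 'utf-8'
DB_CHARSET = 'utf8'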
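Before connecting, these settings can be checked and mapped onto the keyword arguments that the pipeline passes to `pymysql.connect`. The helper below is a hypothetical sketch that works on any dict-like settings object; the function name and the hyphen check are assumptions added for illustration.

```python
REQUIRED_KEYS = ("DB_HOST", "DB_PORT", "DB_USER", "DB_PASSWORD", "DB_NAME", "DB_CHARSET")


def mysql_connect_kwargs(settings):
    """Validate the DB_* settings and return connection keyword arguments."""
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise KeyError("missing settings: " + ", ".join(missing))
    charset = settings["DB_CHARSET"]
    if "-" in charset:
        # Mirrors the note above: 'utf-8' must be written as 'utf8'
        raise ValueError("use a MySQL charset name such as 'utf8', not %r" % charset)
    return {
        "host": settings["DB_HOST"],
        "port": int(settings["DB_PORT"]),  # the port must be an integer
        "user": settings["DB_USER"],
        "password": settings["DB_PASSWORD"],
        "database": settings["DB_NAME"],
        "charset": charset,
    }


if __name__ == "__main__":
    demo = {
        "DB_HOST": "localhost", "DB_PORT": "3306", "DB_USER": "root",
        "DB_PASSWORD": "123456", "DB_NAME": "crawl", "DB_CHARSET": "utf8",
    }
    print(mysql_connect_kwargs(demo))
```

If product text may contain emoji or other 4-byte characters, `utf8mb4` is the MySQL character set that covers them; plain `utf8` in MySQL stores at most 3 bytes per character.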

Main program code

jd.py

import scrapy
from jingdong.items import JingdongItem

class JdSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['jd.com']
    start_urls = ['https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&wq=%E6%89%8B%E6%9C%BA&pvid=8f88cb3fdb694b52bfc13005cf0ba2f2']

    base_url = 'https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&wq=%E6%89%8B%E6%9C%BA&pvid=8f88cb3fdb694b52bfc13005cf0ba2f2&page='

    page = 2  # next listing page to request
    n = 31  # start offset (s) of page 2
    def parse(self, response):
        # print("====================================================")
        # print(response.text)
        # print("====================================================")
        # Select every product <li> element
        conn = response.xpath("//div[@id='J_goodsList']/ul//li[@class='gl-item']")

        # Iterate over each <li>
        for i in conn:
            # Price
            price = i.xpath(".//div[@class='p-price']//i/text()").extract_first()
            # Detail text (default to '' so .strip() never runs on None)
            detail = i.xpath(".//div[@class='p-name p-name-type-2']//em/text()").extract_first(default='').strip()
            # Store
            store = i.xpath(".//span[@class='J_im_icon']/a/text()").extract_first()
            # Image URL
            img = i.xpath(".//div[@class='p-img']//img/@data-lazy-img").extract_first()
            # URL of the second-level (detail) page
            url = i.xpath(".//div[@class='p-name p-name-type-2']/a/@href").extract_first()
            if not url:
                # No detail link in this entry, so skip it
                continue

            # The href is protocol-relative (//item.jd.com/...); resolve it against the page URL
            url = response.urljoin(url)
            # Request the detail page; parse_new handles the response,
            # and meta carries the listing-page fields along with the request
            yield scrapy.Request(url=url, callback=self.parse_new,meta={"price":price,"img":img,"store":store,"detail":detail})

        # Crawl 70 pages in total (the start page plus pages 2 to 70)
        if self.page < 71:
            print("================================================================================================"+str(self.page)+"==========")
            # URL of the next first-level (listing) page
            burl = self.base_url + str(self.page) + "&s=" + str(self.n) + "&click=0"
            self.page = self.page + 1
            self.n = self.n + 30
            # Request the next listing page and parse it with this same method
            yield scrapy.Request(url=burl, callback=self.parse)


    # Parse the second-level (detail) page
    def parse_new(self,response):
        print("==============================================")
        # Values passed from the listing page are read with response.meta[key]
        # print(response.meta['price'])
        # print("http:"+response.meta['img'])
        # print(response.meta['store'])
        # print(response.meta['detail'])
        # Color
        color = response.xpath("//div[@id='choose-attr-1']//div[@class='item  selected  ']/@data-value").extract_first()
        # Brand
        pinpai = response.xpath("//div[@class='p-parameter']//ul[@id='parameter-brand']//a/text()").extract_first()
        # Phone name
        pname = response.xpath("//ul[@class='parameter2 p-parameter-list']/li/@title").extract_first()
        # Weight
        weight = response.xpath("//ul[@class='parameter2 p-parameter-list']//li[3]/@title").extract_first()
        # print(color)
        # print(pinpai)
        # print(pname)
        # print(weight)

        # The image URL is protocol-relative; leave the field empty when none was found
        img = response.meta['img']
        img = "http:" + img if img else ''

        # Build the phone item
        phone = JingdongItem(
            pname = pname, # phone name
            pinpai = pinpai,  # brand
            store = response.meta['store'],  # store
            price = response.meta['price'],   # price
            color = color,   # color
            weight = weight,  # weight
            detail = response.meta['detail'],  # detail text
            img = img   # image URL
        )

        # Hand each item to the pipelines as soon as it is built
        yield phone
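The product links and image addresses on the listing page are protocol-relative (they start with `//`), which is why the spider has to add a scheme before requesting them. A minimal standard-library sketch of that normalization follows; the helper name `absolute_url` and the item id in the example are hypothetical.

```python
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve href against the page it was found on; None if href is missing."""
    if not href:
        return None
    return urljoin(page_url, href.strip())


if __name__ == "__main__":
    page = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA"
    print(absolute_url(page, "//item.jd.com/100.html"))
    print(absolute_url(page, None))
```

Resolving against the page URL keeps the page's own scheme and also tolerates links that are already absolute, which plain string concatenation does not.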

items.py

import scrapy

class JingdongItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pname = scrapy.Field()  # phone name
    pinpai = scrapy.Field()  # brand
    store = scrapy.Field()  # store
    price = scrapy.Field()  # price
    color = scrapy.Field()  # color
    weight = scrapy.Field()  # weight
    detail = scrapy.Field()  # detail text
    img = scrapy.Field()  # image URL

pipelines.py

import json
import pymysql
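from scrapy.utils.project import get_project_settings  # loads the project's settings.py

'''
Enable this pipeline by uncommenting the following in settings.py:
ITEM_PIPELINES = {
   'jingdong.pipelines.JingdongPipeline': 100,
}
'''
# Writes items to a JSON file
class JingdongPipeline:
    # Runs once when the spider starts
    def open_spider(self, spider):
        # 'a' appends, so repeated runs add to the existing file
        self.f = open('phone.json', 'a', encoding='utf-8')

    # item is the phone object yielded by the spider
    def process_item(self, item, spider):
        # write() needs a string, so serialize the item first.
        # One object per line keeps the file parseable line by line.
        jsontext = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.f.write(jsontext)
        return item

    # Runs once when the spider finishes
    def close_spider(self, spider):
        self.f.close()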



# Enabling multiple pipelines
# Define the pipeline class here, then enable it in settings.py
'''
ITEM_PIPELINES = {
   'jingdong.pipelines.JingdongPipeline': 100,
   'jingdong.pipelines.MysqlPipeline':101  # priority: lower values run first
}
'''
# Writes items to the MySQL database
class MysqlPipeline:
    # Runs once before the crawl starts
    def open_spider(self, spider):
        # get_project_settings reads the settings file
        settings = get_project_settings()
        # Connection parameters
        self.host = settings['DB_HOST']  # host
        self.port = settings['DB_PORT']  # port
        self.user = settings['DB_USER']  # user
        self.password = settings['DB_PASSWORD']  # password
        self.name = settings['DB_NAME']  # database name
        self.charset = settings['DB_CHARSET']   # character set
        # Connect to the database
        self.connect()

    # Open the database connection
    def connect(self):
        # pymysql.connect returns a Connection object
        self.conn = pymysql.connect(
            host=self.host,
            port=self.port,
            user=self.user,
            database=self.name,
            charset=self.charset,
            password=self.password
        )

        # The Cursor object is what executes SQL statements
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Insert one row. The %s placeholders let pymysql escape the values,
        # so quotes inside a field cannot break the statement.
        sql = 'insert into phone(pname,pinpai,store,price,color,weight,detail,img) ' \
              'values(%s,%s,%s,%s,%s,%s,%s,%s)'
        params = (item['pname'],
                  item['pinpai'],
                  item['store'],
                  item['price'],
                  item['color'],
                  item['weight'],
                  item['detail'],
                  item['img'])
        # Execute the statement
        self.cursor.execute(sql, params)

        # Commit the transaction
        self.conn.commit()

        return item

    def close_spider(self,spider):
        # Release the cursor and the connection
        self.cursor.close()
        self.conn.close()
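The insert step can be tried without a MySQL server. The sketch below swaps in the standard library's `sqlite3` module as a stand-in for pymysql, so it illustrates the parameter-binding idea only: sqlite3 uses `?` placeholders where pymysql uses `%s`, and the simplified table definition and the names `insert_phone` and `demo` are hypothetical.

```python
import sqlite3

COLUMNS = ("pname", "pinpai", "store", "price", "color", "weight", "detail", "img")


def insert_phone(conn, item):
    """Insert one item, letting the driver bind and escape the values."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = "INSERT INTO phone ({}) VALUES ({})".format(", ".join(COLUMNS), placeholders)
    # Missing fields are stored as NULL instead of the string "None"
    conn.execute(sql, tuple(item.get(col) for col in COLUMNS))
    conn.commit()


def demo():
    """Create an in-memory table, insert one row, and read it back."""
    conn = sqlite3.connect(":memory:")
    column_defs = ", ".join(col + " TEXT" for col in COLUMNS)
    conn.execute("CREATE TABLE phone (id INTEGER PRIMARY KEY, " + column_defs + ")")
    # A value containing double quotes would break a query built with str.format
    insert_phone(conn, {"pname": 'Phone "X"', "price": "1999.00"})
    row = conn.execute("SELECT pname, price, color FROM phone").fetchone()
    conn.close()
    return row


if __name__ == "__main__":
    print(demo())
```

Binding values through placeholders avoids both broken statements and SQL injection when scraped text contains quote characters.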

Create the database

CREATE DATABASE crawl;
USE crawl;
CREATE TABLE phone(
	id INT(11) AUTO_INCREMENT PRIMARY KEY COMMENT 'primary key id',
	pname VARCHAR(100) NULL DEFAULT '' COMMENT 'phone name',
	pinpai VARCHAR(100) NULL DEFAULT '' COMMENT 'brand',
	store VARCHAR(100) NULL DEFAULT '' COMMENT 'store',
	price VARCHAR(100) NULL DEFAULT '' COMMENT 'price',
	color VARCHAR(50) NULL DEFAULT '' COMMENT 'color',
	weight VARCHAR(50) NULL DEFAULT '' COMMENT 'weight',
	detail VARCHAR(100) NULL DEFAULT '' COMMENT 'detail text',
	img VARCHAR(100) NULL DEFAULT '' COMMENT 'image URL'
);

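Run the spider

(base) D:\workspace_crawler\jingdong\jingdong\spiders>scrapy crawl jd

Before running, check the database settings.

Running the crawl several times and keeping the run that returned the most data is recommended.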

JSON data
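phone.json holds a sequence of JSON objects rather than a single JSON document, so `json.load` rejects it. The sketch below is one tolerant way to read such a file back; the function name `iter_json_objects` is hypothetical, and it assumes the objects are separated only by commas and whitespace.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from text, skipping commas and whitespace between them."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        if text[pos] in " ,\r\n\t":
            pos += 1
            continue
        # raw_decode parses one value starting at pos and returns where it ended
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


if __name__ == "__main__":
    sample = '{"pname": "Phone A", "price": "1999.00"}, {"pname": "Phone B", "price": "2999.00"}, '
    for phone in iter_json_objects(sample):
        print(phone["pname"], phone["price"])
```

To read the real file, pass it the file contents, for example `open('phone.json', encoding='utf-8').read()`.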

Database data
