# myscrapy

**Repository Path**: wyu_001/myscrapy

## Basic Information

- **Project Name**: myscrapy
- **Description**: myscrapy is a simple Scrapy-like crawler tool that is easy to learn and comes with complete worked examples.
- **Primary Language**: Python
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-03-04
- **Last Updated**: 2024-05-16

## Categories & Tags

**Categories**: Uncategorized

**Tags**: Scrapy

## README

# myscrapy

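#### Introduction
myscrapy is a simple crawler tool modeled on Scrapy's programming style, so spider scripts are written in essentially the same way.
The internals are simple: each script runs on its own, in a single thread.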


#### Installation

1.  Install Python 3.

2.  Install the dependencies:
    `pip install pandas pymysql sqlalchemy lxml selenium numpy requests openpyxl`


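#### Usage

1. Configuration:  
   `config/setting.py` holds the run-time settings.  
   Set the environment variable:
   `PYTHONPATH` = the myscrapy directory

2. Static pages (HTML or JSON data): spiders generally subclass `MySpider`. Examples: `template/reqhtml.py`, `template/reqjson.py`.

3. Dynamic pages: crawl them by driving a browser with Selenium, subclassing `MySelenium`. Example: `template/reqselenium.py`.

4. `batch.py` runs the spider scripts in the `spider` directory as a batch.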
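
The last step, running every script in `spider` with `PYTHONPATH` pointing at the project root, can be sketched roughly as follows. This is not the actual `batch.py`; the function name `run_all_spiders` and its parameters are assumptions made for illustration, and the real script may select or run spiders differently.

```python
import os
import subprocess
import sys
from pathlib import Path


def run_all_spiders(project_root, spider_dir="spider"):
    """Run each .py script in spider_dir in turn; return {script name: exit code}."""
    root = Path(project_root).resolve()

    # Put the project root on PYTHONPATH so `from common... import ...` resolves
    # inside each spider script (hypothetical stand-in for setting it by hand).
    env = dict(os.environ)
    existing = env.get("PYTHONPATH")
    env["PYTHONPATH"] = str(root) if not existing else str(root) + os.pathsep + existing

    results = {}
    for script in sorted((root / spider_dir).glob("*.py")):
        if script.name.startswith("_"):
            continue  # skip __init__.py and private helpers
        # One script at a time, matching the single-threaded design.
        completed = subprocess.run([sys.executable, str(script)], env=env, cwd=str(root))
        results[script.name] = completed.returncode
    return results
```

Calling `run_all_spiders("/path/to/myscrapy")` returns a dictionary of exit codes, so a non-zero value points at the script that failed.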

Development example: create a file named `test.py` in the `spider` directory:

    from common.myspider import MySpider
    from common.item import Item

    from common.log import loger

    class ReqHtml(MySpider):
        name = 'test'
        start_urls = ['http://www.test.com/']

        def __init__(self):
            super().__init__()

            self.excel_item = Item()
            self.excel_item.setdefault("province","test")
            self.excel_item.setdefault("unit_name","test")


        def parse(self, response):
            # Each <li> in the listing is one entry; pull out its link and name.
            for test_node in response.xpath('//ul[@class="doctor-list clearfix"]/li'):
                link = test_node.xpath('./a/@href').getall()
                name = test_node.xpath('./div/h3/text()').getall()
                loger.info(f'link {link} name {name}')

                # Copy the shared defaults so each entry gets its own item.
                pass_item = self.excel_item.copy()
                pass_item.setdefault("test_name", name)

                # Follow the detail link and pass the item on to parse_detail.
                response.follow(url=link, callback=self.parse_detail, cb_kwargs={'item': pass_item})

            # "下一页" is the "next page" link text on the target site.
            next_page = response.xpath('//a[contains(text(),"下一页")]/@href').get()

            if next_page:
                response.follow(url=next_page, callback=self.parse)

        def parse_detail(self, response, item):

            brief = response.xpath('//div[@class="p20 mt20 zjxq"]//text()').getall()
            loger.info(brief)

            excel_item = item.copy()
            excel_item.setdefault("test_brief",brief)

            loger.info(excel_item)
            self.append(excel_item)

    if __name__ == '__main__':
        ReqHtml().start_request()

See the three example scripts in the `template` directory for reference.
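
The development example builds each record by copying a shared item and filling in fields with `setdefault`. A minimal sketch of that pattern, using plain dictionaries in place of `Item` (this assumes `Item` behaves like a `dict`, which the README does not state; `make_record` is a hypothetical helper):

```python
# Plain dicts stand in for myscrapy's Item here (assumption: Item is dict-like).
base = {}
base.setdefault("province", "test")
base.setdefault("unit_name", "test")


def make_record(base_item, name):
    """Return a new record that starts from the shared defaults."""
    record = base_item.copy()               # shallow copy; base_item itself is unchanged
    record.setdefault("test_name", name)    # key is missing, so it is added
    record.setdefault("province", "other")  # no effect: the key is already present
    return record


first = make_record(base, "A")
second = make_record(base, "B")
```

Because `setdefault` never overwrites an existing key, the shared defaults survive in every record. The copy is shallow, so a mutable value stored in the base item (a list, for instance) would be shared between records.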

