# django_scrapyd

**Repository Path**: tuyutian/django_scrapyd

## Basic Information

- **Project Name**: django_scrapyd
- **Description**: A web crawler built with Django and Scrapyd
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2020-05-16
- **Last Updated**: 2021-09-22

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

Install the dependencies:

```
pip install -r requirements.txt
```

Spider jobs are submitted with an `x-www-form-urlencoded` POST request to:

```
http://127.0.0.1:8000/spider/scrapy
```

Runtime environment: scrapyd, scrapydweb, django, logparser, selenium.

On Windows you also need pywin32 and the Chrome browser. `chromedriver.exe` is already included in the project root; chromedriver must match the installed browser version, and the bundled driver is 81.0.4044.69. Other versions can be downloaded from http://npm.taobao.org/mirrors/chromedriver/

Run the following commands, each from the directory indicated:

- In `ContentSpider`: `scrapyd`
- In the project root: `py manage.py runserver 8500`
- In the project root: `scrapydweb`
- In `ContentSpider`: `logparser -dir E:/xxx/scrapy_site/ContentSpider/logs`

There are also two settings files whose configuration items need to be changed. Then start the queue listener `scheduler.py`.

Baidu image storage:

- Domain: https://cdnimg.xxxxx
- bucketName: ixxxx
- Access Key: c222b7a0936049xxxxx
- Secret Key: fe95ef95a3294973xxxxx

Postman request body example (JSON):

```json
{
  "add_time": "2020-04-22",
  "allowed_domains": " ",
  "cate_id": 4,
  "charset": "utf-8",
  "id": 1,
  "list_xpath": ".//ul/li/a/@href",
  "rules": [
    {
      "match": ".//h1[@class=\"main-title\"]/text()",
      "name": "title"
    },
    {
      "key": 1587637191655,
      "match": ".//div[@class=\"date-source\"]/a[@class=\"source ent-source\"]/text()",
      "name": "author",
      "value": ""
    },
    {
      "key": 1587637233828,
      "match": ".//div[@class=\"channel-path\"]/a[2]/text()",
      "name": "tag",
      "value": ""
    },
    {
      "key": 1587637245922,
      "match": ".//div[@id=\"artibody\"]",
      "name": "content",
      "value": ""
    },
    {
      "key": 1587637281180,
      "match": ".//div[@class=\"date-source\"]/span[@class=\"date\"]/text()",
      "name": "create_time",
      "value": ""
    }
  ],
  "spider_name": "新浪体育中超列表爬虫",
  "start_urls": "http://sports.sina.com.cn/csl/",
  "update_time": "2020-04-22",
  "url_contain": " ",
  "url_no_contain": " ",
  "url_type": 1
}
```
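
For reference, the payload above can also be submitted from code instead of Postman. Below is a minimal sketch using Python `requests`; since the endpoint expects `x-www-form-urlencoded` rather than raw JSON, the sketch assumes each top-level field is a form field and that the nested `rules` list is serialized to a JSON string (that encoding, and the trimmed-down rule set, are assumptions, not documented behavior):

```python
import json

import requests

# Spider configuration mirroring the Postman example above.
config = {
    "id": 1,
    "cate_id": 4,
    "charset": "utf-8",
    "spider_name": "新浪体育中超列表爬虫",
    "start_urls": "http://sports.sina.com.cn/csl/",
    "list_xpath": ".//ul/li/a/@href",
    "url_type": 1,
    # The extraction rules are nested; serializing them to a JSON string
    # inside the form body is an assumption about how the API reads them.
    "rules": json.dumps([
        {"name": "title", "match": './/h1[@class="main-title"]/text()'},
        {"name": "content", "match": './/div[@id="artibody"]'},
    ], ensure_ascii=False),
}

# Passing a dict as `data=` makes requests send it x-www-form-urlencoded,
# matching the content type noted in this README.
resp = requests.post("http://127.0.0.1:8000/spider/scrapy", data=config)
print(resp.status_code, resp.text)
```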
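
Since the project drives Chrome through Selenium with the bundled `chromedriver.exe`, fetching a page inside a spider presumably looks something like the sketch below. It is written against Selenium 3.x (which matches the chromedriver 81 era), and the target URL is simply the example list page from the payload above:

```python
from selenium import webdriver

# Headless Chrome keeps the crawl from opening a visible browser window.
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--disable-gpu")

# chromedriver.exe sits in the project root and must match the installed
# Chrome version (81.0.4044.69 is bundled). `executable_path` is the
# Selenium 3.x way of pointing at the driver binary.
driver = webdriver.Chrome(executable_path="chromedriver.exe", options=options)
try:
    driver.get("http://sports.sina.com.cn/csl/")
    html = driver.page_source  # raw HTML handed off to the XPath rules
    print(len(html))
finally:
    driver.quit()
```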
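
The queue listener `scheduler.py` itself is not reproduced in this README. For orientation, a listener of this kind typically pulls jobs off a queue and hands them to Scrapyd over its documented `schedule.json` endpoint; the sketch below is an assumption about that flow, not the project's actual code. The in-process queue, the project name `ContentSpider`, and the spider name are illustrative, and Scrapyd's default port 6800 is assumed:

```python
import queue
import time

import requests

SCRAPYD = "http://127.0.0.1:6800"  # Scrapyd's default bind address/port

# Stand-in for whatever queue scheduler.py actually listens on.
jobs = queue.Queue()
jobs.put({"project": "ContentSpider", "spider": "content", "spider_id": 1})

def run_forever():
    while True:
        try:
            job = jobs.get(timeout=5)
        except queue.Empty:
            time.sleep(1)
            continue
        # schedule.json is Scrapyd's documented API for starting a crawl;
        # fields beyond project/spider are passed to the spider as arguments.
        resp = requests.post(f"{SCRAPYD}/schedule.json", data=job)
        print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}

if __name__ == "__main__":
    run_forever()
```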