# CrawlerUtility

**Repository Path**: chugf/CrawlerUtility

## Basic Information

- **Project Name**: CrawlerUtility
- **Description**: No description available
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-11-14
- **Last Updated**: 2024-11-14

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# CrawlerUtility

Simplify the development of your web crawler.

# Usage

## Install

```
pip install --upgrade git+https://github.com/kingname/CrawlerUtility.git
```

## Common Utility

You can use this module without installing any third-party packages.

### ChromeHeaders2Dict

```
from CrawlerUtility import ChromeHeaders2Dict

chrome_headers = '''
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
Connection: keep-alive
Cookie: BAIDUID=E40AF2FAEC8CB0F382A3A8F5F59AC44D:FG=1; BIDUPSID=E40AF2FAEC8CB0F382A3A8F5F59AC44D; PSTM=1513916193; BDRCVFR[C0-VKBuJmg_]=mk3SLVN4HKm; BD_CK_SAM=1; pgv_pvi=8525405184; pgv_si=s3529928704; FP_UID=5eea85cb6e65c4d7a9f0f7b9d23ff3cb; BDRCVFR[w2jhEs_Zudc]=I67x6TjHwwYf0; BD_UPN=123253; shifen[62291884541_98248]=1520672084; BCLID=11171094791344044520; BDSFRCVID=LNCsJeC62ZBf13rACvOD-ViSJHNR0mTTH6aoKULoKvtI-AUyiIRrEG0PqU8g0Ku-sN62ogKK0mOTHvbP; H_BDCLCKID_SF=tbKq_DLXf-bSK4b1-4QD2DCShUFsWU6m-2Q-5KL-yqothDO4Lfb-XU3D3xrgBfvwLJRL-UbdJJjoOU5shUR-5McDLJo8axcN-eTxoUJhQCnJhhvGqJbFj6DebPRiJPr9Qgbq3ftLK-oj-D-mD55P; PSINO=7; MCITY=-131%3A; BDUSS=Vdic1Z6WHhEaGhvSW1KflhWUVYwcFRhemI0RjhDdjVmcGF1bktaVkNWQnppZDVhQUFBQUFBJCQAAAAAAAAAAAEAAACoVyMi1MLC5F-zpLCyAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHP8tlpz~LZac; BD_HOME=1; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; H_PS_PSSID=1993_1436_21094_18560_22157; sugstore=1; BDSVRTM=0
DNT: 1
Host: www.baidu.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36
'''

headers_dict = ChromeHeaders2Dict(chrome_headers)
print(headers_dict)
```

## Scrapy Utility

To use these modules, you must install `Scrapy` first.

### AbuyunProxyMiddleware

Modify `settings.py` of your Scrapy project:

```
DOWNLOADER_MIDDLEWARES = {
    'CrawlerUtility.scrapy_utility.ScrapyUtility.AbuyunProxyMiddleware': 548,
}

ABUYUN_PROXY_SERVER = 'http://http-dyn.abuyun.com:9020'  # must be set
ABUYUN_PROXY_USER = 'DWE2341LFOWQC4'  # must be set
ABUYUN_PROXY_PASSWORD = '94SLIC1304C'  # must be set
SPIDER_BEHIND_PROXY = ['BaiduSpider', 'QQSpider']  # list of spider names; if not set, all spiders will use the proxy
SKIP_PROXY_KEYWORD = ['http://google.com', 'safeurl.com/aaa']  # requests whose URL contains any of these patterns will not use the proxy
```
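If you are curious how such a downloader middleware typically works, the sketch below is a minimal illustration and not the actual CrawlerUtility implementation: the class name, the `_` helper logic, and the use of `basic_auth_header` are assumptions; only the setting names come from the example above.

```
# Illustrative sketch only -- not the actual CrawlerUtility code.
from w3lib.http import basic_auth_header


class SimpleAbuyunProxyMiddleware:
    """Attach the Abuyun proxy (with credentials) to outgoing requests."""

    def __init__(self, server, user, password, spiders, skip_keywords):
        self.server = server
        self.auth = basic_auth_header(user, password)
        self.spiders = spiders              # empty list means "apply to every spider"
        self.skip_keywords = skip_keywords  # substrings that disable the proxy for a URL

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        return cls(
            s.get('ABUYUN_PROXY_SERVER'),
            s.get('ABUYUN_PROXY_USER'),
            s.get('ABUYUN_PROXY_PASSWORD'),
            s.getlist('SPIDER_BEHIND_PROXY'),
            s.getlist('SKIP_PROXY_KEYWORD'),
        )

    def process_request(self, request, spider):
        # Skip spiders that are not listed (when a list is configured).
        if self.spiders and spider.name not in self.spiders:
            return None
        # Skip URLs that match any of the skip keywords.
        if any(keyword in request.url for keyword in self.skip_keywords):
            return None
        request.meta['proxy'] = self.server
        request.headers['Proxy-Authorization'] = self.auth
        return None
```

Running the middleware at priority 548 keeps it close to Scrapy's built-in `HttpProxyMiddleware`, so the proxy is attached after retries and redirects have produced the final request.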
### LogRequestUrlMiddleware

By default, Scrapy only logs the response's URL. But what if the request followed one or more redirects? And sometimes you send 10 requests but Scrapy only shows 5 responses, so you cannot tell whether you really requested only 5 URLs, or whether the other responses were lost or blocked. This module solves the problem by also logging the request URL.

To use this module, change Scrapy's `settings.py`. Note that this is a `Spider Middleware`, NOT a `Downloader Middleware`:

```
SPIDER_MIDDLEWARES = {
    'CrawlerUtility.scrapy_utility.ScrapyUtility.LogRequestUrlMiddleware': 548,
}

# Logging request URLs noticeably increases log volume, so use the following
# settings to limit which requests are logged.
SPIDER_SHOW_REQUESTS_URL = ['test']  # spiders whose request URLs you want to log
PATTERN_SHOW_REQUESTS_URL = ['httpbin', 'kingname']  # only URLs matching one of these patterns will be logged
```
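For reference, here is a minimal sketch of how such a spider middleware could be written. It reads the two settings shown above, but the class name, the filtering helper, and the log format are assumptions rather than the actual CrawlerUtility implementation.

```
# Illustrative sketch only -- not the actual CrawlerUtility code.
import logging

from scrapy import Request

logger = logging.getLogger(__name__)


class SimpleLogRequestUrlMiddleware:
    """Log the URL of every request a spider yields, subject to the filters above."""

    def __init__(self, spider_names, url_patterns):
        self.spider_names = spider_names  # empty list means "log for every spider"
        self.url_patterns = url_patterns  # empty list means "log every URL"

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        return cls(
            s.getlist('SPIDER_SHOW_REQUESTS_URL'),
            s.getlist('PATTERN_SHOW_REQUESTS_URL'),
        )

    def process_spider_output(self, response, result, spider):
        # result is the iterable of items and requests the spider yielded.
        for item_or_request in result:
            if isinstance(item_or_request, Request) and self._should_log(item_or_request, spider):
                logger.info('request url: %s', item_or_request.url)
            yield item_or_request

    def _should_log(self, request, spider):
        if self.spider_names and spider.name not in self.spider_names:
            return False
        if self.url_patterns and not any(p in request.url for p in self.url_patterns):
            return False
        return True
```

Because it hooks `process_spider_output`, the URL is logged when the spider yields the request, before any downloader-level redirects, which is exactly the information that is missing from the default response-only logging.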