# seCrawler (Search Engine Crawler)

A Scrapy project that crawls the search results of Google, Bing, and Baidu for a given keyword.

Adapted from https://github.com/xtt129/seCrawler, with small changes to support Python 3.6; thanks to the original author for sharing.

## prerequisite

Python 3.6 and Scrapy are required.

## commands

Run a single command to fetch 50 pages of results from a search engine for a keyword; the collected links are written to `urls.txt` in the current directory. (A sketch of how the `-a` arguments reach the spider is given at the end of this README.)

#### Bing

```
scrapy crawl keywordSpider -a keyword=Spider-Man -a se=bing -a pages=50
```

#### Baidu

```
scrapy crawl keywordSpider -a keyword=Spider-Man -a se=baidu -a pages=50
```

#### Google

```
scrapy crawl keywordSpider -a keyword=Spider-Man -a se=google -a pages=50
```

## limitation

The project provides no workaround for anti-spider measures such as CAPTCHAs or IP ban lists, and heavy crawling may get your IP banned. To make these measures less likely to trigger, we recommend setting `DOWNLOAD_DELAY = 10` in the settings.py file to add a delay (in seconds) between the crawl of two pages; see [DOWNLOAD_DELAY in the Scrapy settings documentation](https://doc.scrapy.org/en/1.2/topics/settings.html#std:setting-DOWNLOAD_DELAY). A sketch of such a settings.py follows.
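
## throttling example

The snippet below is a minimal settings.py sketch built around the `DOWNLOAD_DELAY = 10` recommendation above. Only `DOWNLOAD_DELAY` is mentioned in this README; the other entries are standard Scrapy settings added here as hedged suggestions, not part of the original project.

```python
# settings.py -- a sketch of throttling options to reduce the chance of
# CAPTCHAs or IP bans. Only DOWNLOAD_DELAY comes from this README; the
# rest are standard Scrapy settings offered as optional extras.

# Wait 10 seconds between requests to the same site (README recommendation).
DOWNLOAD_DELAY = 10

# Randomize each delay to 0.5x-1.5x of DOWNLOAD_DELAY so the request
# timing looks less robotic (this is Scrapy's default behavior).
RANDOMIZE_DOWNLOAD_DELAY = True

# Let Scrapy's AutoThrottle extension adapt the delay to server latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 10
AUTOTHROTTLE_MAX_DELAY = 60

# Keep concurrency minimal; search engines flag parallel scraping quickly.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

Higher delays mean slower crawls: at 10 seconds per request, 50 pages take roughly 8 to 12 minutes per engine, which is the trade-off for staying under the rate limits.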
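
## how the -a arguments reach the spider (sketch)

For readers new to Scrapy, the sketch below shows how the `-a keyword=... -a se=... -a pages=...` options in the commands above reach a spider: Scrapy passes every `-a` pair to the spider's constructor as a keyword argument. This is not the project's actual spider; the spider name `keywordSpider`, the argument names, and the `urls.txt` output come from this README, while the URL templates and the 10-results-per-page offset are illustrative assumptions.

```python
# A minimal sketch, not the project's real keywordSpider implementation.
import scrapy


class KeywordSpider(scrapy.Spider):
    name = "keywordSpider"

    # Hypothetical search-URL templates; real page offsets differ slightly
    # per engine (e.g. Bing's "first" parameter is 1-based).
    SE_TEMPLATES = {
        "bing": "https://www.bing.com/search?q={kw}&first={offset}",
        "baidu": "https://www.baidu.com/s?wd={kw}&pn={offset}",
        "google": "https://www.google.com/search?q={kw}&start={offset}",
    }

    def __init__(self, keyword=None, se="bing", pages=50, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.keyword = keyword
        self.se = se
        self.pages = int(pages)  # -a values always arrive as strings

    def start_requests(self):
        template = self.SE_TEMPLATES[self.se]
        for page in range(self.pages):
            # Assume 10 results per page when computing the offset.
            url = template.format(kw=self.keyword, offset=page * 10)
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Append every extracted link to urls.txt, as the README describes.
        with open("urls.txt", "a") as f:
            for href in response.css("a::attr(href)").extract():
                f.write(href + "\n")
```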