# aio-scrapy

**Repository Path**: conlin_huang/aio-scrapy

## Basic Information

- **Project Name**: aio-scrapy
- **Description**: Ports the Twisted-based scrapy/scrapy-redis to asyncio, using aiohttp to send requests
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-06-14
- **Last Updated**: 2022-06-16

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

![aio-scrapy](./doc/images/aio-scrapy.png)

### aio-scrapy

An asyncio + aio-libs crawler that imitates the Scrapy framework.

English | [中文](./doc/README_ZH.md)

### Overview

- aio-scrapy is based on the open-source projects Scrapy and scrapy_redis.
- aio-scrapy is compatible with scrapyd.
- aio-scrapy provides both a Redis queue and a RabbitMQ queue.
- aio-scrapy is a fast, high-level web crawling and scraping framework, used to crawl websites and extract structured data from their pages.
- Distributed crawling/scraping.
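Replacing Twisted with asyncio means concurrent downloads become ordinary coroutines scheduled on one event loop. A minimal sketch of that model (plain asyncio, not aio-scrapy's API; `fetch` is a hypothetical stand-in that simulates a download with `asyncio.sleep` where a real crawler would await an aiohttp request):

```python
import asyncio


async def fetch(url: str) -> str:
    # Simulate a non-blocking download; a real crawler would await an
    # aiohttp request here instead of sleeping.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def main() -> list:
    urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(3)]
    # gather() schedules all fetches concurrently on the event loop
    # and returns their results in the original order.
    return await asyncio.gather(*(fetch(u) for u in urls))


if __name__ == '__main__':
    pages = asyncio.run(main())
    print(len(pages))  # 3
```

Because every fetch yields at `await`, three simulated downloads overlap instead of running back to back, which is the concurrency model the framework builds on.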
### Requirements

- Python 3.7+
- Works on Linux, Windows, macOS, BSD

### Install

The quick way:

```shell
pip install aio-scrapy -U
```

### Usage

#### Create a project spider:

```shell
aioscrapy startproject project_quotes
```

```shell
cd project_quotes
aioscrapy genspider quotes
```

quotes.py:

```python
from aioscrapy.spiders import Spider


class QuotesMemorySpider(Spider):
    name = 'QuotesMemorySpider'

    start_urls = ['https://quotes.toscrape.com']

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)


if __name__ == '__main__':
    QuotesMemorySpider.start()
```

Run the spider:

```shell
aioscrapy crawl quotes
```

#### Create a single-script spider:

```shell
aioscrapy genspider single_quotes -t single
```

single_quotes.py:

```python
from aioscrapy.spiders import Spider


class QuotesMemorySpider(Spider):
    name = 'QuotesMemorySpider'
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
        # 'DOWNLOAD_DELAY': 3,
        # 'RANDOMIZE_DOWNLOAD_DELAY': True,
        # 'CONCURRENT_REQUESTS': 1,
        # 'LOG_LEVEL': 'INFO'
    }

    start_urls = ['https://quotes.toscrape.com']

    @staticmethod
    async def process_request(request, spider):
        """ request middleware """
        return request

    @staticmethod
    async def process_response(request, response, spider):
        """ response middleware """
        return response

    @staticmethod
    async def process_exception(request, exception, spider):
        """ exception middleware """
        pass

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

    async def process_item(self, item):
        print(item)


if __name__ == '__main__':
    QuotesMemorySpider.start()
```

Run the spider:

```shell
aioscrapy runspider quotes.py
```

### More commands

```shell
aioscrapy -h
```

### Documentation

[doc](./doc/documentation.md)

### Ready

Please submit your suggestions to the owner via an issue.

## Thanks

[aiohttp](https://github.com/aio-libs/aiohttp/)

[scrapy](https://github.com/scrapy/scrapy)