# scrapy-redis

**Repository Path**: lkzhang/scrapy-redis

## Basic Information

- **Project Name**: scrapy-redis
- **Description**: No description available
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-04-28
- **Last Updated**: 2022-08-25

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

============
Scrapy-Redis
============

.. image:: https://readthedocs.org/projects/scrapy-redis/badge/?version=latest
   :alt: Documentation Status
   :target: https://readthedocs.org/projects/scrapy-redis/?badge=latest

.. image:: https://img.shields.io/pypi/v/scrapy-redis.svg
   :target: https://pypi.python.org/pypi/scrapy-redis

.. image:: https://img.shields.io/pypi/pyversions/scrapy-redis.svg
   :target: https://pypi.python.org/pypi/scrapy-redis

.. image:: https://img.shields.io/travis/rmax/scrapy-redis.svg
   :target: https://travis-ci.org/rmax/scrapy-redis

.. image:: https://codecov.io/github/rmax/scrapy-redis/coverage.svg?branch=master
   :alt: Coverage Status
   :target: https://codecov.io/github/rmax/scrapy-redis

.. image:: https://requires.io/github/LuckyPigeon/scrapy-redis/requirements.svg?branch=master
   :alt: Requirements Status
   :target: https://requires.io/github/rmax/scrapy-redis/requirements/?branch=master

.. image:: https://img.shields.io/badge/security-bandit-green.svg
   :alt: Security Status
   :target: https://github.com/rmax/scrapy-redis

Redis-based components for Scrapy.

* Usage: https://github.com/rmax/scrapy-redis/wiki/Usage
* Documentation: https://github.com/rmax/scrapy-redis/wiki
* Release: https://github.com/rmax/scrapy-redis/wiki/History
* Contribution: https://github.com/rmax/scrapy-redis/wiki/Getting-Started
* LICENSE: MIT license

Features
--------

* Distributed crawling/scraping

  You can start multiple spider instances that share a single Redis queue.
  Best suited for broad multi-domain crawls.

* Distributed post-processing

  Scraped items get pushed into a Redis queue, meaning that you can start as
  many post-processing processes as needed, all sharing the same items queue.

* Scrapy plug-and-play components

  Scheduler + Duplication Filter, Item Pipeline, Base Spiders.

* In this forked version: added ``json`` supported data in Redis

  The Redis data contains ``url``, ``meta`` and other optional parameters,
  where ``meta`` is a nested JSON object holding sub-data. This feature
  extracts that data and sends another FormRequest with ``url``, ``meta`` and
  additional ``formdata``. For example:

  .. code-block:: json

     {
         "url": "https://example.com",
         "meta": {"job-id": "123xsd", "start-date": "dd/mm/yy"},
         "url_cookie_key": "fertxsas"
     }

  This data can be accessed in a Scrapy spider through the request, e.g.
  ``request.url``, ``request.meta``, ``request.cookies``; see the
  producer-side sketch after this list.
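As a producer-side illustration, a payload like the JSON above can be queued
with ``redis-py``. This is a minimal sketch: the spider name ``myspider`` and
the ``<spider name>:start_urls`` key are assumptions based on scrapy-redis's
default start-URLs key convention, so adjust both to your project.

.. code-block:: python

   import json

   import redis

   # Connect to the shared Redis instance (assumed to run locally here).
   r = redis.Redis(host="localhost", port=6379)

   # The JSON payload from the example above.
   payload = {
       "url": "https://example.com",
       "meta": {"job-id": "123xsd", "start-date": "dd/mm/yy"},
       "url_cookie_key": "fertxsas",
   }

   # The spider pops entries from this list and builds a request from each.
   # "myspider:start_urls" assumes a spider named "myspider" reading from the
   # default "<spider name>:start_urls" key.
   r.lpush("myspider:start_urls", json.dumps(payload))

Every spider instance watching the same key pops payloads from this list, so
the same snippet also works for feeding a fleet of distributed workers.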
.. note::

   This feature covers the basic case of distributing the workload across
   multiple workers. If you need more features like URL expiration, advanced
   URL prioritization, etc., we suggest you take a look at the Frontera_
   project.

Requirements
------------

* Python 3.7+
* Redis >= 5.0
* ``Scrapy`` >= 2.0
* ``redis-py`` >= 4.0

Installation
------------

From pip

.. code-block:: bash

   pip install scrapy-redis

From GitHub

.. code-block:: bash

   git clone https://github.com/darkrho/scrapy-redis.git
   cd scrapy-redis
   python setup.py install

.. note::

   To use the JSON-supported data feature, make sure you have not installed
   scrapy-redis through pip. If you already did, uninstall it first:

   .. code-block:: bash

      pip uninstall scrapy-redis

Alternative Choice
------------------

Frontera_ is a web crawling framework consisting of a `crawl frontier`_ and
distribution/scaling primitives, allowing you to build large-scale online web
crawlers.

.. _Frontera: https://github.com/scrapinghub/frontera
.. _crawl frontier: http://nlp.stanford.edu/IR-book/html/htmledition/the-url-frontier-1.html
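As a closing quick reference, here is a minimal sketch of enabling the
plug-and-play components from the Features section in a Scrapy project's
``settings.py``. The setting names follow the upstream scrapy-redis
documentation; the ``REDIS_URL`` value is an assumption for a local setup.

.. code-block:: python

   # settings.py -- minimal sketch of enabling the scrapy-redis components.

   # Route all requests through the Redis-backed scheduler and share the
   # duplication filter across every spider instance.
   SCHEDULER = "scrapy_redis.scheduler.Scheduler"
   DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

   # Keep the request queue in Redis between runs so crawls can be paused
   # and resumed.
   SCHEDULER_PERSIST = True

   # Push scraped items into a Redis list for distributed post-processing.
   ITEM_PIPELINES = {
       "scrapy_redis.pipelines.RedisPipeline": 300,
   }

   # Connection to the shared Redis instance (assumed to run locally here).
   REDIS_URL = "redis://localhost:6379"

With this in place, every spider instance pointed at the same ``REDIS_URL``
shares one scheduling queue, one duplication filter, and one items queue.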