# haipproxy

**Repository Path**: resolvewang/haipproxy

## Basic Information

- **Project Name**: haipproxy
- **Description**: HAipproxy is a proxy IP program with three core components: proxy crawling, validation, and scheduling. Its main characteristics are high availability and low latency.
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: https://github.com/ResolveWang/haipproxy
- **GVP Project**: No

## Statistics

- **Stars**: 3
- **Forks**: 3
- **Created**: 2018-02-28
- **Last Updated**: 2024-07-03

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# HAipproxy

[中文文档](README.md) | [README](README_EN.md)

This project crawls proxy IP resources from the Internet. Its goal is to provide an anonymous IP proxy pool with **high availability and low latency** for distributed spiders.

# Features

- Distributed crawlers with high performance, powered by scrapy and redis
- Large-scale proxy IP resources
- HA design for both crawlers and schedulers
- Flexible architecture with task routing
- Support for HTTP/HTTPS and Socks5 proxies
- MIT LICENSE. Feel free to do whatever you want

# Quick start

Please go to [release](https://github.com/SpiderClub/haipproxy/releases) to download the source code; the master branch is unstable.

## Standalone

### Server

- Install Python3 and Redis Server
- Change the redis settings in *[config/settings.py](config/settings.py)* according to your redis conf, such as `REDIS_HOST` and `REDIS_PASSWORD`
- Install [scrapy-splash](https://github.com/scrapy-plugins/scrapy-splash) and change `SPLASH_URL` in *[config/settings.py](config/settings.py)*
- Install dependencies

> pip install -r requirements.txt

- Start the *scrapy workers*, including the proxy IP crawler and validator

> python crawler_booter.py --usage crawler

> python crawler_booter.py --usage validator

- Start the *task schedulers*, including the crawler task scheduler and the validator task scheduler

> python scheduler_booter.py --usage crawler

> python scheduler_booter.py --usage validator

### Client

`haipproxy` provides both a [py client](client/py_cli.py) and a [squid proxy](squid_update.py) for your spiders. Clients in any language are welcome!

#### Python Client

```python3
from client.py_cli import ProxyFetcher

# args are used to connect to redis; if args is None, the redis settings in settings.py are used
args = dict(host='127.0.0.1', port=6379, password='123456', db=0)
# 'https' selects the common proxy pool. If you want to crawl a customized website, you'd better
# write a customized ip validator, using the zhihu validator as a reference
fetcher = ProxyFetcher('https', strategy='greedy', redis_args=args)
# get one proxy ip
print(fetcher.get_proxy())
# get a list of available proxy ips
print(fetcher.get_proxies())  # or print(fetcher.pool)
```
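In a spider, the fetcher is typically wrapped in a small retry loop: fetch a proxy, attempt the request, and switch to a fresh proxy on failure. Below is a minimal sketch of that pattern, assuming the redis settings above and that `get_proxy()` returns a proxy URL string such as `http://1.2.3.4:8080`:

```python3
import requests

from client.py_cli import ProxyFetcher

args = dict(host='127.0.0.1', port=6379, password='123456', db=0)
fetcher = ProxyFetcher('https', strategy='greedy', redis_args=args)


def fetch_with_proxy(url, retries=3):
    """Try the request with up to `retries` different proxies."""
    for _ in range(retries):
        # assumed format: a proxy URL string such as 'http://1.2.3.4:8080'
        proxy = fetcher.get_proxy()
        try:
            resp = requests.get(url, proxies={'http': proxy, 'https': proxy},
                                timeout=10)
            if resp.ok:
                return resp
        except requests.RequestException:
            continue  # this proxy failed, move on to the next one
    raise RuntimeError('all proxies failed for {}'.format(url))


print(fetch_with_proxy('https://httpbin.org/ip').text)
```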
#### Using squid as a proxy server

- Install squid, copy its conf as a backup, and then start squid. Take *ubuntu* as an example:

> sudo apt-get install squid

> sudo cp /etc/squid/squid.conf /etc/squid/squid.conf.backup

> sudo sed -i 's/http_access deny all/http_access allow all/g' /etc/squid/squid.conf

> sudo service squid start

- Change `SQUID_BIN_PATH`, `SQUID_CONF_PATH` and `SQUID_TEMPLATE_PATH` in *[config/settings.py](config/settings.py)* according to your OS
- Update the squid conf periodically (for example, from a cron job)

> sudo python squid_update.py

- After a while, you can send requests through the squid proxy at `http://squid_host:3128`, e.g.

```python3
import requests

proxies = {'https': 'http://127.0.0.1:3128'}
resp = requests.get('https://httpbin.org/ip', proxies=proxies)
print(resp.text)
```

## Dockerize

- Install Docker
- Install docker-compose

> pip install -U docker-compose

- Change `SPLASH_URL` and `REDIS_HOST` in [settings.py](config/settings.py)

```python3
SPLASH_URL = 'http://splash:8050'
REDIS_HOST = 'redis'
```

- Start all the containers using docker-compose

> docker-compose up

- Use [py_cli](client/py_cli.py) or squid to get available proxy IPs:

```python3
from client.py_cli import ProxyFetcher

args = dict(host='127.0.0.1', port=6379, password='123456', db=0)
fetcher = ProxyFetcher('https', strategy='greedy', length=5, redis_args=args)
print(fetcher.get_proxy())
print(fetcher.get_proxies())  # or print(fetcher.pool)
```

or

```python3
import requests

proxies = {'https': 'http://127.0.0.1:3128'}
resp = requests.get('https://httpbin.org/ip', proxies=proxies)
print(resp.text)
```

# Workflow

![](static/workflow.png)

# Other important things

- This project depends heavily on redis; if you want to replace redis with another message queue or database, do so at your own risk
- If there is no Great Firewall in your country, set `proxy_mode=0` in both [gfw_spider.py](crawler/spiders/gfw_spider.py) and [ajax_gfw_spider.py](crawler/spiders/ajax_gfw_spider.py). If you don't want to crawl some websites, set `enable=0` in [rules.py](config/rules.py)
- Because of the Great Firewall in China, some proxy IPs can't be used to crawl certain websites such as Google. You can extend the proxy pool yourself in [spiders](crawler/spiders)
- Issues and PRs are welcome
- Just star it if it's useful to you

# Test Result

Here are the test results for crawling https://zhihu.com using `haipproxy`. The source code can be seen [here](examples/zhihu).

|requests|time|cost|strategy|client|
|-----|----|---|---------|-----|
|0|2018/03/03 22:03|0|greedy|[py_cli](client/py_cli.py)|
|10000|2018/03/03 11:03|1 hour|greedy|[py_cli](client/py_cli.py)|
|20000|2018/03/04 00:08|2 hours|greedy|[py_cli](client/py_cli.py)|
|30000|2018/03/04 01:02|3 hours|greedy|[py_cli](client/py_cli.py)|
|40000|2018/03/04 02:15|4 hours|greedy|[py_cli](client/py_cli.py)|
|50000|2018/03/04 03:03|5 hours|greedy|[py_cli](client/py_cli.py)|
|60000|2018/03/04 05:18|7 hours|greedy|[py_cli](client/py_cli.py)|
|70000|2018/03/04 07:11|9 hours|greedy|[py_cli](client/py_cli.py)|
|80000|2018/03/04 08:43|11 hours|greedy|[py_cli](client/py_cli.py)|

# Reference

Thanks to all the contributors of the following projects:

[dungproxy](https://github.com/virjar/dungproxy)
[proxyspider](https://github.com/zhangchenchen/proxyspider)
[ProxyPool](https://github.com/henson/ProxyPool)
[proxy_pool](https://github.com/jhao104/proxy_pool)
[ProxyPool](https://github.com/WiseDoge/ProxyPool)
[IPProxyTool](https://github.com/awolfly9/IPProxyTool)
[IPProxyPool](https://github.com/qiyeboy/IPProxyPool)
[proxy_list](https://github.com/gavin66/proxy_list)
[proxy_pool](https://github.com/lujqme/proxy_pool)