# LatestValidProxies

**Repository Path**: CodexploRe/LatestValidProxies

## Basic Information

- **Project Name**: LatestValidProxies
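- **Description**: LatestValidProxies is one of the modules in the ChenUtils package. It was written to support the author's web-scraping studies and fetches currently valid anonymous proxy IPs.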
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 2
- **Forks**: 0
- **Created**: 2023-08-27
- **Last Updated**: 2025-03-29

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# LatestValidProxies

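## Introduction
This is one of the sub-packages of my package, ChenUtils. It supports my web-scraping studies by fetching currently valid anonymous proxy IPs. I am new to Python, so treat the code as a reference only. For more about ChenUtils, see [the project page](https://pypi.org/project/ChenUtils/).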

## Installation

Run the following command in a cmd window. Make sure a Python runtime and pip are available first:

```cmd
pip install ChenUtils
```

Then, in Python, import what you need with a `from ChenUtils.LatestValidProxies import ...` statement.

## Usage

##### 1. Get one valid anonymous proxy IP

Spiders.py provides five spider classes I wrote, one for each proxy-listing site:
```python
BeesProxySpider  # www.beesproxy.com
Ip89Spider  # www.89ip.cn
KuaidailiSpider  # www.kuaidaili
Ip66Spider  # www.66ip.cn
IhuanSpider  # ip.ihuan.me
```

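Instantiate a spider and call `get_one_useful_proxy`. It returns a Proxy object whose attributes expose ip, port, speed, and other details.
BeesProxySpider is recommended: its IPs are of relatively high quality and are fetched fairly quickly. Example: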

```python
from ChenUtils.LatestValidProxies.Spiders import BeesProxySpider
# from LatestValidProxies.Proxy import Proxy


beesproxy_spider = BeesProxySpider()
proxy = beesproxy_spider.get_one_useful_proxy()
# print(proxy.__dict__)
print(proxy.ip, proxy.port)
```
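To use the returned proxy in a crawl, its `ip` and `port` usually have to be turned into a proxy URL. The sketch below is not part of ChenUtils: `to_requests_proxies` is a hypothetical helper, and it assumes the proxy speaks plain HTTP, which the source does not confirm.

```python
def to_requests_proxies(ip, port, scheme="http"):
    """Build a proxies mapping in the shape the requests library accepts.

    Hypothetical helper: the scheme is an assumption, not something
    the Proxy object is known to report.
    """
    address = f"{scheme}://{ip}:{port}"
    # The same proxy address is used for both http and https targets.
    return {"http": address, "https": address}


# Address taken from the sample log output in this README.
proxies = to_requests_proxies("61.133.66.69", 9002)
print(proxies)
```

With the `requests` library installed, the mapping can be passed as `requests.get(url, proxies=proxies, timeout=5)`.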

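##### 2. Get a list of valid high-anonymity proxy IPs

Call `get_useful_proxies` on the spider object. It returns a list of Proxy objects. The `count` parameter defaults to `sys.maxsize`; when `count` is not set, the spider crawls the default number of pages and returns every valid proxy IP it finds as a list. Example: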

```python
from ChenUtils.LatestValidProxies.Spiders import BeesProxySpider


beesproxy_spider = BeesProxySpider()
for proxy in beesproxy_spider.get_useful_proxies():
    print(proxy.ip, proxy.port)
```

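Without `count`, the spider crawls and validates proxy IPs from as many pages as the `self.max_pages` attribute specifies, which takes a long time, so I do not recommend this usage. Instead, pass the number of valid proxy IPs you need as `count`; this greatly reduces the waiting time. Example: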

```python
for proxy in beesproxy_spider.get_useful_proxies(5):
    print(proxy.ip, proxy.port)
```
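Why a small `count` saves time can be illustrated with a generic early-stop loop. This is a minimal sketch and not the library's actual implementation; `take_valid` and `fake_check` are hypothetical names, and the real spider validates proxies in threads rather than one by one.

```python
import sys


def take_valid(candidates, is_valid, count=sys.maxsize):
    """Return up to `count` candidates that pass `is_valid`, stopping early."""
    found = []
    for candidate in candidates:
        if len(found) >= count:
            break  # enough results: skip all remaining checks
        if is_valid(candidate):
            found.append(candidate)
    return found


checked = []


def fake_check(n):
    # Stand-in for a slow proxy validation; records what was checked.
    checked.append(n)
    return n % 2 == 0


print(take_valid(range(100), fake_check, count=3))  # [0, 2, 4]
print(len(checked))  # 5 checks instead of 100
```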

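##### 3. Get different proxies within a short time

Although proxy-listing sites refresh their data periodically, a proxy IP may get blocked in the middle of a crawl. Calling `get_one_useful_proxy` again shortly afterwards will very likely return the same proxy IP, because both this method and `get_useful_proxies` crawl pages in order. To address this, ChenUtils 0.0.7 and later add an `is_random` parameter to both methods. It defaults to `False`; passing `True` shuffles the `urls` list, so pages are parsed in a different order and the result differs from earlier calls. Example: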

```python
from ChenUtils.LatestValidProxies.Spiders import BeesProxySpider


beesproxy_spider = BeesProxySpider()
proxy = beesproxy_spider.get_one_useful_proxy(is_random=True)
```
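The effect of `is_random` on page order can be sketched as follows. This illustrates the idea described above rather than the library's code; `ordered_urls` and its `rng` argument are hypothetical, and the URL pattern is taken from the sample log further down.

```python
import random


def ordered_urls(urls, is_random=False, rng=None):
    """Return the urls in crawl order; shuffle a copy when is_random is True."""
    result = list(urls)  # copy, so the caller's list is left untouched
    if is_random:
        (rng or random).shuffle(result)
    return result


pages = [f"https://www.beesproxy.com/free/page/{n}" for n in range(1, 6)]
print(ordered_urls(pages)[0])                  # always page 1
print(ordered_urls(pages, is_random=True)[0])  # any of the five pages
```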


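##### 4. Use more parameters to tailor a spider instance

Subclasses of BaseSpider expose several parameters for more specific needs. The parent class's `__init__` method is defined as follows: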

```python
class BaseSpider(object):
    def __init__(self, urls, group_xpath, detail_xpath, show_logs, max_pages,
                 encoding, max_worker, datetime_format, highest_latency):
        self.urls = urls
        self.group_xpath = group_xpath
        self.detail_xpath = detail_xpath
        self._to_end = False
        self.show_logs = show_logs
        self.max_pages = max_pages
        self.encoding = encoding
        self.max_worker = max_worker
        self.datetime_format = datetime_format
        self.test_timeout = highest_latency
```

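You can use the following parameters in subclasses to meet specific needs:

* **highest_latency**:

  The maximum latency allowed for a returned proxy IP

* **max_pages**:

  Overrides the default maximum number of pages to crawl

* **max_worker**:

  The maximum number of spider threads

* **encoding**:

  Overrides the decoding used; change it if the proxy location fetched by a spider class shows up as garbled Chinese text

* **datetime_format**:

  The format in which the proxy IP's timestamp is stored

* **show_logs**:

  The spider classes' `show_logs` parameter defaults to `False`, so no run logs are shown. To follow the spider's progress, pass `True`, as in the example below: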

  ```python
  beesproxy_spider = BeesProxySpider(show_logs=True)
  proxy = beesproxy_spider.get_one_useful_proxy()
  print(proxy.ip, proxy.port)
  ```

  The example above makes the logger print output like the following to the terminal:

  ```python
  2023-08-30 22:00:56 Spider.py [line:25] INFO: Trying to establish a connection with https://www.beesproxy.com/free/page/1
  2023-08-30 22:00:58 Spider.py [line:36] INFO: Collecting web proxy IP information
  2023-08-30 22:00:58 Spider.py [line:78] INFO: Detecting proxy ip: 60.205.132.71:80
  2023-08-30 22:00:58 Spider.py [line:78] INFO: Detecting proxy ip: 183.230.162.122:9091
  2023-08-30 22:00:58 Spider.py [line:78] INFO: Detecting proxy ip: 111.43.105.50:9091
  2023-08-30 22:00:58 Spider.py [line:78] INFO: Detecting proxy ip: 61.133.66.69:9002
  2023-08-30 22:00:58 Spider.py [line:78] INFO: Detecting proxy ip: 112.250.110.172:9091
  2023-08-30 22:00:58 Spider.py [line:78] INFO: Detecting proxy ip: 111.20.217.178:9091
  2023-08-30 22:01:02 Spider.py [line:78] INFO: Detecting proxy ip: 120.196.188.21:9091
  2023-08-30 22:01:02 Spider.py [line:98] INFO: Found a valid anonymous proxy ip: 61.133.66.69:9002
  2023-08-30 22:01:02 Spider.py [line:102] INFO: This crawling proxy IP takes: 5.9 s, waiting for the end of other threads, estimated time consuming 0~5 s
  61.133.66.69 9002
  ```
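How `max_worker` and `highest_latency` could interact is sketched below: candidates are probed in a thread pool, and only those that answer within the latency limit are kept. This is an assumption about the general technique, not the code in BaseSpider; `filter_by_latency` and `fake_probe` are hypothetical, and the latency is assumed to be in seconds.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def filter_by_latency(candidates, probe, highest_latency=5, max_worker=5):
    """Probe candidates in parallel; keep those answering within highest_latency seconds."""
    def timed(candidate):
        start = time.perf_counter()
        try:
            ok = probe(candidate)
        except Exception:
            ok = False  # a failing probe counts as an invalid proxy
        return candidate, ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=max_worker) as pool:
        results = list(pool.map(timed, candidates))  # keeps input order
    return [c for c, ok, elapsed in results if ok and elapsed <= highest_latency]


def fake_probe(address):
    # Stand-in for a real request through the proxy: port 80 is "slow".
    if address.endswith(":80"):
        time.sleep(0.3)
    return True


fast = filter_by_latency(["60.205.132.71:80", "61.133.66.69:9002"],
                         fake_probe, highest_latency=0.15)
print(fast)
```

A larger `max_worker` checks more candidates at once, while a smaller `highest_latency` rejects slow proxies sooner.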

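##### 5. Add support for crawling a custom site

The first step is to build a spider class for the target site:

1. Inherit from BaseSpider
2. Build the `urls` list from the site's URL pattern
3. Find the group XPath and the per-row XPath of the site's table
4. Write the spider class following the format of the example

Example: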

```python
from ChenUtils.LatestValidProxies.BaseSpider import BaseSpider


class XXSpider(BaseSpider):
    def __init__(self, show_logs=False, max_pages=10, encoding=None,
                 max_worker=5, datetime_format='%Y-%m-%d %H:%M:%S', highest_latency=5):
        urls = [f'https://xxxxx/{page}' for page in range(1, max_pages + 1)]
        group_xpath = '//*[@id="xxxx"]/xxxx/table/tbody/tr'  # XPath for the proxy IP table rows
        detail_xpath = {
            'ip': './td[1]/text()',  # XPath of the text in the ip cell
            'port': './td[2]/text()',  # XPath of the text in the port cell
            'area': './td[3]/text()'  # XPath of the text in the area cell
        }
        super().__init__(urls=urls, group_xpath=group_xpath, detail_xpath=detail_xpath,
                         show_logs=show_logs, max_pages=max_pages, encoding=encoding,
                         max_worker=max_worker, datetime_format=datetime_format, highest_latency=highest_latency)
```

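After that, instantiate the class and call its methods as described in the usage notes above. If a site needs special handling, override or add methods in the spider class.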