# funspider

**Repository Path**: sugarysp/funspider

## Basic Information

- **Project Name**: funspider
- **Description**: A practice project that combines funboost and feapder
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-10-17
- **Last Updated**: 2026-03-14

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README


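## Note:
This framework has not been tested in production, and a better alternative now exists.
The release of the packaged framework core has been abandoned: it was originally meant to handle orchestrated (workflow-style) tasks, which funboost now supports better. https://funboost.readthedocs.io/zh-cn/latest/articles/c4b.html#b-8-funboost-workfolw

funboost's boost_spider is sufficient for most needs. Just add this project's buffered writes at the point where data is stored in the database, a small optimization that avoids a single-row insert for every item, and use it directly.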


# {{ cookiecutter.project_name }}

> A crawler project based on the FunSpider framework

## 📝 Project Information

- **Project name**: {{ cookiecutter.project_name }}
- **Created**: {% now 'local', '%Y-%m-%d' %}
- **Python version**: 3.7+
- **Framework**: FunSpider (Feapder + Funboost)

## 🚀 Quick Start

### 1. Install dependencies

```bash
pip install -r requirements.txt
```

### 2. Configure the database

Edit `funboost_config.py` to configure the database connections:

```python
class BrokerConnConfig(_DefaultBroker):
    """Middleware (broker) connection settings"""
    REDIS_HOST = '127.0.0.1'
    REDIS_PORT = 6379
    
    MYSQL_HOST = '127.0.0.1'
    MYSQL_PORT = 3306
    MYSQL_USER = 'root'
    MYSQL_PASSWORD = 'your_password'
    MYSQL_DATABASE = 'spider_db'

class FunspiderSettings(_DefaultSettings):
    """Crawler settings"""
    PROJECT_NAME = '{{ cookiecutter.project_name }}'
    BATCH_SIZE = 2000
    LOG_LEVEL = 'INFO'
```

### 3. Create a spider

```bash
# Create a new spider in the spiders directory
funspider create -s my_spider
```

### 4. Write the spider code

```python
# spiders/my_spider.py
from funspider import BaseSpider, Request

class MySpider(BaseSpider):
    name = 'my_spider'
    
    def start_requests(self):
        """Spider entry point"""
        yield Request(
            url='https://example.com',
            callback=self.parse
        )
    
    def parse(self, request, response):
        """Parse the page"""
        title = response.xpath('//title/text()').extract_first()
        
        # Yield the data (written to the database in batches automatically)
        yield {
            'table': 'items',
            'url': request.url,
            'title': title,
        }

if __name__ == '__main__':
    spider = MySpider()
    spider.start()
```

### 5. Run the spider

```bash
# Run in a single process
python spiders/my_spider.py

# Or run it directly in PyCharm
# Right-click my_spider.py -> Run
```

## 📁 Project Structure

```
{{ cookiecutter.project_name }}/
├── spiders/                    # Spider modules
│   ├── __init__.py
│   └── demo_spider.py          # Example spider
├── funboost_config.py          # Project configuration
├── nb_log_config.py            # Logging configuration
├── pyproject.toml              # Python project configuration
├── requirements.txt            # Dependency list
└── README.md                   # This file
```

## ⚙️ Configuration

### funboost_config.py configuration priority

```
Framework defaults (funspider/funboost_config.py)
       ↓
User project configuration (funboost_config.py)  ← the current file
       ↓
Per-spider custom configuration (Spider.__custom_setting__)
```
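
The layering above can be sketched with `collections.ChainMap`, which returns the first match when searching its mappings from left to right. This is only an illustration, assuming that each layer lower in the diagram overrides the one above it; the three dictionaries and the `resolve_settings` helper are hypothetical and are not part of FunSpider.

```python
from collections import ChainMap

# Hypothetical values for the three layers; not FunSpider's real defaults.
FRAMEWORK_DEFAULTS = {"BATCH_SIZE": 1000, "LOG_LEVEL": "INFO", "CONSUMER_COUNT": 1}
PROJECT_SETTINGS = {"BATCH_SIZE": 2000}
SPIDER_CUSTOM_SETTING = {"LOG_LEVEL": "DEBUG"}


def resolve_settings(*layers):
    """Merge layers given from lowest to highest priority into one dict."""
    # ChainMap returns the first match, so the highest-priority layer goes first.
    return dict(ChainMap(*reversed(layers)))


settings = resolve_settings(FRAMEWORK_DEFAULTS, PROJECT_SETTINGS, SPIDER_CUSTOM_SETTING)
print(settings["BATCH_SIZE"], settings["LOG_LEVEL"], settings["CONSUMER_COUNT"])
```

Here the project layer replaces the framework's `BATCH_SIZE`, the spider layer replaces `LOG_LEVEL`, and `CONSUMER_COUNT` falls through to the framework value.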

### Common configuration options

#### Database configuration
```python
class BrokerConnConfig(_DefaultBroker):
    # Redis
    REDIS_HOST = 'localhost'
    REDIS_PORT = 6379
    REDIS_PASSWORD = ''
    
    # MySQL
    MYSQL_HOST = 'localhost'
    MYSQL_PORT = 3306
    MYSQL_USER = 'root'
    MYSQL_PASSWORD = ''
    MYSQL_DATABASE = 'spider'
    
    # MongoDB
    MONGO_CONNECT_URL = 'mongodb://localhost:27017'
```

#### Crawler configuration
```python
class FunspiderSettings(_DefaultSettings):
    # Project name
    PROJECT_NAME = '{{ cookiecutter.project_name }}'
    
    # Batching
    BATCH_SIZE = 2000              # Batch size
    ITEM_FLUSH_INTERVAL = 60       # Flush interval (seconds)
    
    # Pipeline configuration
    ITEM_PIPELINES = {
        'funspider.pipelines.mysql_pipeline.MysqlPipeline': 300,
    }
    
    # Log level
    LOG_LEVEL = 'INFO'  # DEBUG/INFO/WARNING/ERROR
    
    # Distributed settings
    CONSUMER_COUNT = 1  # Number of consumers
    BROKER_KIND = 'redis_ack_able'  # Message queue type
```

## 🎯 Usage Examples

### Example 1: Basic spider

```python
class BasicSpider(BaseSpider):
    name = 'basic'
    
    def start_requests(self):
        yield Request('https://example.com', callback=self.parse)
    
    def parse(self, request, response):
        yield {
            'table': 'pages',
            'url': request.url,
            'title': response.xpath('//title/text()').extract_first(),
        }
```

### Example 2: Spider with deduplication

```python
class DedupeSpider(BaseSpider):
    name = 'dedupe'
    
    def start_requests(self):
        yield Request(
            url='https://example.com',
            filter_str='unique_key',  # Deduplication key
            callback=self.parse
        )
```

### Example 3: Spider with custom settings

```python
class CustomSpider(BaseSpider):
    name = 'custom'
    
    # Custom settings (highest priority)
    __custom_setting__ = {
        'BATCH_SIZE': 5000,
        'LOG_LEVEL': 'DEBUG',
        'CONSUMER_COUNT': 4,
    }
```

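## 🔧 Advanced Features

### Distributed execution

Run the same command on multiple machines and the crawl is distributed across them automatically: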

```bash
# Machine 1
python spiders/my_spider.py

# Machine 2
python spiders/my_spider.py

# Machine 3
python spiders/my_spider.py
```

### Retrying failures

The framework handles failed items automatically: it stores them in Redis and supports retrying them:

```python
__custom_setting__ = {
    'ITEM_MAX_RETRY_TIMES': 3,  # Retry at most 3 times
}
```
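
As a rough illustration of what a bounded retry looks like, the sketch below re-queues a failed item until its retry budget is used up. It is not FunSpider's implementation: a `deque` stands in for the Redis queue, and `save_with_retry` and `flaky_save` are hypothetical names introduced only for this example.

```python
from collections import deque

ITEM_MAX_RETRY_TIMES = 3


def save_with_retry(items, save, max_retries=ITEM_MAX_RETRY_TIMES):
    """Try to save each item; re-queue failures until retries are exhausted.

    Returns the items that still failed after `max_retries` retries.
    """
    # Each entry is (item, retries_used); the deque stands in for a Redis queue.
    queue = deque((item, 0) for item in items)
    dead = []
    while queue:
        item, retries = queue.popleft()
        try:
            save(item)
        except Exception:
            if retries < max_retries:
                queue.append((item, retries + 1))
            else:
                dead.append(item)
    return dead


attempts = {}


def flaky_save(item):
    """Hypothetical writer: 'bad' always fails, other items fail twice first."""
    attempts[item] = attempts.get(item, 0) + 1
    if item == "bad" or attempts[item] < 3:
        raise RuntimeError("write failed")


failed = save_with_retry(["a", "bad"], flaky_save)
print(failed, attempts)
```

With a limit of 3, an item is attempted at most four times in this sketch: the initial attempt plus three retries.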

### Performance monitoring

InfluxDB can optionally be integrated for performance monitoring:

```python
__custom_setting__ = {
    'ENABLE_METRICS': True,
    'INFLUXDB_HOST': 'localhost',
    'INFLUXDB_PORT': 8086,
}
```

## 📊 Data Pipelines

### Using the MySQL pipeline

```python
ITEM_PIPELINES = {
    'funspider.pipelines.mysql_pipeline.MysqlPipeline': 300,
}

# Specify the table name when yielding data
yield {
    'table': 'my_table',
    'field1': 'value1',
    'field2': 'value2',
}
```

### Using the MongoDB pipeline

```python
ITEM_PIPELINES = {
    'funspider.pipelines.mongo_pipeline.MongoPipeline': 300,
}

# Specify the collection when yielding data
yield {
    'table': 'my_collection',
    'data': {...},
}
```

### Using multiple pipelines

```python
ITEM_PIPELINES = {
    'funspider.pipelines.mysql_pipeline.MysqlPipeline': 300,
    'funspider.pipelines.mongo_pipeline.MongoPipeline': 310,
    'funspider.pipelines.es_pipeline.ElasticsearchPipeline': 320,
}
```
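
How the numeric values are interpreted is not stated here. The sketch below assumes the convention used by Scrapy's `ITEM_PIPELINES` setting, where pipelines with lower numbers run first; whether FunSpider follows that convention is an assumption, and `ordered_pipelines` is a hypothetical helper.

```python
# Same mapping as above, deliberately listed out of order.
ITEM_PIPELINES = {
    'funspider.pipelines.es_pipeline.ElasticsearchPipeline': 320,
    'funspider.pipelines.mysql_pipeline.MysqlPipeline': 300,
    'funspider.pipelines.mongo_pipeline.MongoPipeline': 310,
}


def ordered_pipelines(pipelines):
    """Return the pipeline paths sorted by ascending priority number."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = ordered_pipelines(ITEM_PIPELINES)
for path in order:
    print(path)
```

Under that assumption an item would pass through the MySQL, MongoDB, and Elasticsearch pipelines in that order, regardless of how the dictionary is written.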

## 📝 Logging

### Using the default logger

```python
from funspider.utils.fun_logger import logger

logger.info('Spider started')
logger.warning('Warning message')
logger.error('Error message')
```

### Creating a custom logger

```python
from funspider.utils.fun_logger import get_logger

logger = get_logger('my_spider')
logger.info('Custom log message')
```

## 🐛 Debugging Tips

### 1. Enable DEBUG logging

```python
__custom_setting__ = {
    'LOG_LEVEL': 'DEBUG',
}
```

### 2. Check the ItemBuffer status

Look for the batch-insert messages in the log to confirm that items are being processed correctly.

### 3. Inspect the Redis queues

```bash
redis-cli
> KEYS *{{ cookiecutter.project_name }}*
> LLEN queue_name
```

## 📚 Learning Resources

- [FunSpider documentation](https://github.com/your-repo/funspider)
- [Funboost documentation](https://funboost.readthedocs.io/)
- [Feapder documentation](https://geekdaxue.co/read/feapder-doc/)

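## ❓ FAQ

### Q: Why isn't data written to the database immediately?
A: ItemBuffer writes in batches: data is written only once BATCH_SIZE items have accumulated or ITEM_FLUSH_INTERVAL seconds have elapsed.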
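
The two thresholds can be illustrated with a small buffer that flushes on size or on age. This is a sketch of the general technique, not FunSpider's `ItemBuffer`: the `BufferedWriter` class and its parameters are hypothetical, and it only checks the interval when an item is added, whereas a real implementation may flush from a background timer.

```python
import time


class BufferedWriter:
    """Collect items and hand them to `write_batch` in groups."""

    def __init__(self, write_batch, batch_size=2000, flush_interval=60.0,
                 clock=time.monotonic):
        self._write_batch = write_batch
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self._clock = clock
        self._items = []
        self._last_flush = clock()

    def add(self, item):
        """Buffer one item and flush if either threshold has been reached."""
        self._items.append(item)
        too_many = len(self._items) >= self._batch_size
        too_old = self._clock() - self._last_flush >= self._flush_interval
        if too_many or too_old:
            self.flush()

    def flush(self):
        """Write all buffered items as one batch and reset the timer."""
        if self._items:
            self._write_batch(self._items)
            self._items = []
        self._last_flush = self._clock()


batches = []
writer = BufferedWriter(batches.append, batch_size=3)
for n in range(7):
    writer.add(n)
writer.flush()  # write the remainder, for example at shutdown
print(batches)
```

Seven items with a batch size of 3 produce two full batches and one final partial batch, so the database sees three writes instead of seven.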

### Q: How do I switch the message queue?
A: Change the `BROKER_KIND` setting; redis, redis_ack_able, rabbitmq, kafka, and others are supported.

### Q: How do I deduplicate requests?
A: Pass the `filter_str` parameter to `Request`; the framework deduplicates automatically using Redis.
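
The idea behind key-based deduplication can be sketched as follows. This is not FunSpider's code: an in-memory `set` stands in for the Redis set, hashing the key with SHA-256 is an illustrative choice, and `SeenFilter` is a hypothetical name.

```python
import hashlib


class SeenFilter:
    """In-memory stand-in for a Redis set of request fingerprints."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, filter_str):
        """Return True the first time a key is seen, False afterwards."""
        fingerprint = hashlib.sha256(filter_str.encode("utf-8")).hexdigest()
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True


seen = SeenFilter()
keys = ["page-1", "page-2", "page-1"]
to_crawl = [key for key in keys if seen.add_if_new(key)]
print(to_crawl)
```

With a shared Redis set in place of the local one, every worker would consult the same record of seen keys, which is what makes the approach usable in a distributed crawl.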

## 📄 License

MIT License

---

**Happy Coding! 🎉**