# data-reptile

**Repository Path**: data-asset-management/data-reptile

## Basic Information

- **Project Name**: data-reptile
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-05-26
- **Last Updated**: 2026-05-26

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

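# Beijing Data Exchange Scrapy Multi-Spider Project

This is a Scrapy project that supports multiple spiders. It currently includes a spider for the Beijing International Big Data Exchange (北京国际大数据交易所).

## Project Structure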

```
beijing_scrapy/
├── scrapy.cfg
├── beijing_scrapy/
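│   ├── settings.py              # Global settings
│   ├── items.py                 # Data models (multiple Items can be defined)
│   ├── pipelines.py             # Item pipelines (shared by all spiders)
│   ├── middlewares.py           # Middlewares (shared by all spiders)
│   ├── database.py              # Database connection
│   └── spiders/
│       ├── beijing_spider.py    # Beijing spider
│       ├── example_spider.py    # Example spider template
│       └── ...                  # Add more spiders here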
```

## Available Spiders

### 1. beijing - Beijing International Big Data Exchange spider

```bash
scrapy crawl beijing
```

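**Features**:
- Crawls product data from the Beijing International Big Data Exchange
- Fetches the product list with POST requests
- Fetches product details with GET requests
- Parses HTML content with BeautifulSoup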

## How to Run a Spider

### Run a specific spider

```bash
# Run the Beijing spider
scrapy crawl beijing

# Set the number of pages to crawl
scrapy crawl beijing -a total_page=5

# Save the results as JSON
scrapy crawl beijing -o output.json
```
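
Scrapy passes every `-a` value to the spider as a string, so `total_page=5` arrives as `"5"`. Below is a minimal sketch of a conversion helper the spider could call. The function name `parse_total_page` and the default of one page are assumptions for illustration and are not taken from `beijing_spider.py`.

```python
def parse_total_page(value, default=1):
    """Convert a -a total_page=... argument to a positive int.

    Hypothetical helper: Scrapy hands spider arguments over as strings
    (or not at all), so missing or invalid values fall back to `default`.
    """
    if value is None:
        return default
    try:
        pages = int(value)
    except (TypeError, ValueError):
        return default
    # Reject zero and negative page counts as well.
    return pages if pages >= 1 else default
```

Inside a spider this might be called as `parse_total_page(getattr(self, 'total_page', None))`, since Scrapy's default `Spider.__init__` stores `-a` arguments as attributes.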

### List all available spiders

```bash
scrapy list
```

Example output:
```
beijing
example
```

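### Run multiple spiders at once

```bash
# Option 1: start each spider as a separate background process
scrapy crawl beijing &
scrapy crawl example &
wait  # block until both background crawls finish

# Option 2: use CrawlerProcess (requires writing a script)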
```
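
Option 2 can be sketched as a small script placed next to `scrapy.cfg`. `CrawlerProcess` and `get_project_settings` are part of Scrapy's public API; the helper names `unique_names` and `run_spiders` and the choice of spiders are assumptions for illustration.

```python
def unique_names(names):
    """Drop duplicate spider names while keeping their original order."""
    return list(dict.fromkeys(names))


def run_spiders(names):
    """Run the named spiders concurrently in one process."""
    # Imported here so the helper above can be used without Scrapy installed.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    for name in unique_names(names):
        process.crawl(name)  # schedule the spider; nothing runs yet
    process.start()  # blocks until every scheduled spider has finished
```

Calling `run_spiders(['beijing', 'example'])` from the project root should start both spiders in the same Twisted reactor. Note that `process.start()` can only be called once per Python process, because the reactor cannot be restarted.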

## How to Add a New Spider

### Step 1: Copy the template

```bash
cd data_reptile/data_reptile/spiders
cp example_spider.py new_spider.py
```

### Step 2: Edit the spider file

Edit `new_spider.py`:

```python
import scrapy


class NewSpider(scrapy.Spider):
    # Change the spider name (must be unique within the project)
    name = 'new_spider'

    # Change the allowed domains
    allowed_domains = ['new-domain.com']

    # Change the API endpoints
    list_url = 'https://api.new-domain.com/list'
    detail_url = 'https://api.new-domain.com/detail'

    # Implement the parse_list and parse_detail methods
    # ...
```

### Step 3: (Optional) Create a new Item

If the new spider's data structure differs from the existing ones, add a new Item class in `items.py`:

```python
import scrapy


class NewItem(scrapy.Item):
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    # ...
```

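### Step 4: (Optional) Create a dedicated Pipeline

If the new spider needs special data processing, add a new Pipeline in `pipelines.py`:

```python
class NewPipeline:
    def process_item(self, item, spider):
        # Every enabled pipeline sees items from every spider, so filter by name
        if spider.name == 'new_spider':
            # Spider-specific processing goes here
            pass
        return item
```

Then register it in `settings.py` (pipelines with lower numbers run first):

```python
ITEM_PIPELINES = {
    'data_reptile.pipelines.BeijingPipeline': 300,
    'data_reptile.pipelines.NewPipeline': 310,
}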
```
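
As a sketch of what the spider-specific branch of such a pipeline might contain, the class below trims whitespace from string fields, but only for items coming from `new_spider`. The class name `WhitespacePipeline` and the cleaning rule are assumptions for illustration. A plain `dict` and a `SimpleNamespace` stand in for a real Item and spider so the snippet runs without Scrapy.

```python
from types import SimpleNamespace


class WhitespacePipeline:
    """Hypothetical pipeline: strips string fields for one spider only."""

    target_spider = 'new_spider'

    def process_item(self, item, spider):
        if spider.name == self.target_spider:
            # Copy the pairs first so the item can be updated while looping.
            for key, value in list(item.items()):
                if isinstance(value, str):
                    item[key] = value.strip()
        return item  # always return the item so later pipelines receive it


# Stand-ins for a real spider and item (assumptions for the demo).
spider = SimpleNamespace(name='new_spider')
cleaned = WhitespacePipeline().process_item(
    {'field1': '  hello ', 'field2': 3}, spider
)
print(cleaned)
```

Items from any other spider pass through unchanged, which is what lets one `ITEM_PIPELINES` setting serve every spider in the project.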

### Step 5: Test the new spider

```bash
# List all spiders to confirm the new one is registered
scrapy list

# Run the new spider
scrapy crawl new_spider
```

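## Multi-Spider Best Practices

### 1. Item design

**Option A: shared Item** (recommended when spiders have similar data structures)

```python
# items.py
import scrapy


class UniversalItem(scrapy.Item):
    source_id = scrapy.Field()
    product_name = scrapy.Field()
    # ... common fields
```

**Option B: separate Items** (recommended when data structures differ significantly)

```python
# items.py
import scrapy


class BeijingItem(scrapy.Item):
    # Fields specific to the Beijing spider
    pass


class OtherItem(scrapy.Item):
    # Fields specific to other spiders
    pass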
```

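### 2. Pipeline design

**Route items by spider name**:

```python
class UniversalPipeline:
    def process_item(self, item, spider):
        if spider.name == 'beijing':
            # Data from the Beijing spider
            table_name = 'beijing_dep'
        elif spider.name == 'other':
            # Data from the other spider
            table_name = 'other_dep'
        else:
            # Unknown spider: pass the item through without storing it
            return item

        # Shared storage logic (save_to_db is left for you to implement)
        self.save_to_db(item, table_name)
        return item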
```
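
The routing above can be sketched end to end with a lookup table and a concrete `save_to_db`. The README does not say which database `database.py` connects to, so this sketch assumes an in-memory SQLite database, a two-column schema, and the class name `RoutingPipeline`. Only the table names come from the example above.

```python
import sqlite3
from types import SimpleNamespace


class RoutingPipeline:
    """Hypothetical pipeline that picks a table from the spider name."""

    # Spider name -> table name (fixed whitelist, never user input).
    TABLES = {'beijing': 'beijing_dep', 'other': 'other_dep'}

    def open_spider(self, spider):
        # Assumption: in-memory SQLite stands in for the real database.
        self.conn = sqlite3.connect(':memory:')
        for table in self.TABLES.values():
            self.conn.execute(
                f'CREATE TABLE IF NOT EXISTS {table} '
                '(source_id TEXT, product_name TEXT)'
            )

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        table = self.TABLES.get(spider.name)
        if table is not None:  # unknown spiders pass through unsaved
            self.save_to_db(item, table)
        return item

    def save_to_db(self, item, table):
        # Values are bound as parameters; only the whitelisted table
        # name is interpolated into the SQL text.
        self.conn.execute(
            f'INSERT INTO {table} (source_id, product_name) VALUES (?, ?)',
            (item.get('source_id'), item.get('product_name')),
        )
        self.conn.commit()


# Demo with stand-ins for a spider and an item.
pipeline = RoutingPipeline()
beijing = SimpleNamespace(name='beijing')
pipeline.open_spider(beijing)
pipeline.process_item({'source_id': '1', 'product_name': 'demo'}, beijing)
row_count = pipeline.conn.execute(
    'SELECT COUNT(*) FROM beijing_dep'
).fetchone()[0]
print(row_count)
```

A dictionary lookup keeps the routing in one place, so adding a spider means adding one entry rather than another `elif` branch.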

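### 3. Configuration management

**Give each spider its own settings**:

```python
# settings.py

# Global settings
CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY = 3

# Per-spider settings: define this as a class attribute inside the
# spider class (not in settings.py); it overrides the values above
custom_settings = {
    'DOWNLOAD_DELAY': 5,
    'CONCURRENT_REQUESTS': 4,
}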
```
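
Scrapy resolves each setting by priority: project settings in `settings.py` are overridden by a spider's `custom_settings`, which are in turn overridden by `-s` options on the command line. The sketch below models that order with plain dictionaries. It illustrates the precedence only and is not Scrapy's actual `Settings` implementation; the function name `effective_settings` is hypothetical.

```python
def effective_settings(project, spider_custom=None, cmdline=None):
    """Merge setting layers; later layers win on conflicting keys."""
    merged = dict(project)
    merged.update(spider_custom or {})
    merged.update(cmdline or {})
    return merged


project = {'CONCURRENT_REQUESTS': 8, 'DOWNLOAD_DELAY': 3}
custom = {'DOWNLOAD_DELAY': 5, 'CONCURRENT_REQUESTS': 4}
result = effective_settings(project, custom, cmdline={'CONCURRENT_REQUESTS': 2})
print(result)
```

In this model the command-line value for `CONCURRENT_REQUESTS` wins, while `DOWNLOAD_DELAY` comes from the spider's `custom_settings`.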

### 4. Separate logs

```python
# Set inside the spider class; use the spider's own name in the path
custom_settings = {
    'LOG_FILE': 'logs/beijing.log',
}
```
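
Because `LOG_FILE` is opened through Python's `logging` module, the `logs/` directory most likely has to exist before the crawl starts. Below is a minimal sketch of a helper that builds the per-spider path and creates the directory; the name `log_file_for` and the `logs` default are assumptions.

```python
from pathlib import Path


def log_file_for(spider_name, log_dir='logs'):
    """Return '<log_dir>/<spider_name>.log', creating log_dir if needed."""
    directory = Path(log_dir)
    directory.mkdir(parents=True, exist_ok=True)
    return str(directory / f'{spider_name}.log')
```

In a spider this could be used as `custom_settings = {'LOG_FILE': log_file_for('beijing')}`.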

## Command Quick Reference

```bash
# List all spiders
scrapy list

# Run a specific spider
scrapy crawl <spider_name>

# Fetch a URL and show how the given spider parses it
scrapy parse --spider=<spider_name> <url>

# Inspect a setting
scrapy settings --get BOT_NAME

# Run and save the results
scrapy crawl beijing -o beijing.json
scrapy crawl example -o example.csv

# Adjust concurrency
scrapy crawl beijing -s CONCURRENT_REQUESTS=4

# Debug mode (verbose logging)
scrapy crawl beijing -L DEBUG
```

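## Example: Adding the Beibu Gulf Spider to This Project

To integrate the beibu spider into this project as well:

1. Add `BeibuItem` to `items.py`
2. Create `beibu_spider.py` in the `spiders/` directory
3. Add support for beibu in `pipelines.py`
4. Run: `scrapy crawl beibu`

This lets you manage spiders for multiple exchanges in a single project.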

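## Notes

1. **Unique spider names**: each spider's `name` attribute must be unique
2. **Item reuse**: spiders with similar data structures can share an Item
3. **Pipeline routing**: use `spider.name` to determine where an item came from
4. **Configuration isolation**: each spider can have its own settings
5. **Log separation**: give each spider its own log file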

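## Advantages

✅ **Unified management**: all spiders live in one project  
✅ **Shared code**: Pipelines, Middlewares, and other components are reused  
✅ **Easy maintenance**: one set of settings and dependencies  
✅ **Flexible extension**: new spiders are easy to add  
✅ **Resource optimization**: resource usage can be coordinated across spiders  

---

**Happy crawling!** 🚀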