# beeize-scraper-example

**Repository Path**: beeize_enterprise/beeize-scraper-example

## Basic Information

- **Project Name**: beeize-scraper-example
- **Description**: beeize-scraper-example
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 2
- **Created**: 2024-04-08
- **Last Updated**: 2025-02-13

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Quick Start

## Development Workflow

### Reference project layout

```
beeize-scraper-example/
    requirements.txt
    input_schema.json
    Dockerfile
    output_schema.json
    README.md
    main.py
```

### Reference SDK storage layout

```
storage/
    datasets/
        default/
            000000001.json
            000000002.json
            000000003.json
            000000004.json
            __metadata__.json
    kv_stores/
        default/
            demo1.mp4
            demo2.mp4
            demo3.mp4
            demo4.mp4
            __metadata__.json
    request_queues/
        default/
            8m3Ssk32vNgBp4p.json
            9NQYbiNlWlaJYci.json
            09nTuJT7y87FGXs.json
            EmPwJk4NaJQPzpS.json
            __metadata__.json
```

### Example main.py

```python
import json
import re

import requests
from beeize.scraper import Scraper
from loguru import logger
from parsel import Selector

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ('
                  'KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0',
}

column_maps = {
    '第一关注': '/first/care/diyi',
    '中国': '/generalColumns/zhongguo',
    '国际': '/generalColumns/gj',
    '观点': '/generalColumns/guandian',
}

url_maps = {
    '/first/care/diyi': 'https://www.cankaoxiaoxi.com/json/channel/diyi/list.json',
    '/generalColumns/zhongguo': 'https://www.cankaoxiaoxi.com/json/channel/zhongguo/list.json',
    '/generalColumns/gj': 'https://www.cankaoxiaoxi.com/json/channel/gj/list.json',
    '/generalColumns/guandian': 'https://www.cankaoxiaoxi.com/json/channel/guandian/list.json',
}


class CanKaoXiaoXi:

    def __init__(self):
        self.scraper = Scraper()
        self._input = self.scraper.input
        self.queue = self.scraper.request_queue       # request queue
        self.kv_store = self.scraper.key_value_store  # key-value store
        self.start_urls = self._input.get_list('start_urls')          # read from an array input
        self.page_number = self._input.get_int('page_number')         # read from an integer input
        self.download_media = self._input.get_bool('download_media')  # read from a boolean input
        self.column = self._input.get_string('column')                # read from a string input
        self.proxies = {
            'http': self._input.get_random_proxy(),
            'https': self._input.get_random_proxy(),
        }  # proxy configuration

    def fetch(self, url, retry_count=0):
        try:
            response = requests.get(
                url=url,
                headers=headers,
                proxies=self.proxies,
            )
            return response.text
        except Exception:
            if retry_count < 3:
                return self.fetch(url, retry_count + 1)

    def add_task(self):
        # get start URLs from the array input
        for start_url in self.start_urls:
            url = start_url.get('url')
            api_url = url_maps.get(url.split('#')[-1])
            request = {'url': api_url, 'type': 'list_page', 'page_number': 1}
            self.queue.add_request(request)

        api_url = url_maps.get(column_maps.get(self.column))
        request = {'url': api_url, 'type': 'list_page', 'page_number': 1}
        self.queue.add_request(request)

    def run(self):
        self.add_task()
        while not self.queue.is_finished():
            # fetch the next task from the queue
            request = self.queue.fetch_next_request()
            logger.info(request.get('url'))  # log the URL being processed
            if request['type'] == 'list_page':
                list_item = json.loads(self.fetch(request.get('url'))).get('list')
                for item in list_item:
                    data = item.get('data')
                    detail_request = {'url': data.get('url'), 'type': 'detail_page', 'data': data}
                    self.queue.add_request(detail_request)
                if request.get('page_number') < self.page_number:
                    request['page_number'] = request.get('page_number') + 1
                    # append the page number as a fragment so the queue treats it as a new request
                    request['url'] = request.get('url') + f"#{request['page_number']}"
                    self.queue.add_request(request)
            if request['type'] == 'detail_page':
                content_txt = re.findall(r'var contentTxt =\"(.*)\";', self.fetch(request.get('url')))
                if not content_txt:
                    continue
                content_txt = content_txt[0]
                content_txt = re.sub(r'\\', '', content_txt)
                content = Selector(text=content_txt).xpath('string(.)').extract_first()
                result = request['data']
                result['content'] = content
                self.scraper.push_data(result)


if __name__ == '__main__':
    CanKaoXiaoXi().run()
```

### Example Dockerfile

```Dockerfile
# Use the official Python image as the base image
# FROM python:3.8

# Use Alibaba Cloud's Python image as the base image (Docker Hub is blocked in mainland China, so the official image may fail to pull)
# If your code needs a Node environment to run JS, switch the base image to registry.cn-shanghai.aliyuncs.com/beeize-public/python-nodejs:python3.11-nodejs22
FROM alibaba-cloud-linux-3-registry.cn-hangzhou.cr.aliyuncs.com/alinux3/python:3.11.1

# Set the working directory
WORKDIR /app

# Copy the project files into the container
COPY . /app

# Install dependencies
RUN pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

# Set environment variables
ENV TZ=Asia/Shanghai

# Command executed when the container starts
CMD python main.py
```

### input_schema

Inputs are passed to the scraper as environment variables, driven by input_schema.json.

> Note: input keys are converted to uppercase when exposed as environment variables.
>
> For example, if the key in input_schema.json is user_name, the environment variable is USER_NAME.
>
> {style="warning"}

#### Basic input

```json
{
    "title": "Your application name",
    "type": "object",
    "description": "Input configuration",
    "schemaVersion": 1,
    "properties": {
        "the key name exposed to users": {
            "sectionCaption": "Section title (collapsible)",
            "title": "What this key does",
            "type": "the value type",
            "description": "Description",
            "prefill": "the default value",
            "required": true
        }
    }
}
```

#### array type

**Input**

```json
"start_urls": {
    "sectionCaption": "Basic settings",
    "title": "Start URLs",
    "type": "array",
    "description": "The initial list of URLs the scraper starts crawling from.",
    "prefill": [
        {
            "url": "https://beeize.com"
        }
    ],
    "editor": "requestListSources",
    "required": true
}
```

**Output**

```
START_URLS='[{"url":"http://beeize.com"}]'
```

**Python read**

```python
json.loads(os.getenv('START_URLS'))
```

#### boolean type

**Input**

```json
"url_extract": {
    "title": "Link extraction",
    "type": "boolean",
    "description": "Description",
    "prefill": true
}
```

**Output**

```
URL_EXTRACT='true'
```

**Python read**

```python
os.getenv('URL_EXTRACT') == 'true'
```

#### string type

**Input**

```json
"title_xpath": {
    "title": "Title extraction",
    "type": "string",
    "description": "Description",
    "editor": "textfield",
    "prefill": null
}
```

**Output**

```
TITLE_XPATH='//text()'
```

**Python read**

```python
os.getenv('TITLE_XPATH')
```

#### integer type

**Input**

```json
"max_page": {
    "title": "Maximum page count",
    "type": "integer",
    "description": "Description",
    "minimum": 0,
    "prefill": null
}
```

**Output**

```
MAX_PAGE='0'
```
**Python read**

```python
int(os.getenv('MAX_PAGE', 1))
```

### output_schema

#### Supported component types (the `component` field)

Currently only `table` is supported.

#### Supported style types (the `type` field)

| Style type | Data type | Display |
|--------|------------|----------------------------------------|
| string | text | shown as-is |
| number | numbers, including floats | shown as-is |
| link | URL | clickable; opens the link in a new window |
| image | image | rendered as an image |
| file | file | rendered as a video player; for other formats such as PDF or Word, use `link` instead |
| bool | boolean | shown as ❎ or ✅ |
| array | array | shows the item count; click to expand the JSON details |
| object | object | click to expand; shown as JSON |

#### Example

```json
{
    "component": "table",
    "columns": {
        "title": {
            "title": "Title",
            "type": "string"
        },
        "publishTime": {
            "title": "Publish time",
            "type": "string"
        },
        "content": {
            "title": "Content",
            "type": "string"
        },
        "userId": {
            "title": "User ID",
            "type": "string"
        },
        "userName": {
            "title": "Username",
            "type": "string"
        },
        "url": {
            "title": "Article link",
            "type": "link"
        }
    }
}
```

### Writing the README

The README is how developers tell users how to use the scraper: where the input comes from, how to fill it in, and what the output contains.

## Publishing Workflow

1. Click Scraper - Publish (blue button)
2. Select the project's Git repository
3. On the redirect page, enter your account and password to log in
4. Click Agree to authorize
5. Select the repository that contains your scraper
6. The scraper is created successfully; go to Build
7. Wait for the build status to show Success
8. Publish (the last step): edit the scraper name, upload an icon, add a description and categories, then Save (blue button)
9. Click Publish (green button)
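As a recap of the input_schema section, the four per-type read patterns can be combined into one minimal, standard-library-only sketch. The environment values set below are placeholders mirroring the examples in this README, not real platform output; in production the platform injects these variables (or you use the SDK's `scraper.input` accessors instead):

```python
import json
import os

# Placeholder values standing in for what the platform would inject,
# matching the example outputs shown above
os.environ['START_URLS'] = '[{"url": "https://beeize.com"}]'
os.environ['URL_EXTRACT'] = 'true'
os.environ['TITLE_XPATH'] = '//text()'
os.environ['MAX_PAGE'] = '3'

start_urls = json.loads(os.getenv('START_URLS', '[]'))  # array type
url_extract = os.getenv('URL_EXTRACT') == 'true'        # boolean type
title_xpath = os.getenv('TITLE_XPATH')                  # string type
max_page = int(os.getenv('MAX_PAGE', 1))                # integer type

print(start_urls[0]['url'], url_extract, title_xpath, max_page)
```

Note that every environment variable arrives as a string, so each non-string type needs its own decoding step: `json.loads` for arrays, an explicit `== 'true'` comparison for booleans, and `int()` with a fallback default for integers.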