# Third Assignment

Issue #IBBX6P · To Do · opened by 张诚坤 (owner) on 2024-12-17 18:07
### Assignment ①: Weather Image Spider

#### 1. Core Code

- **`weather_spider.py`**

```python
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
import sys

sys.path.append('D:\\张诚坤aaa老母猪批发\\weather_images')
from weather_images.pipelines import WeatherImagesPipeline


class WeatherSpider(scrapy.Spider):
    name = 'weather_spider'
    allowed_domains = ['weather.com.cn']
    start_urls = ['http://weather.com.cn/']

    def parse(self, response):
        images = response.css('img::attr(src)').getall()
        for image_url in images:
            if image_url.startswith('http'):
                yield Request(image_url, callback=self.save_image)

    def save_image(self, response):
        image_guid = response.url.split('/')[-1]
        yield {
            'image_urls': [response.url],
            'image_name': image_guid
        }


class WeatherImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item


process = CrawlerProcess(settings={
    'ITEM_PIPELINES': {'weather_images.pipelines.WeatherImagesPipeline': 1},
    'IMAGES_STORE': 'D:\\张诚坤aaa老母猪批发\\images',
    'CONCURRENT_REQUESTS': 22,
    'DOWNLOAD_DELAY': 1,
    'CLOSESPIDER_ITEMCOUNT': 122,
    'CLOSESPIDER_PAGECOUNT': 22,
})
process.crawl(WeatherSpider)
process.start()
```

- **`pipelines.py`**

```python
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem


class WeatherImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
```

- **`settings.py`**

```python
BOT_NAME = 'weather_images'

SPIDER_MODULES = ['weather_images.spiders']
NEWSPIDER_MODULE = 'weather_images.spiders'

ITEM_PIPELINES = {
    'weather_images.pipelines.WeatherImagesPipeline': 1,
}
IMAGES_STORE = 'D:\\张诚坤aaa老母猪批发\\images'

CONCURRENT_REQUESTS = 32   # multi-threaded
# CONCURRENT_REQUESTS = 1  # single-threaded
DOWNLOAD_DELAY = 1
CLOSESPIDER_ITEMCOUNT = 122
CLOSESPIDER_PAGECOUNT = 22

ROBOTSTXT_OBEY = True
```

#### 2. Reflections

Writing this weather image spider gave me a real appreciation of how powerful and flexible Scrapy is. The framework streamlines both page scraping and image downloading, and its asynchronous processing improves throughput. Scrapy's `ImagesPipeline` makes it straightforward to collect image URLs and to download and store the files, while tuning the number of concurrent requests and the download delay keeps the load on the target site under control. The middleware and pipeline mechanisms make data processing and storage very flexible, which improves the spider's extensibility and maintainability.
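By default, `ImagesPipeline` names downloaded files after a hash of their URL, so the `image_name` field yielded by the spider above is not actually used for storage. If keeping the original file names were desired, one option is to override `file_path`; the sketch below is only an illustration (the class name is hypothetical, and it assumes a Scrapy version where `file_path` receives the item as a keyword argument, i.e. 2.4 or later).

```python
import os

from scrapy.pipelines.images import ImagesPipeline


class NamedWeatherImagesPipeline(ImagesPipeline):
    """Sketch: store each image under its original file name instead of the
    default hash-based name. Assumes items carry an 'image_name' field, as
    the spider above yields."""

    def file_path(self, request, response=None, info=None, *, item=None):
        # Fall back to the last URL segment if the item has no name field.
        name = (item or {}).get('image_name') or os.path.basename(request.url)
        return f'full/{name}'
```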
### Assignment ②: Stock Data Spider

#### 1. Core Code

- **`stocks.py`**

```python
import scrapy
import json
from scrapy.exceptions import CloseSpider
from stock_scraper.items import StockScraperItem


class StockSpider(scrapy.Spider):
    name = 'stocks'
    allowed_domains = ['eastmoney.com']
    start_urls = [
        'https://push2.eastmoney.com/api/qt/clist/get?cb=jQuery112409840494931556277_1633338445629&pn=1&pz=10&po=1&np=1&fltt=2&invt=2&fid=f3&fs=b:MK0021&fields=f12,f14,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f18,f15,f16,f17,f23'
    ]

    def parse(self, response):
        try:
            json_response = response.json()
            data = json_response.get('data', {})
            stock_diff = data.get('diff', [])
            if not stock_diff:
                self.logger.warning("No stock data found in the response.")
                return
            for stock in stock_diff:
                item = StockScraperItem()
                item['stock_code'] = stock.get('f12', 'N/A')
                item['stock_name'] = stock.get('f14', 'N/A')
                item['latest_price'] = self._parse_float(stock, 'f2', 0.0)
                item['change_percent'] = self._parse_float(stock, 'f3', 0.0)
                item['change_amount'] = self._parse_float(stock, 'f4', 0.0)
                item['volume'] = stock.get('f5', '0')
                item['turnover'] = stock.get('f6', '0')
                item['amplitude'] = self._parse_float(stock, 'f7', 0.0)
                item['high'] = self._parse_float(stock, 'f15', 0.0)
                item['low'] = self._parse_float(stock, 'f16', 0.0)
                item['open_price'] = self._parse_float(stock, 'f17', 0.0)
                item['yesterday_close'] = self._parse_float(stock, 'f18', 0.0)
                yield item
        except json.JSONDecodeError as e:
            self.logger.error(f"Failed to parse JSON: {e}")
            raise CloseSpider("Invalid JSON response")
        except Exception as e:
            self.logger.error(f"An error occurred: {e}")
            raise CloseSpider("An unexpected error occurred")

    def _parse_float(self, stock_data, key, default=0.0):
        value = stock_data.get(key)
        if value is None:
            return default
        try:
            return float(value)
        except (ValueError, TypeError):
            self.logger.warning(f"Invalid value for key '{key}': {value}")
            return default
```

- **`pipelines.py`**

```python
import mysql.connector
from scrapy.exceptions import DropItem
import pymysql


class MySQLPipeline:
    def open_spider(self, spider):
        self.con = pymysql.connect(host="localhost", port=3306, user="root",
                                   passwd="192837465", db="stocks", charset="utf8")
        self.cursor = self.con.cursor(pymysql.cursors.DictCursor)
        self.opened = True
        self.count = 0
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS stock (
                stock_code VARCHAR(20),
                stock_name VARCHAR(255),
                latest_price FLOAT,
                change_percent FLOAT,
                change_amount FLOAT,
                volume VARCHAR(20),
                turnover VARCHAR(20),
                amplitude FLOAT,
                high FLOAT,
                low FLOAT,
                open_price FLOAT,
                yesterday_close FLOAT,
                PRIMARY KEY(stock_code)
            )
        ''')
        self.con.commit()

    def close_spider(self, spider):
        self.cursor.close()
        self.con.close()

    def process_item(self, item, spider):
        try:
            self.cursor.execute('''
                REPLACE INTO stock (stock_code, stock_name, latest_price, change_percent,
                                    change_amount, volume, turnover, amplitude, high, low,
                                    open_price, yesterday_close)
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            ''', (
                item['stock_code'], item['stock_name'], item['latest_price'],
                item['change_percent'], item['change_amount'], item['volume'],
                item['turnover'], item['amplitude'], item['high'], item['low'],
                item['open_price'], item['yesterday_close']
            ))
            self.con.commit()
        except mysql.connector.Error as e:
            spider.logger.error(f"Error saving item to MySQL: {e}")
            raise DropItem(f"Error saving item: {e}")
        return item
```

- **`settings.py`**

```python
# Scrapy settings for stock_scraper project
BOT_NAME = 'stock_scraper'

SPIDER_MODULES = ['stock_scraper.spiders']
NEWSPIDER_MODULE = 'stock_scraper.spiders'

ITEM_PIPELINES = {
    'stock_scraper.pipelines.MySQLPipeline': 300,
}

MYSQL_HOST = 'localhost'      # database host
MYSQL_DATABASE = 'stock'      # database name
MYSQL_USER = 'root'           # database user
MYSQL_PASSWORD = '789789789'  # database password

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36 Edg/130.0.0.0'

REDIRECT_ENABLED = False
LOG_LEVEL = 'DEBUG'

CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 1

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

- **`items.py`**

```python
import scrapy


class StockScraperItem(scrapy.Item):
    stock_code = scrapy.Field()
    stock_name = scrapy.Field()
    latest_price = scrapy.Field()
    change_percent = scrapy.Field()
    change_amount = scrapy.Field()
    volume = scrapy.Field()
    turnover = scrapy.Field()
    amplitude = scrapy.Field()
    high = scrapy.Field()
    low = scrapy.Field()
    open_price = scrapy.Field()
    yesterday_close = scrapy.Field()
```

#### 2. Reflections

This assignment deepened my grasp of Scrapy's JSON parsing, data extraction, and error-handling mechanisms, and of using a custom pipeline to store the results in MySQL. Asynchronous requests combined with the error handling keep the data complete and the spider stable, and connecting to the database with pymysql to perform the insert operations makes storage efficient. The exercise gave me a better understanding of Scrapy's flexibility and of database operations, laying a foundation for more complex scraping tasks.
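One caveat about the start URL above: with the `cb=jQuery...` parameter the Eastmoney endpoint returns JSONP (the JSON body wrapped in a `jQuery...(...)` callback), in which case `response.json()` may fail and close the spider. Dropping the `cb` parameter from the URL is the simplest fix; alternatively the wrapper can be stripped before parsing. The helper below is only a sketch (its name and the exact regular expression are illustrative, not part of the original project).

```python
import json
import re


def strip_jsonp(text: str) -> dict:
    """Extract the JSON payload from a JSONP response such as
    'jQuery112409...({"data": ...});'. Falls back to plain JSON parsing
    when no callback wrapper is present."""
    match = re.search(r'\((\{.*\})\)\s*;?\s*$', text, re.S)
    return json.loads(match.group(1) if match else text)


# Inside StockSpider.parse one could then write:
#     json_response = strip_jsonp(response.text)
# instead of calling response.json().
```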
### Assignment ③: Bank of China Exchange Rate Spider

#### 1. Core Code

- **`boc_spider.py`**

```python
import scrapy
from boc_spider.boc_spider.items import BocExchangeRateItem
from bs4 import BeautifulSoup


class ExchangeRateSpider(scrapy.Spider):
    name = "boc_spider"
    start_urls = ['https://www.boc.cn/sourcedb/whpj/']

    def parse(self, response):
        try:
            soup = BeautifulSoup(response.body, 'lxml')
            table = soup.find_all('table')[1]
            rows = table.find_all('tr')
            rows.pop(0)
            for row in rows:
                item = BocExchangeRateItem()
                columns = row.find_all('td')
                item['currency'] = columns[0].text.strip()
                item['cash_buy'] = columns[1].text.strip()
                item['cash_sell'] = columns[2].text.strip()
                item['spot_buy'] = columns[3].text.strip()
                item['spot_sell'] = columns[4].text.strip()
                item['exchange_rate'] = columns[5].text.strip()
                item['publish_date'] = columns[6].text.strip()
                item['publish_time'] = columns[7].text.strip()
                yield item
        except Exception as err:
            self.logger.error(f"An error occurred: {err}")
```

- **`pipelines.py`**

```python
import pymysql
from scrapy.exceptions import DropItem


class BocExchangeRatePipeline(object):
    def __init__(self, mysql_host, mysql_db, mysql_user, mysql_password, mysql_port):
        self.mysql_host = mysql_host
        self.mysql_db = mysql_db
        self.mysql_user = mysql_user
        self.mysql_password = mysql_password
        self.mysql_port = mysql_port

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mysql_host=crawler.settings.get('MYSQL_HOST'),
            mysql_db=crawler.settings.get('MYSQL_DB'),
            mysql_user=crawler.settings.get('MYSQL_USER'),
            mysql_password=crawler.settings.get('MYSQL_PASSWORD'),
            mysql_port=crawler.settings.get('MYSQL_PORT'),
        )

    def open_spider(self, spider):
        self.connection = pymysql.connect(
            host=self.mysql_host,
            user=self.mysql_user,
            password=self.mysql_password,
            db=self.mysql_db,
            port=self.mysql_port,
            charset='utf8mb4',
            cursorclass=pymysql.cursors.DictCursor
        )

    def close_spider(self, spider):
        self.connection.close()

    def process_item(self, item, spider):
        with self.connection.cursor() as cursor:
            sql = """
                INSERT INTO exchange_rates
                    (currency, cash_buy, cash_sell, spot_buy, spot_sell,
                     exchange_rate, publish_date, publish_time)
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
            """
            cursor.execute(sql, (
                item['currency'], item['cash_buy'], item['cash_sell'],
                item['spot_buy'], item['spot_sell'], item['exchange_rate'],
                item['publish_date'], item['publish_time']
            ))
        self.connection.commit()
        return item
```

- **`settings.py`**

```python
BOT_NAME = "boc_spider"

SPIDER_MODULES = ["boc_spider.spiders"]
NEWSPIDER_MODULE = "boc_spider.spiders"

MYSQL_HOST = 'localhost'
MYSQL_DB = 'boc_db'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '789789789'
MYSQL_PORT = 3306

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

ITEM_PIPELINES = {
    'boc_spider.pipelines.BocExchangeRatePipeline': 300,
}

ROBOTSTXT_OBEY = True
```

- **`items.py`**

```python
import scrapy


class BocExchangeRateItem(scrapy.Item):
    currency = scrapy.Field()
    cash_buy = scrapy.Field()
    cash_sell = scrapy.Field()
    spot_buy = scrapy.Field()
    spot_sell = scrapy.Field()
    exchange_rate = scrapy.Field()
    publish_date = scrapy.Field()
    publish_time = scrapy.Field()
```
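The pipeline above assumes an `exchange_rates` table already exists; its definition is not shown in the report. A minimal sketch of a compatible table, created the same way the stock pipeline creates its table, is given below. The helper name and the column types are assumptions (the spider yields plain strings, so everything is stored as text here).

```python
import pymysql


def ensure_exchange_rates_table(connection: pymysql.connections.Connection) -> None:
    """Create the exchange_rates table expected by BocExchangeRatePipeline
    if it does not exist yet. Column types are assumed; the spider yields
    plain strings for every field."""
    with connection.cursor() as cursor:
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS exchange_rates (
                id INT AUTO_INCREMENT PRIMARY KEY,
                currency VARCHAR(64),
                cash_buy VARCHAR(32),
                cash_sell VARCHAR(32),
                spot_buy VARCHAR(32),
                spot_sell VARCHAR(32),
                exchange_rate VARCHAR(32),
                publish_date VARCHAR(32),
                publish_time VARCHAR(32)
            ) DEFAULT CHARSET = utf8mb4
        ''')
    connection.commit()
```

Calling this once from `open_spider`, right after `pymysql.connect()`, would let the pipeline run against a fresh database.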
#### 2. Reflections

This assignment combined Scrapy with BeautifulSoup for page parsing and data extraction, and used a custom pipeline to store the scraped exchange-rate data in MySQL. By walking the nested HTML table structure, the spider extracts the cash and spot buy/sell prices for each listed currency together with the publish date and time of every quote. pymysql handles the database connection and the insert statements, keeping the stored data complete and consistent, and the exercise built up experience for more complex page-scraping tasks.
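A side note on the design choice in `boc_spider.py`: Scrapy responses already expose XPath/CSS selectors, so the same table walk can be written without the extra BeautifulSoup/lxml step. The sketch below is a rough equivalent of the `parse` method under the same layout assumptions the BeautifulSoup version makes (rates in the second `<table>`, first row a header, eight `<td>` cells per data row); the spider class and name are hypothetical.

```python
import scrapy

# Import path mirrors the project layout used in the report.
from boc_spider.boc_spider.items import BocExchangeRateItem


class ExchangeRateSelectorSpider(scrapy.Spider):
    """Sketch of the same extraction using Scrapy's built-in selectors."""

    name = "boc_spider_selectors"
    start_urls = ['https://www.boc.cn/sourcedb/whpj/']

    def parse(self, response):
        # Same assumptions as the BeautifulSoup version: the rates live in
        # the second table on the page and the first row is a header.
        rows = response.xpath('(//table)[2]//tr')[1:]
        fields = ['currency', 'cash_buy', 'cash_sell', 'spot_buy',
                  'spot_sell', 'exchange_rate', 'publish_date', 'publish_time']
        for row in rows:
            tds = row.xpath('./td')
            if len(tds) < len(fields):
                continue  # skip spacer or malformed rows
            item = BocExchangeRateItem()
            for field, td in zip(fields, tds):
                item[field] = (td.xpath('string(.)').get() or '').strip()
            yield item
```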