# LearnSpider

**Repository Path**: S1xe/LearnSpider

## Basic Information

- **Project Name**: LearnSpider
- **Description**: No description available
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2018-01-29
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

#### Using Scrapy

Inspect the contents of a book page, using [A Light in the Attic](http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html) as an example:

```python
# Start the Scrapy shell from a terminal first:
#   scrapy shell http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
view(response)  # open the downloaded page in a browser

# Select the product information table
sel = response.css('table.table.table-striped')

# First row
sel.xpath('(.//tr)[1]/td/text()').extract()
# ['a897fe39b1053632']

# Second-to-last row: pull the number out of "(22 available)"
sel.xpath('(.//tr)[last()-1]/td/text()').re_first(r'\((\d+) available\)')
# '22'

# Last row
sel.xpath('(.//tr)[last()]/td/text()').extract()
# ['0']
```

Get the link to each book:

```python
# Run inside the Scrapy shell
fetch('http://books.toscrape.com/')
view(response)

from scrapy.linkextractors import LinkExtractor

# Only extract links found inside each book's <article class="product_pod">
le = LinkExtractor(restrict_css='article.product_pod')
le.extract_links(response)
```

Create the project:

```shell
scrapy startproject toscrape_book
cd toscrape_book/
scrapy genspider books books.toscrape.com
```

### Spider writing order

1. Create the spider class by subclassing `Spider`
2. Give the spider a name
3. Specify the starting URL for the crawl
4. Implement the parse function for the list page
5. Implement the parse function for each book page

### HTTP proxy

```
http://proxy-list.org
https://free_proxy-list.net
http://www.xicidaili.com
http://www.proxy360.cn
http://www.kuaidaili.com
```

### HTTP check

Request this URL through a proxy to see which IP address the server receives:

```
http://httpbin.org/ip
```

### NOTE

1. Multiple classes separated by spaces

   If an element looks like `<table class="table table-striped">`, join the class names with dots in the selector: `response.css('table.table.table-striped')`