# Python Spider1

**Repository Path**: ldb-gitee/python-spider1

## Basic Information

- **Project Name**: Python Spider1
- **Description**: Python Crawler Development and Project Practice (Python爬虫开发与项目实战)
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-04-28
- **Last Updated**: 2021-08-29

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Python Spider1

Python Crawler Development and Project Practice (Python爬虫开发与项目实战)



Declare the source encoding at the top of each file: `# coding:utf-8`

If Chinese text comes out garbled when writing a CSV file, see [https://www.cnblogs.com/phyger/p/9561283.html](https://www.cnblogs.com/phyger/p/9561283.html): open the file with `codecs` and specify the encoding, for example `gbk`.
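A minimal sketch of the `codecs` approach, assuming the rows are Unicode strings and the file will be opened by a tool that expects GBK. The helper name `write_gbk_csv`, the sample rows, and the temporary path are illustrative, and the plain comma join does not quote fields that themselves contain commas.

```python
# coding:utf-8
import codecs
import os
import tempfile


def write_gbk_csv(path, rows, encoding='gbk'):
    """Write rows of Unicode strings as comma-separated lines in the given encoding."""
    # codecs.open encodes each Unicode string on write, on Python 2 and 3 alike.
    with codecs.open(path, 'w', encoding=encoding) as f:
        for row in rows:
            f.write(u','.join(row) + u'\r\n')


# Hypothetical sample data and output location.
rows = [[u'书名', u'作者'], [u'示例书', u'某作者']]
csv_path = os.path.join(tempfile.mkdtemp(), 'books.csv')
write_gbk_csv(csv_path, rows)
```

Passing `encoding='utf-8'` instead works the same way when the consumer of the file expects UTF-8.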

When a list, tuple, or dict contains Chinese text, printing it shows Unicode escape sequences instead of the characters.
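This behaviour comes from Python 2 printing the `repr` of each element in a container. One common workaround, which these notes do not mention and which is offered here only as a sketch, is to serialize the container with `json.dumps(..., ensure_ascii=False)`. The sample data is made up.

```python
# coding:utf-8
import json

# Hypothetical data containing Chinese text.
data = {u'title': u'爬虫', u'tags': [u'中文', u'测试']}

# ensure_ascii=False keeps non-ASCII characters instead of emitting \uXXXX escapes.
readable = json.dumps(data, ensure_ascii=False, sort_keys=True)
print(readable)
```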

Printing Chinese text raises this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 33: ordinal not in range(128)

The fix is to add the following code at the top of the file (Python 2 only):
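```
    # Python 2 only: reload(sys) restores setdefaultencoding, which site.py removes at startup
    import sys
    print sys.getdefaultencoding()  # 'ascii' by default
    reload(sys)
    sys.setdefaultencoding('utf8')
```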
Alternatively, call `encode('utf-8')` on every string that is output,  
or open the file with `codecs.open('filename', 'mode', encoding='utf-8')` so that the encoding is specified explicitly.
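The two alternatives can be sketched as follows, assuming the text is held as a Unicode string. The file name and the sample string are placeholders.

```python
# coding:utf-8
import codecs
import os
import tempfile

text = u'中文输出'
out_path = os.path.join(tempfile.mkdtemp(), 'out.txt')  # placeholder path

# Alternative 1: encode explicitly and write the resulting bytes.
with open(out_path, 'wb') as f:
    f.write(text.encode('utf-8'))

# Alternative 2: let codecs.open do the encoding on write (appending here).
with codecs.open(out_path, 'a', encoding='utf-8') as f:
    f.write(text)
```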

To decode percent-escaped characters in a URL: `urllib.unquote(str(url))`
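A small sketch of unquoting a URL. `urllib.unquote` exists only in Python 2; in Python 3 the same function lives in `urllib.parse`, so the import below tries both. The URL is a made-up example.

```python
try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote  # Python 2

# Hypothetical URL with a percent-encoded space.
url = 'https://example.com/search?q=python%20spider'
decoded = unquote(url)
print(decoded)
```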


# selenium
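- Calling `clear()` on an input sometimes does not clear the existing value
- Error: Other element would receive the click  
-- Option 1: perform the click through JavaScript, for example:
    `driver.execute_script('arguments[0].click()', next_page)`; see [reference](https://www.cnblogs.com/yp19970/p/12888881.html)
-- Option 2: use ActionChains, for example:
    first move to the element with `ActionChains(driver).move_to_element(next_page).perform()`, then call `next_page.click()`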
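The two click workarounds above can be wrapped as small helpers. This is a sketch, not a definitive implementation: `js_click` and `hover_then_click` are hypothetical names, and `driver` and `element` are assumed to be a Selenium WebDriver and a located WebElement such as `next_page`.

```python
def js_click(driver, element):
    """Click through JavaScript, which skips Selenium's check for overlapping elements."""
    driver.execute_script('arguments[0].click()', element)


def hover_then_click(driver, element):
    """Move the mouse to the element first, then click it normally."""
    # Imported here so js_click stays usable on its own.
    from selenium.webdriver.common.action_chains import ActionChains
    # perform() is required; without it the queued move is never executed.
    ActionChains(driver).move_to_element(element).perform()
    element.click()
```

The JavaScript click fires even when another element covers the target, so it can hide real layout problems; the ActionChains variant stays closer to what a user would do.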

# scrapy

## Creating the project
- `scrapy startproject yunqiCrawl`
- `cd yunqiCrawl`
- `scrapy genspider -t crawl yunqi.qq.com yunqi.qq.com`


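## Installation
- Install pywin32
-- Source: https://github.com/mhammond/pywin32
-- I installed it directly with `pip install pywin32`

- Install pyOpenSSL
-- Source: https://github.com/pyca/pyopenssl
-- Problems encountered during installation:
    1. `ImportError: No module named setuptools_rust`: fixed with `pip install setuptools_rust`
    2. cryptography failed to install: installed it manually with `pip install cryptography`; see [reference](https://blog.csdn.net/u25th_engineer/article/details/112385052)

- Install lxml: `pip install lxml`

- Install Scrapy. I first installed version 1.8.0, but running the code raised errors, so I uninstalled it and installed version 1.0.5 instead. That still raised errors; see the problems listed below.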


## Problems
- I was using Python 2.7 at the time. Installing the latest release of some modules caused problems; uninstalling those modules and reinstalling older versions resolved them.

- UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0'
-- Cause: the `&nbsp;` in the page source is encoded in UTF-8 as `\xc2\xa0`; after decoding it becomes the Unicode character `\xa0`.
    When the text is printed to the DOS window it is converted to GBK, but `\xa0` has no GBK encoding, so the error is raised.
    [Reference](https://blog.csdn.net/github_35160620/article/details/53353672?utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromMachineLearnPai2%7Edefault-5.control&dist_request_id=1619540138457_26705&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromMachineLearnPai2%7Edefault-5.control)
-- Fix: replace it with a space, for example: `content = content.replace(u'\xa0', u' ')`

- scrapy contextlib ImportError: cannot import name suppress
-- Cause: in my case, the installed queuelib version
-- Fix: uninstall queuelib and reinstall version 19.10.0

- 'float' object is not iterable
-- Cause: the installed Twisted version
-- Fix: install Twisted 16.6.0 with `pip install Twisted==16.6.0`
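The `\xa0` replacement described in the list above can be sketched as a small helper. The function name `replace_nbsp` and the sample text are hypothetical; the sketch assumes the content is already a Unicode string.

```python
# coding:utf-8
def replace_nbsp(content):
    """Replace non-breaking spaces (U+00A0) with ordinary spaces."""
    return content.replace(u'\xa0', u' ')


# Hypothetical text extracted from a page that contained &nbsp;
raw = u'price:\xa0100'

try:
    # GBK has no mapping for U+00A0, so this is expected to raise.
    raw.encode('gbk')
except UnicodeEncodeError:
    print('raw text cannot be encoded as GBK')

cleaned = replace_nbsp(raw)
encoded = cleaned.encode('gbk')  # succeeds once the non-breaking space is gone
```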
  

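# Problems
- import urllib.parse ImportError: No module named parse
-- Wrong Python version: `urllib.parse` exists only in Python 3, so the package that depends on it has to be downgraded to a release that supports Python 2  
  [Reference](https://www.cnblogs.com/zishengY/articles/9336897.html)
  
- Deprecated option 'domaincontroller': use 'http_authenticator.domain_controller' instead.
-- Reinstall wsgidav at version 2.4.1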
  
- ImportError: cannot import name DispatcherMiddleware
--