# AmazonPython **Repository Path**: sopsoft/AmazonPython ## Basic Information - **Project Name**: AmazonPython - **Description**: An Amazon Crawler|Python|第一版亚马逊超级爬虫 - **Primary Language**: Python - **License**: Not specified - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2021-07-29 - **Last Updated**: 2021-07-29 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # An Amazon Crawler # ERROR Python function arg change will cause a big problem so must to make a new variable by copy.deepcopy ## Source Framework Develope by Python, Look at the following: spider (Crawler module) --- download (Crawler Download Module) --- parse (Crawler Parser Module) --- logic (Crawler Logic Module) bin (Crawler Execution File) --- spider (Crawler Entrance) --- tool (Auxiliary Tool) config (Config Module) --- base (Config File) --- config.py config.json log.json tool (Basic Tool) --- jfile --- jhttp --- jjson --- jmysql --- log.py acion (Action Module,Such as proxy IP,Useragent...) test (Test Dir) data (Data Keep) log (Log Keep) client (Export Data) doc (Help Doc) ## Third Party Library (to be installed) ``` pip3 install xlsxwriter pip3 install pymysql pip3 install requests pip3 install bs4 pip3 install redis yum install libxslt-devel pip3 install lxml pip3 install -U selenium pip3 install requests[socks] ``` ## Setting Environment Variables ``` set PYTHONPATH="G:/smartdo" Window export name="path" Linux ``` ## Using of the Basic Tool Please look at the test dir for example 1.jjson (JSON Deal Package) ``` json字符串解析成对象 def stringToObject(jstring) json字符串校验是否正确,可打印错误 def isRightJson(jstring, printerror=False) 对象解析成json字符串,支持排序和缩进 def objectToString(jobject, sort=False, indent=None) 格式化json字符串,默认按键排序 def formatStringToString(jstring, sort=True) 格式化json字符串,并可选择存入文件 def formatStrigToFile(filepath, sort=True, filesavepath="") ``` 2.jhttp (Network Package) ``` 自己封装的抓取函数 getHtml(url, daili='', postdata={}, header={}) header: {'User-Agent':'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5', 'Referer':'http://s.m.taobao.com', 'Host':'h5.m.taobao.com' } postdata: {"dd":"dd"} 数据URL转义 def urlencode(postdata={}): ``` 3.jfile (File Package) ``` 找出文件夹下所有xml后缀的文件,可选择递归,选择全路径 def listfiles(rootdir, prefix='.xml', isall=False, iscur=False) 将数据写入Excel def writeexcel(path, dealcontent=[]) 去除标题中的非法字符 (Windows) def validateTitle(title) 递归创建文件夹 def createjia(path) 今天日期的字符串 def todaystring(level=3) 获取文件类型,传入文件名 def filetype(filename) 文件路径拼接 def filejoin(file=[]) 从文件中读取行,变成列表 def readfilelist(filepath) 时间函数 def timetochina(longtime, formats='{}天{}小时{}分钟{}秒') today=time.strftime('%Y%m%d', time.localtime()) a=time.clock() b=time.clock() print('运行时间:'+timetochina(b-a)) 判断文件是否存在 def fileexsit(path) 切分文件列表 def devidelist(files=[],num=0) 取得URL参数 def geturlattr(url) 拼接参数join def joinany(things, sep=",") 文件批量改名 def renamedir(path, oprefix="md", nprefix="txt") ``` 4.jmysql (Database Package) ``` config = {"host": "localhost", "user": "root", "pwd": "6833066", "db": "doubanbook"} mysql=Mysql(config) mysql.ExecNonQuery("insert into `booktag` (bookname) values ('你哈') ") mysql.ExecQuery('SELECT bookname,bookkind,bookno FROM booktag limit 0,10;') ``` ``` Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Readability counts. ``` ## Over the wall(In China) ``` C:\Windows\System32\drivers\etc put host Windows: press Windows+R key,run cmd ,Run command at the command prompt: ipconfig /flushdns ``` ## How to Use? Doc!!! ## Question 协同作战 ``` git branch -r -d origin/branch-name git push origin :branch-name ``` 命令等 ``` yum install nethogs -y nethogs killall -9 python3 ``` 浏览器 ``` https://selenium-python.readthedocs.io/ https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/ ``` VPN ``` http://www.wanghailin.cn/centos-7-vpn/ ```