# Scraping novels from 红薯中文网 (Hongshu)

**Repository Path**: asoule/hongshu

## Basic Information

- **Project Name**: Scraping novels from 红薯中文网 (Hongshu)
- **Description**: 红薯中文网 loads some of the Chinese characters and punctuation in its novel text dynamically via JavaScript. This repository shows how to crack that JS and scrape the complete text.
- **Primary Language**: Python
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 2
- **Forks**: 0
- **Created**: 2019-08-22
- **Last Updated**: 2023-07-05

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Scraping novels from 红薯中文网 (Hongshu)

#### Description

红薯中文网 loads some of the Chinese characters and punctuation in its novel text dynamically via JavaScript. This repository shows how to crack that JS and scrape the complete text.

Start by opening the page in a browser and pressing F12 to inspect it. Page link: https://g.hongshu.com/content/93416/13877912.html

*(Screenshot: the chapter page in the browser dev tools.)*

As the screenshot shows, a small number of characters are filled in through `span` tags. Clicking one of these spans shows the following in the style panel on the right:

```css
.context_kw23::before {
    content: "地";
}
```

Clearly, these characters are injected dynamically by JavaScript. Inspecting all the span tags, every class attribute starts with `context_kw` followed by a numeric index, so we can safely guess that each index corresponds to one character or punctuation mark. Next we study the JS; the script we care about is embedded in the HTML response itself.

Digging into the JS reveals that the characters behind the indices live in a `words` variable, whose list positions match the numeric indices.

*(Screenshot: stepping through the JS in the debugger.)*

- `data` is the encrypted payload and `keywords` is the decryption key. A decryption routine in the JS produces a list of character codes, `secWords`, and `String.fromCharCode` then turns those codes into the character list `words`.

The JS is written in a very convoluted way, but there is no real need to reverse the whole generation process (if you want to, work backwards from where `words` is produced). You can simply copy the JS, run it locally, and collect the `words` list you need that way.

Full code:

```python
import requests
from lxml import etree
import re
# import execjs


def seedRequest(url, header):
    # Send the request and return the response body.
    response = requests.get(url=url, headers=header)
    response.encoding = "utf8"
    return response.text


def func(obj):
    # Replace one matched placeholder span with the character it stands for.
    span = obj.group(0)
    num = re.findall(r'context_kw(\d{1,2})', span, re.I)[0]
    try:
        return words[int(num)]
    except IndexError:
        print("No character found for index %s" % num)
        return "#"


def htmlReplace(html):
    # Substitute every placeholder span (pattern reconstructed; the
    # original regex was garbled in this README).
    responseReplace = re.sub(r'<span class="context_kw\d{1,2}"></span>', func, html)
    return responseReplace


# def get_words_js():
#     with open("./getWords.js", "r", encoding="utf8") as fp:
#         js = fp.read()
#     ctx = execjs.compile(js)
#     return ctx.call('parseWord')


def getContentHtml(html):
    # Parse the page and print the novel title and text.
    tree = etree.HTML(html)
    title = tree.xpath("//div[@class='lf']/h1/text()")[0]
    text = tree.xpath("//div[@class='rdtext']/p/text()")
    concatText = "".join(text)
    print(title)
    print(concatText)


if __name__ == '__main__':
    header = {
        "authority": "g.hongshu.com",
        "method": "GET",
        "path": "/content/93416/13877912.html",
        "scheme": "https",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        "cookie": "pgv_pvi=7004054528; bookfav=%7B%22b93416%22%3A0%7D; pgv_si=s2563727360; Hm_lvt_e966b218bafd5e76f0872a21b1474006=1566288274,1566295321,1566460817; Hm_lpvt_e966b218bafd5e76f0872a21b1474006=1566460817; yqksid=u5j08hk2dgmrtj0hirfv0niss2",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
    }
    url = "https://g.hongshu.com/content/93416/13877912.html"
    words = [",", "的", "。", "刘", "人", "一", "他", "是", "不", "在", "有", "了", "着", "”", "“", "秀", "大", "上", "道", "歆", "个", "名", "下", "地"]
    html = seedRequest(url, header)
    getContentHtml(htmlReplace(html))
```

> The commented-out function above runs local JS from Python via execjs; study it if you are interested. It failed for me because the JS file has a syntax problem when run outside the browser (it runs fine inside the browser).

## Update, 2019-8-24

**Repeated testing shows that `words` differs from page to page, so collecting the list by hand does not scale to large crawls. Selenium would work but is too slow, so I looked again at how to run the JS code locally.**

First, separate the JS out of the response page and patch it slightly:

```python
def createJs(response):
    # Extract the inline JS from the response and delete or patch the parts
    # that only run inside a browser. (Both patterns below are reconstructed;
    # the exact originals were garbled. See example.js in the repo.)
    jsText = re.findall(r'<script type="text/javascript">(.*?)</script>', response, flags=re.S)[0]
    # Drop the loop that writes the decoded styles into `document`.
    jsText = re.sub(r'for\(var i=0x0;i<[^}]*\}', '', jsText, flags=re.S)
    return jsText
```

> The JS references the `document` and `window` objects, which can only be used inside a browser, so to run it locally those calls must be deleted or replaced. That is what the code above does; see the `example.js` file on GitHub for the exact changes.

Now the JS can be run locally:

```python
def get_words_js(js):
    # Run the patched JS and return the words list.
    # with open("./createJs.js", "r", encoding="utf8") as fp:
    #     js = fp.read()
    data = js2py.eval_js(js)
    return data()
```

The GitHub repo contains a `testJs.py` file; running it looks like this:

*(Screenshot: testJs.py output.)*
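To see the `js2py` pattern behind `get_words_js` in isolation, here is a minimal, self-contained sketch. The JS below is a made-up stand-in for the page's real script (the character codes are hypothetical, not taken from the site), but the mechanism is the one described above: `String.fromCharCode` turns the decoded `secWords` codes into characters, and `eval_js` hands the JS function back as a Python callable.

```python
import js2py

# A toy stand-in for the page's obfuscation script. The code points are
# made up for illustration; a real page computes secWords by decrypting
# `data` with `keywords` first.
demo_js = """
function parseWord() {
    var secWords = [0x7684, 0x4e00, 0x4e86];
    var words = [];
    for (var i = 0; i < secWords.length; i++) {
        words.push(String.fromCharCode(secWords[i]));
    }
    return words;
}
parseWord
"""

parse_word = js2py.eval_js(demo_js)  # eval_js returns the value of the last expression
print(parse_word().to_list())        # ['的', '一', '了']
```

Feeding the real, patched page script through the same two calls is exactly what `get_words_js` does.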
Once you have the `words` list, you can regex away the span tags and then parse the content you want as normal (a minimal, isolated demo of this substitution step appears at the end of this README).

#### Full code after the changes

```python
import requests
from lxml import etree
import re
import js2py  # library for running JS


def seedRequest(url, header):
    # Send the request and return the response body.
    response = requests.get(url=url, headers=header)
    response.encoding = "utf8"
    print(response.status_code)
    return response.text


def createJs(response):
    # Extract the inline JS from the response and delete or patch the
    # browser-only parts (patterns reconstructed; see example.js).
    jsText = re.findall(r'<script type="text/javascript">(.*?)</script>', response, flags=re.S)[0]
    jsText = re.sub(r'for\(var i=0x0;i<[^}]*\}', '', jsText, flags=re.S)
    return jsText


def get_words_js(js):
    # Run the patched JS and return the words list.
    data = js2py.eval_js(js)
    return data()


def func(obj):
    # Replace one matched placeholder span with the character it stands for.
    span = obj.group(0)
    num = re.findall(r'context_kw(\d{1,2})', span, re.I)[0]
    try:
        return words[int(num)]
    except IndexError:
        print("No character found for index %s" % num)
        return "#"


def htmlReplace(html):
    # Substitute every placeholder span (pattern reconstructed).
    responseReplace = re.sub(r'<span class="context_kw\d{1,2}"></span>', func, html)
    return responseReplace


def getContentHtml(response):
    # Parse the page and print the novel title and text.
    tree = etree.HTML(response)
    title = tree.xpath("//div[@class='lf']/h1/text()")[0]
    text = tree.xpath("//div[@class='rdtext']/p/text()")
    concatText = "".join(text)
    print(title)
    print(concatText)


if __name__ == '__main__':
    header = {
        "authority": "g.hongshu.com",
        "method": "GET",
        "path": "/content/93416/13877912.html",
        "scheme": "https",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        "cookie": "pgv_pvi=7004054528; bookfav=%7B%22b93416%22%3A0%7D; pgv_si=s2563727360; Hm_lvt_e966b218bafd5e76f0872a21b1474006=1566288274,1566295321,1566460817; Hm_lpvt_e966b218bafd5e76f0872a21b1474006=1566460817; yqksid=u5j08hk2dgmrtj0hirfv0niss2",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
    }
    url = "https://g.hongshu.com/content/93416/13901181.html"
    html = seedRequest(url, header)
    words = get_words_js(createJs(html))
    getContentHtml(htmlReplace(html))
```

Run result:

*(Screenshot: run output.)*

#### This is original work; corrections are welcome

#### Contact me: 768348710@qq.com
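As promised above, here is a minimal, isolated demo of just the span-substitution step. The HTML fragment and the three-entry `words` list are made up for illustration; in the real script, `words` comes out of `get_words_js`.

```python
import re

# Made-up inputs, just to show the substitution mechanics.
words = ["，", "的", "地"]
html = ('<p>悄悄<span class="context_kw2"></span>离开'
        '<span class="context_kw0"></span>再见'
        '<span class="context_kw1"></span>人。</p>')

def restore(match):
    # Pull the numeric index out of the matched span and look up its character.
    idx = int(re.search(r'context_kw(\d+)', match.group(0)).group(1))
    return words[idx] if idx < len(words) else "#"

print(re.sub(r'<span class="context_kw\d+"></span>', restore, html))
# -> <p>悄悄地离开，再见的人。</p>
```

The fallback to `"#"` mirrors what `func` does in the full script when an index has no matching character.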