cpYang/PythonForWordAndPDF

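This repository does not declare an open-source license file (LICENSE); before using it, check the project description and the upstream dependencies of its code.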
WordDownloader.py 1.25 KB
yangyao committed on 2018-07-31 18:02 +08:00: change file name
import urllib.request
import sys
import re
import os


# open the url and read the raw response bytes
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    page.close()
    return html


# compile the regular expression and find
# all the .doc/.docx links we need
def getUrl(html):
    reg = r'(?:href|HREF)="?((?:http://)?.+?\.docx?)'
    url_re = re.compile(reg)
    html = html.decode('utf-8')  # python3: urlopen returns bytes
    url_lst = re.findall(url_re, html)
    return url_lst


# download one file in 8 KB blocks into the current directory
def getFile(url):
    file_name = url.split('/')[-1]
    u = urllib.request.urlopen(url)
    with open(file_name, 'wb') as f:
        block_sz = 8192
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break
            f.write(buffer)
    u.close()
    print("Successfully downloaded " + file_name)


# fixed base URL for attachment downloads
root_url = sys.argv[1]
# URL of the content detail page
raw_url = sys.argv[2]
# root_url = 'http://www.uccb.com.cn/'
#
# raw_url = 'http://www.uccb.com.cn/notice/noticedetail.aspx?info=IVZwcTyzpEs='
html = getHtml(raw_url)
url_lst = getUrl(html)
# check the same relative directory that is created and entered below
isExists = os.path.exists('ldf_download')
if not isExists:
    os.mkdir('ldf_download')
os.chdir(os.path.join(os.getcwd(), 'ldf_download'))
for url in url_lst:
    url = root_url + url
    getFile(url)
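
The script above builds each download address by concatenating root_url with the captured link, which assumes every link on the page is relative. As a minimal sketch of an alternative (not part of the original file), the link-extraction step could resolve links with urllib.parse.urljoin, which leaves absolute links untouched. The names extract_doc_links and DOC_LINK_RE, the simplified pattern, and the example.com addresses are hypothetical and only illustrate the idea.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: capture href values that end in .doc or .docx.
# It is a simplified variant of the expression in getUrl, not the original.
DOC_LINK_RE = re.compile(r'href="?([^"\s>]+?\.docx?)', re.IGNORECASE)


def extract_doc_links(html_text, base_url):
    """Return absolute URLs for every .doc/.docx link found in html_text."""
    # urljoin resolves relative links against base_url and
    # returns already-absolute links unchanged.
    return [urljoin(base_url, link) for link in DOC_LINK_RE.findall(html_text)]


# Made-up sample page fragment, used only to exercise the function offline.
sample = '<a href="/files/a.doc">A</a> <a HREF="http://example.com/b.docx">B</a>'
links = extract_doc_links(sample, 'http://example.com/notice/')
print(links)
```

With this approach the detail page URL can serve as the base, so a separate root_url argument may not be needed. Whether that suits the target site depends on how its attachment links are written.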