cpYang/PythonForWordAndPDF

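This repository has not declared an open-source license file (LICENSE); before use, check the project description and the upstream dependencies of its code.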
PDFReader.py 1.33 KB
yangyao committed on 2018-07-31 17:57 +08:00 . Initial commit
# Fetch and read a PDF from the web
# pdf READ operation
from urllib.request import urlopen
from urllib.error import URLError
from urllib.error import HTTPError
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO, open
import sys
import os
# A local file opened with pdffile = open("../../readme.pdf", "rb") can be read as well.
url = sys.argv[1]
# url = 'http://www.ynhtbank.com/ynhtyh/resource/cms/article/sub295208/378927/2018072715111046526.pdf'
def readPDF(filename):
    # filename is an open binary file-like object, not a path
    resmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(resmgr, retstr, laparams=laparams)
    process_pdf(resmgr, device, filename)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    return content
try:
    pdffile = urlopen(url)
except (URLError, HTTPError) as e:
    print("Errors:\n")
    print(e)
    sys.exit(1)  # pdffile is undefined past this point, so stop here
# Write the text to pdftext.txt
if os.path.exists(r'C:\PythonWorkspace/pdftext.txt'):
    os.remove(r'C:\PythonWorkspace/pdftext.txt')
outputString = readPDF(pdffile)
with open(r'C:\PythonWorkspace/pdftext.txt', 'a', encoding='utf-8') as f:
    f.write(outputString)
pdffile.close()
# Print to the console instead
# outputString = readPDF(pdffile)
# print(outputString)
# pdffile.close()
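
The download and write steps of the script can also be isolated into small helpers. The following is a minimal sketch, not part of the original file: the names `fetch_bytes` and `save_text` and the `timeout` default are hypothetical, chosen only for illustration. It relies on `HTTPError` being a subclass of `URLError`, so catching `URLError` alone covers both, and on mode `'w'` truncating an existing file, which makes the remove-then-append sequence unnecessary.

```python
import sys
from pathlib import Path
from urllib.error import URLError
from urllib.request import urlopen


def fetch_bytes(url, timeout=30):
    """Return the response body as bytes, or None if the request fails.

    Hypothetical helper; the 30-second timeout is an assumed default.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except URLError as e:  # HTTPError is a subclass of URLError
        print("Errors:\n", e, file=sys.stderr)
        return None


def save_text(text, path):
    """Write text to path as UTF-8, replacing any earlier file.

    Hypothetical helper; write_text opens the file in 'w' mode,
    so old content is truncated without a separate os.remove call.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path
```

Under these assumptions, the downloaded bytes could be wrapped in `io.BytesIO` and handed to `readPDF`, with the returned string passed to `save_text`. Returning `None` on failure lets the caller decide whether to exit, rather than continuing with an undefined file object.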