github_repo/ocr-table

pdf_miner.py 1.02 KB
cseas committed on 2018-11-21 15:02 +08:00 . Add alternate
# Alternate approach using pdfminer
import re
# io.BytesIO exists on both Python 2.6+ and Python 3, so no fallback import is needed.
from io import BytesIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
def convert(fname, pages=None):
    """Extract the text of a PDF, print it, and save it to output.txt."""
    # An empty set makes PDFPage.get_pages process every page.
    pagenums = set(pages) if pages else set()
    output = BytesIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums):
            interpreter.process_page(page)
    converter.close()
    # The converter wrote encoded bytes into the buffer; decode before use.
    text = output.getvalue().decode('utf-8')
    output.close()
    print(text)
    # Collapse runs of whitespace and write the result to a .txt file
    text = re.sub(r"\s\s+", " ", text)
    with open("output.txt", "w") as text_file:
        text_file.write(text)
convert("input.pdf")
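
# The whitespace clean-up step in convert() can be tried on its own, without a
# PDF or pdfminer installed. The sketch below isolates it. The helper name
# collapse_whitespace and the sample string are hypothetical, introduced only
# for illustration; the regular expression is the one used in convert().

```python
import re


def collapse_whitespace(text):
    """Replace every run of two or more whitespace characters with one space."""
    return re.sub(r"\s\s+", " ", text)


# Hypothetical sample resembling text extracted from a table row.
sample = "Name   Qty\n\nApple\t\t3"
print(collapse_whitespace(sample))
```

# Note that a single whitespace character, such as one newline between two
# words, does not match the pattern and is left unchanged.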