pdf2image

A Python 2.7 and 3.4+ module that wraps pdftoppm and pdftocairo to convert PDF pages to PIL Image objects

How to install

First you need poppler-utils

pdftoppm and pdftocairo are the programs that do the actual work. They are distributed as part of a larger package called poppler.

Using pip

Windows users will have to install poppler for Windows, then add the bin/ folder to PATH.

Mac users will have to install poppler for Mac.

Linux users will find both tools pre-installed on Ubuntu 16.04+ and Arch Linux. If they are missing, install them with your package manager, for example sudo apt install poppler-utils on Debian-based distributions.

Using conda

conda install -c conda-forge poppler

Then you can install the pip package!

pip install pdf2image

If you don't already have Pillow, install it with pip install pillow
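
Before converting anything, it can help to confirm that the poppler tools are actually reachable. The following is a minimal sketch, assuming Python 3 (shutil.which is not available in Python 2.7); the helper name find_poppler_tools is hypothetical and not part of pdf2image.

```python
import shutil

# The two poppler command-line tools that pdf2image wraps.
POPPLER_TOOLS = ("pdftoppm", "pdftocairo")


def find_poppler_tools(search_path=None):
    """Map each tool name to its resolved location, or None if not found.

    `search_path` is an optional PATH-style string; None means "use the
    PATH environment variable". This helper is illustrative only.
    """
    return {tool: shutil.which(tool, path=search_path) for tool in POPPLER_TOOLS}


if __name__ == "__main__":
    for tool, location in find_poppler_tools().items():
        print(f"{tool}: {location or 'NOT FOUND'}")
```

If a tool is reported as missing on Windows, the most likely cause is that poppler's bin/ folder has not been added to PATH.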

How does it work?

from pdf2image import convert_from_path, convert_from_bytes

from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)

Then simply do:

images = convert_from_path('/home/belval/example.pdf')

OR

with open('/home/belval/example.pdf', 'rb') as pdf_file:
    images = convert_from_bytes(pdf_file.read())

OR better yet

import tempfile

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/belval/example.pdf', output_folder=path)
    # Do something here

images will be a list of PIL Image objects, one for each page of the PDF document.
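
Since each entry is a PIL Image, the pages can be written to disk with Pillow's own save method. This is a minimal sketch assuming Python 3; save_pages and its stem and extension parameters are hypothetical names chosen for illustration.

```python
import os


def save_pages(images, out_dir, stem="page", extension="png"):
    """Save each page image as <stem>-<n>.<extension> and return the paths.

    `images` is any iterable of objects with a Pillow-style save(path)
    method, such as the list returned by convert_from_path.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for number, image in enumerate(images, start=1):
        target = os.path.join(out_dir, f"{stem}-{number:03d}.{extension}")
        image.save(target)  # Pillow infers the file format from the extension
        saved.append(target)
    return saved


# Possible usage (requires pdf2image and poppler):
# from pdf2image import convert_from_path
# images = convert_from_path('/home/belval/example.pdf')
# save_pages(images, 'out')
```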

Here are the definitions:

convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None)

convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None)
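
As an illustration of how these parameters combine, the sketch below converts only a range of pages at a chosen resolution and format. The wrapper convert_page_range and its convert argument are hypothetical; only dpi, first_page, last_page and fmt come from the definitions above, and page numbers are assumed to be 1-based, as in pdftoppm's -f and -l options.

```python
def convert_page_range(pdf_path, first_page, last_page, dpi=200, fmt="jpeg",
                       convert=None):
    """Convert pages first_page..last_page (1-based, inclusive).

    `convert` defaults to pdf2image.convert_from_path; it can be replaced
    with any callable that accepts the same keyword arguments.
    """
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    if convert is None:
        # Imported lazily so the helper can be loaded without pdf2image.
        from pdf2image import convert_from_path as convert
    return convert(pdf_path, dpi=dpi, first_page=first_page,
                   last_page=last_page, fmt=fmt)


# Possible usage (requires pdf2image and poppler):
# pages = convert_page_range('/home/belval/example.pdf', 2, 4, dpi=150)
```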

What's new?

  • single_file parameter allows you to convert the first PDF page only, without adding digits at the end of the output_file
  • poppler_path parameter allows you to specify poppler's installation path
  • Fixed a bug where a PNG buffer containing a non-terminating I-E-N-D sequence would throw an exception
  • Fixed a bug that left open file descriptors when using convert_from_bytes() (Thank you @FabianUken)
  • fmt='tiff' parameter allows you to create .tiff files (You need pdftocairo for this)
  • transparent parameter allows you to generate images with no background instead of the usual white one (You need pdftocairo for this)
  • strict parameter allows you to catch pdftoppm syntax errors as the custom exception type PDFSyntaxError
  • use_cropbox parameter allows you to use the crop box instead of the media box when converting (-cropbox in pdftoppm's CLI)

Performance tips

  • Using an output folder is significantly faster if you are using an SSD. Otherwise I/O usually becomes the bottleneck.
  • Using multiple threads can give you some gains, but avoid using more than 4, as this will cause an I/O bottleneck (even on my NVMe SSD!).
  • If I/O is your bottleneck, using the JPEG format can lead to significant gains.
  • The PNG format is pretty slow because of the compression.
  • If you want to know the best settings (most settings will be fine anyway), you can clone the project and run python tests.py to get timings.
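
For a quick comparison on your own documents without cloning the project, a small timing helper may be enough. This is a rough sketch, not the project's tests.py; time_call is a hypothetical name, and the commented usage assumes a file called example.pdf exists.

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


# Possible usage (requires pdf2image and poppler):
# from pdf2image import convert_from_path
# for fmt in ("ppm", "jpeg", "png"):
#     print(fmt, time_call(convert_from_path, "example.pdf", fmt=fmt))
```

Taking the best of several runs reduces noise from disk caching and other processes, but the numbers remain specific to your machine and PDF.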

Limitations / known issues

  • A relatively large PDF can use up all available memory and cause the process to be killed (unless you use an output folder)
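
Besides using an output folder, another way to keep memory bounded is to convert a few pages at a time with first_page and last_page. This is a sketch of that alternative, not something the project prescribes; page_chunks is a hypothetical helper, and the total page count is assumed to be known by the caller.

```python
def page_chunks(page_count, chunk_size):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    if page_count < 0 or chunk_size < 1:
        raise ValueError("page_count must be >= 0 and chunk_size >= 1")
    for first in range(1, page_count + 1, chunk_size):
        yield first, min(first + chunk_size - 1, page_count)


# list(page_chunks(10, 4)) -> [(1, 4), (5, 8), (9, 10)]

# Possible usage (requires pdf2image and poppler; 120 is a made-up page count):
# from pdf2image import convert_from_path
# for first, last in page_chunks(120, 10):
#     for image in convert_from_path("big.pdf", first_page=first, last_page=last):
#         ...  # process or save the page, then let it go out of scope
```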

MIT License

Copyright (c) 2017 Edouard Belval

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.