Extract images from PDF with PHP


Question

The thing is that the client wants to upload a PDF with images as a way of batch processing multiple images at once.

I already looked around, and out of the box PHP can't read PDFs.

What are my alternatives?

I already know the host has not installed ImageMagick or any PDF library, and the exec function is disabled. That's basically leaving me with nothing to work with, I guess?

Does anyone know if there is an online service that can do this, with an api of sorts?

Thanks in advance.

Recommended answer

AFAIK, there is no PHP module to do it. There is a command line tool, pdfimages (part of xpdf). For reference, here's how that works:

pdfimages -j source.pdf image

Which will extract all images from source.pdf as image-000.jpg, image-001.jpg, etc. Note that the -j flag only writes JPEG files for images stored in the PDF in DCT (JPEG) format; other images are written as PPM or PBM files.

Possible options

Being a command line tool, you need exec (or system, passthru, any of the command executing functions built into PHP). As your environment doesn't have that, I see four options:


  1. Beg that exec be turned on for you (your hosting provider can limit what you can exec to a single command)
  2. Change the design -- how about a ZIP upload?
  3. Roll your own, using the source code of pdfimages as a model
  4. Let pdfimages do the heavy lifting, by running it on a remote host you do control
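
If option 1 works out, calling the tool from PHP could look roughly like the sketch below. The helper names (buildPdfimagesCommand, execIsAvailable, extractImages) are made up for illustration, and it assumes pdfimages is on the PATH of the web server user.

```php
<?php
// Hypothetical helper: builds the pdfimages command line with shell-escaped arguments.
function buildPdfimagesCommand(string $pdfPath, string $outputPrefix): string
{
    return 'pdfimages -j ' . escapeshellarg($pdfPath) . ' ' . escapeshellarg($outputPrefix);
}

// Returns true when exec() exists and is not listed in disable_functions.
function execIsAvailable(): bool
{
    $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));
    return function_exists('exec') && !in_array('exec', $disabled, true);
}

// Runs pdfimages and returns the files it wrote,
// or null when exec is unavailable or the tool reports a failure.
function extractImages(string $pdfPath, string $outputPrefix): ?array
{
    if (!execIsAvailable()) {
        return null;
    }
    exec(buildPdfimagesCommand($pdfPath, $outputPrefix), $output, $exitCode);
    if ($exitCode !== 0) {
        return null;
    }
    $files = glob($outputPrefix . '-*');
    return $files === false ? [] : $files;
}
```

Calling extractImages('/path/to/uploaded.pdf', '/path/to/out/image') would then return the list of extracted files, or null on a host like the one in the question where exec is disabled.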

Regarding #3, I don't think rolling your own, to solve a very narrow definition of requirements, would be too difficult. The image boundaries in PDF are well defined: each image is a stream object whose data sits between the stream and endstream keywords. For JPEG images (filter /DCTDecode) the stream bytes are the JPEG file itself, so there is nothing to base64_decode: just read the file up to a boundary, cut to the end of the boundary, and write the bytes to a file, then repeat. Images stored with other filters, such as /FlateDecode, need decompressing and a file header of their own. However, that may be too much...
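
A minimal sketch of the roll-your-own idea, for the narrow case of JPEG images only. It assumes the images are stored as plain /DCTDecode streams, whose bytes are the JPEG file itself, so no decoding step is applied. The function name extractJpegStreams is made up, and the scan is deliberately naive: it does not parse the PDF object structure, and it skips anything that does not start with the JPEG start-of-image marker.

```php
<?php
// Scans raw PDF bytes for streams whose dictionary names the /DCTDecode filter
// and returns every stream body that starts with the JPEG SOI marker (FF D8).
function extractJpegStreams(string $pdfBytes): array
{
    $images = [];
    $offset = 0;
    while (($filterPos = strpos($pdfBytes, '/DCTDecode', $offset)) !== false) {
        // The stream keyword follows the dictionary that mentions the filter.
        $streamPos = strpos($pdfBytes, 'stream', $filterPos);
        if ($streamPos === false) {
            break;
        }
        $start = $streamPos + strlen('stream');
        // The keyword is followed by CRLF or LF before the data begins.
        if (substr($pdfBytes, $start, 2) === "\r\n") {
            $start += 2;
        } elseif (substr($pdfBytes, $start, 1) === "\n") {
            $start += 1;
        }
        $end = strpos($pdfBytes, 'endstream', $start);
        if ($end === false) {
            break;
        }
        $data = substr($pdfBytes, $start, $end - $start);
        // Keep only bodies that really look like JPEG data.
        if (strncmp($data, "\xFF\xD8", 2) === 0) {
            // Drop the end-of-line that precedes the endstream keyword.
            $images[] = rtrim($data, "\r\n");
        }
        $offset = $end + strlen('endstream');
    }
    return $images;
}
```

Each returned string can be written out with file_put_contents, for example as image-000.jpg, image-001.jpg and so on. Images using other filters, or several filters chained together, are ignored by this sketch.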

If rolling your own is too complicated, then option #4 is kind of like what Joel Spolsky describes for working with complicated Excel objects (see the numbered list under the bold heading "Let Office do the heavy work for you").


  • Find a cheap hosting environment (e.g. Amazon EC2) that lets you exec and curl
  • Install pdfimages
  • Write a PHP script that takes a URL to a PDF, downloads that PDF with curl, writes it to disk, passes it to pdfimages, then returns the URLs of the resulting images.

An example exchange could look like this:

GET http://www.cheaphost.com/pdfimages.php?extract=http://www.limitedhost.com/path/to/uploaded.pdf

Content-type: text/html


<html>
<body>
<ul>
<li>http://www.cheaphost.com/pdfimages.php?retrieve=ab9895v/image-000.jpg</li>
<li>http://www.cheaphost.com/pdfimages.php?retrieve=ab9895v/image-001.jpg</li>
</ul>
</body>
</html>
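
On the limited host, a response like the one above could be consumed roughly as follows. Both helper names are made up for this sketch. How the request itself is sent (file_get_contents on a URL, curl, or something else) depends on what the limited host allows, so that part is left out.

```php
<?php
// Hypothetical helper: builds the extract request URL, encoding the PDF URL
// so it survives as a single query parameter.
function buildExtractUrl(string $serviceUrl, string $pdfUrl): string
{
    return $serviceUrl . '?extract=' . rawurlencode($pdfUrl);
}

// Hypothetical helper: pulls the image URLs out of the <li> items of the response.
function parseImageUrls(string $html): array
{
    preg_match_all('#<li>\s*(.*?)\s*</li>#s', $html, $matches);
    return array_map('html_entity_decode', $matches[1]);
}
```

Each URL returned by parseImageUrls can then be fetched in turn to pull the extracted images back to the limited host.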

So your single pdfimages.php script (running on the host with the exec functionality) can both extract images and give you access to the extracted images. When extracting, it reads the PDF you point it at, runs pdfimages on it, and gives you back a list of URLs to call to retrieve the extracted images. When retrieving, it just gives you back a straight image.
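
A rough sketch of what such a pdfimages.php could look like. The work directory, the SELF_URL constant, the token format and all function names are assumptions made for illustration, and error handling is kept to a minimum.

```php
<?php
// Hypothetical sketch of pdfimages.php for the host where exec and curl are allowed.
const WORK_DIR = '/tmp/pdfimages';                            // assumed writable directory
const SELF_URL = 'http://www.cheaphost.com/pdfimages.php';    // URL from the example exchange

// Turns extracted file paths into retrieve URLs.
function buildRetrieveUrls(string $selfUrl, string $token, array $files): array
{
    return array_map(
        fn(string $file): string => $selfUrl . '?retrieve=' . $token . '/' . basename($file),
        $files
    );
}

// Accepts only "token/image-NNN.ext", which blocks path traversal; returns null otherwise.
function parseRetrieveParam(string $param): ?array
{
    if (preg_match('#^([a-z0-9]+)/(image-\d+\.(?:jpg|ppm|pbm))$#', $param, $m) !== 1) {
        return null;
    }
    return ['token' => $m[1], 'file' => $m[2]];
}

// Downloads the PDF, runs pdfimages on it and prints the list of retrieve URLs.
function handleExtract(string $pdfUrl): void
{
    $token = bin2hex(random_bytes(8));
    $dir = WORK_DIR . '/' . $token;
    if (!mkdir($dir, 0700, true)) {
        http_response_code(500);
        return;
    }
    $fh = fopen($dir . '/source.pdf', 'wb');
    $ch = curl_init($pdfUrl);
    curl_setopt($ch, CURLOPT_FILE, $fh);            // stream the download straight to disk
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $ok = curl_exec($ch);
    fclose($fh);
    if ($ok === false) {
        http_response_code(502);
        return;
    }
    exec('pdfimages -j ' . escapeshellarg($dir . '/source.pdf') . ' ' . escapeshellarg($dir . '/image'));
    $files = glob($dir . '/image-*') ?: [];
    header('Content-Type: text/html');
    echo "<html>\n<body>\n<ul>\n";
    foreach (buildRetrieveUrls(SELF_URL, $token, $files) as $url) {
        echo '<li>' . htmlspecialchars($url) . "</li>\n";
    }
    echo "</ul>\n</body>\n</html>\n";
}

// Sends one extracted image back and deletes it afterwards.
function handleRetrieve(string $param): void
{
    $parts = parseRetrieveParam($param);
    $path = $parts === null ? null : WORK_DIR . '/' . $parts['token'] . '/' . $parts['file'];
    if ($path === null || !is_file($path)) {
        http_response_code(404);
        return;
    }
    $isJpeg = substr($path, -4) === '.jpg';
    header('Content-Type: ' . ($isJpeg ? 'image/jpeg' : 'application/octet-stream'));
    readfile($path);
    unlink($path);   // cleanup after retrieval
}

if (isset($_GET['extract'])) {
    handleExtract((string) $_GET['extract']);
} elseif (isset($_GET['retrieve'])) {
    handleRetrieve((string) $_GET['retrieve']);
}
```

As written, the script will download any URL it is given, so a real deployment would probably want to restrict the extract parameter to the limited host and require some form of authentication.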

You would need to deal with cleanup; perhaps the thing to do would be to delete each image after retrieval. You would also need to handle security. I don't know what's in these images, but the content might need to be served over SSL, with other precautions taken.
