Extract images from PDF with PHP

Views: 130 | Published: 2018/7/26 14:48:08 | Tags: php, image, pdf

Question

The client wants to upload a PDF containing images as a way of batch processing multiple images at once.

I already looked around, and out of the box PHP can't read PDFs.

What are my alternatives?

I already know the host has neither ImageMagick nor any PDF library installed, and the exec function is disabled. That basically leaves me with nothing to work with, I guess?

Does anyone know if there is an online service that can do this, with an API of sorts?

Thanks in advance.

Recommended answer

AFAIK, there is no PHP module that does this. There is a command-line tool, pdfimages (part of xpdf). For reference, here's how that works:

pdfimages -j source.pdf image

Which will extract all images from source.pdf as image-000.jpg, image-001.jpg, etc. Note that with -j only images stored in DCT (JPEG) format are written as .jpg files; any other images are written as PBM/PPM files.
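
On a host where command execution is allowed, the call above could be wrapped from PHP roughly as follows. This is a minimal sketch, not part of the original answer: the function names (execIsAvailable, pdfimagesCommand, extractImages) and the output directory handling are hypothetical, and it assumes pdfimages is on the PATH.

```php
<?php
/** True if exec() exists and is not listed in the disable_functions ini setting. */
function execIsAvailable(): bool
{
    $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));
    return function_exists('exec') && !in_array('exec', $disabled, true);
}

/** Build the shell command "pdfimages -j <pdf> <prefix>" with escaped arguments. */
function pdfimagesCommand(string $pdfPath, string $prefix): string
{
    return 'pdfimages -j ' . escapeshellarg($pdfPath) . ' ' . escapeshellarg($prefix);
}

/** Run pdfimages and return the paths of the files it wrote into $outDir. */
function extractImages(string $pdfPath, string $outDir): array
{
    if (!execIsAvailable()) {
        throw new RuntimeException('exec() is disabled on this host');
    }
    $prefix = rtrim($outDir, '/') . '/image';
    exec(pdfimagesCommand($pdfPath, $prefix), $output, $status);
    if ($status !== 0) {
        throw new RuntimeException("pdfimages exited with status $status");
    }
    // pdfimages names its output <prefix>-NNN.<ext>
    return glob($prefix . '-*') ?: [];
}
```

Calling execIsAvailable() first gives a quick way to confirm whether the host really blocks exec before trying anything else.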

Possible options

Being a command-line tool, pdfimages needs exec (or system, passthru, or any other of the command-executing functions built into PHP). As your environment doesn't have that, I see four options:

1. Beg that exec be turned on for you (your hosting provider can limit what you can exec to a single command).
2. Change the design: how about a ZIP upload?
3. Roll your own, using the source code of pdfimages as a model.
4. Let pdfimages do the heavy lifting, by running it on a remote host you do control.

Regarding #3: I don't think rolling your own, to solve a very narrowly defined requirement, would be too difficult. I seem to recall that the image boundaries in a PDF are well defined: read the file up to the start of an image stream, cut to the end of the stream, decode it according to the stream's filter (JPEG data stored with the DCTDecode filter can be written out unchanged), and write it to a file; repeat. However, that may be too much...
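
A rough sketch of that roll-your-own idea in plain PHP, with no exec and no extensions. It rests on several assumptions that are not in the original answer: the PDF is unencrypted, the images of interest are JPEGs stored as /DCTDecode streams (which hold raw JPEG bytes), and the function name extractJpegsFromPdf is made up for illustration. Images stored with other filters, such as FlateDecode, are skipped.

```php
<?php
/**
 * Scan a PDF's bytes for stream objects whose dictionary mentions /DCTDecode
 * and return the raw JPEG data of each one.
 */
function extractJpegsFromPdf(string $pdf): array
{
    $images = [];
    $offset = 0;
    // A stream object looks like: "N G obj << dict >> stream\n<bytes>\nendstream"
    while (preg_match('/(?<!end)stream\r?\n/', $pdf, $m, PREG_OFFSET_CAPTURE, $offset) === 1) {
        $keywordPos = $m[0][1];
        $dataStart  = $keywordPos + strlen($m[0][0]);
        $dataEnd    = strpos($pdf, 'endstream', $dataStart);
        if ($dataEnd === false) {
            break; // truncated file: no closing keyword
        }
        // Text between the previous stream and this one contains the stream dictionary.
        $header = substr($pdf, $offset, $keywordPos - $offset);
        // Drop the end-of-line that precedes "endstream"; it is not part of the data.
        $data = rtrim(substr($pdf, $dataStart, $dataEnd - $dataStart), "\r\n");
        // Keep only DCT streams that really start with the JPEG start-of-image marker.
        if (strpos($header, '/DCTDecode') !== false && strncmp($data, "\xFF\xD8", 2) === 0) {
            $images[] = $data;
        }
        $offset = $dataEnd + strlen('endstream');
    }
    return $images;
}
```

Each returned string could then be saved with file_put_contents, for example as image-000.jpg, image-001.jpg, and so on. The sketch trusts the endstream keyword rather than the /Length entry, which keeps it short but makes it less robust than a real parser.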

If rolling your own is too complicated, then option #4 is kind of like what Joel Spolsky describes for working with complicated Excel objects (see the numbered list under the bold heading "Let Office do the heavy work for you").

1. Find a cheap hosting environment (e.g. Amazon EC2) that lets you use exec and curl.
2. Install pdfimages.
3. Write a PHP script that takes a URL to a PDF, opens that PDF with curl, writes it to disk, passes it to pdfimages, then returns the URLs of the resulting images.

An example exchange could look like this:

GET http://www.cheaphost.com/pdfimages.php?extract=http://www.limitedhost.com/path/to/uploaded.pdf

Content-type: text/html


<html>
<body>
<ul>
<li>http://www.cheaphost.com/pdfimages.php?retrieve=ab9895v/image-000.jpg</li>
<li>http://www.cheaphost.com/pdfimages.php?retrieve=ab9895v/image-001.jpg</li>
</ul>
</body>
</html>

So your single pdfimages.php script (running on the host that has exec enabled) can both extract images and give you access to the extracted images. When extracting, it reads the PDF you point it at, runs pdfimages on it, and gives you back a list of URLs to call to retrieve the extracted images. When retrieving, it simply returns the image itself.
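
The two modes described above might be sketched like this. Everything beyond the extract and retrieve parameters is an assumption made for illustration: the WORK_DIR scratch directory, the random job id, the path validation rule, and the helper names are hypothetical, and the sketch does no authentication or restriction of which URLs it will fetch.

```php
<?php
// pdfimages.php: sketch of a helper script with an extract mode and a retrieve mode.
const WORK_DIR = '/tmp/pdfimages-jobs'; // hypothetical scratch directory

/** Accept only "<jobid>/image-<digits>.<ext>" so a retrieve cannot leave WORK_DIR. */
function isSafeRetrievePath(string $path): bool
{
    return preg_match('#^[a-f0-9]{8,32}/image-\d+\.(jpg|ppm|pbm)$#', $path) === 1;
}

/** Download $url to $dest with curl; true only on an HTTP 200 response. */
function downloadPdf(string $url, string $dest): bool
{
    $fh = fopen($dest, 'wb');
    if ($fh === false) {
        return false;
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FILE, $fh);            // stream the body straight to disk
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_PROTOCOLS, CURLPROTO_HTTP | CURLPROTO_HTTPS);
    $ok = curl_exec($ch) !== false && curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200;
    fclose($fh);
    return $ok;
}

/** Fetch the PDF, run pdfimages on it, and return one retrieve URL per image. */
function handleExtract(string $pdfUrl, string $selfUrl): array
{
    $job = bin2hex(random_bytes(8));                // 16 hex characters
    $dir = WORK_DIR . '/' . $job;
    if (!mkdir($dir, 0700, true) || !downloadPdf($pdfUrl, "$dir/source.pdf")) {
        return [];
    }
    exec('pdfimages -j ' . escapeshellarg("$dir/source.pdf") . ' '
        . escapeshellarg("$dir/image"), $output, $status);
    if ($status !== 0) {
        return [];
    }
    $urls = [];
    foreach (glob("$dir/image-*") ?: [] as $file) {
        $urls[] = $selfUrl . '?retrieve=' . $job . '/' . basename($file);
    }
    return $urls;
}

/** Send one extracted image back, then delete it. */
function handleRetrieve(string $path): void
{
    $file = WORK_DIR . '/' . $path;
    if (!isSafeRetrievePath($path) || !is_file($file)) {
        http_response_code(404);
        return;
    }
    $isJpeg = substr($file, -4) === '.jpg';
    header('Content-Type: ' . ($isJpeg ? 'image/jpeg' : 'application/octet-stream'));
    readfile($file);
    unlink($file);
}

// Dispatch only when served over HTTP, so the functions can be loaded in isolation.
if (PHP_SAPI !== 'cli') {
    if (isset($_GET['retrieve'])) {
        handleRetrieve((string) $_GET['retrieve']);
    } elseif (isset($_GET['extract'])) {
        $self = 'http://' . $_SERVER['HTTP_HOST'] . $_SERVER['SCRIPT_NAME'];
        header('Content-Type: text/html');
        echo "<html>\n<body>\n<ul>\n";
        foreach (handleExtract((string) $_GET['extract'], $self) as $url) {
            echo '<li>' . htmlspecialchars($url) . "</li>\n";
        }
        echo "</ul>\n</body>\n</html>\n";
    } else {
        http_response_code(400);
    }
}
```

The limited host would call the extract URL once, then fetch each listed URL and store the bytes locally. Deleting each file as it is served is one simple cleanup policy; the downloaded source.pdf and the job directory would still need a separate sweep.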

You would need to deal with cleanup; perhaps the thing to do would be to delete each image after retrieval. You would also need to handle security: I don't know what's in these images, but the content might need to be served over SSL, with other precautions taken as well.
