pdfbox and itext not able to extract image


Problem description

I am trying to extract images from a PDF. PDFBox is able to extract images from most PDFs, but there are some PDFs whose images are not extracted by PDFBox.

For extracting the images I am using the code from the following question: "Not able to extract images from PDFA1-a format document".

You can download a sample PDF with this problem from this link: http://myslams.com/test/2.pdf

Is there something wrong with the code, maybe something I forgot to handle, or is there something wrong with the PDF altogether?

Recommended answer

As the OP has not yet replaced his stale sample PDF link with a working one, the question can only be answered in a general way.

The code referenced by the OP (with the corrections in the answer of @Tilman) iterates over the immediate image resources of each page and stores each of them as a file.

Thus, the code may store too many images, because an image resource of a page is not necessarily used on that page:


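  1. On the one hand, an image resource may not be used in the file at all, or at least nowhere visible, and may merely be left over from an earlier PDF editing session.

  2. On the other hand, multiple pages may share one resources dictionary that contains all images of all these pages; in this case the OP's code exports many duplicates.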
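
Both effects can be reduced by exporting only images that a page actually paints and by remembering which image objects were already exported. The following is a minimal sketch of that idea on a deliberately simplified model, not PDFBox or iText code: the class `UsedImageFilter`, its methods, and the integer object ids are hypothetical, and the whitespace tokenizer is an assumption that only holds for trivial content streams (a real one needs a proper content stream parser).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model; not PDFBox or iText API.
class UsedImageFilter {

    // Returns the resource names painted with the "Do" operator in a
    // whitespace-separated content stream, in order of first use.
    static Set<String> namesPaintedWithDo(String contentStream) {
        Set<String> used = new LinkedHashSet<>();
        String[] tokens = contentStream.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            if (tokens[i].equals("Do") && tokens[i - 1].startsWith("/")) {
                used.add(tokens[i - 1].substring(1)); // drop the leading '/'
            }
        }
        return used;
    }

    // pageResources maps an image resource name to a hypothetical object id.
    // alreadyExported is shared across pages so that an image referenced
    // from several pages is exported only once.
    static List<Integer> imagesToExport(String contentStream,
                                        Map<String, Integer> pageResources,
                                        Set<Integer> alreadyExported) {
        List<Integer> result = new ArrayList<>();
        for (String name : namesPaintedWithDo(contentStream)) {
            Integer id = pageResources.get(name);
            // Skip names that are not image resources and ids seen before.
            if (id != null && alreadyExported.add(id)) {
                result.add(id);
            }
        }
        return result;
    }
}
```

Calling `imagesToExport` once per page with the same `alreadyExported` set skips unused resources as well as images already written for an earlier page. Note that `Do` also paints form XObjects, which this sketch simply ignores.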

And the code may store too few images, because there are other places where images may be put:


  1. Image data may be directly included in the page content stream, as so-called inline images.

  2. Constructs with their own resources (form XObjects, patterns, Type 3 font glyphs) used from the page content may provide their own image resources or inline images.

  3. Annotations, e.g. AcroForm form fields, may also have their own appearance streams with their own resources and, therefore, may provide their own image resources or inline images, too.

  4. XFA forms may provide their own images, too.
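
Items 1 to 3 amount to a recursive walk: every construct that carries its own resources has to be visited, not only the page itself. The following is a minimal sketch of such a walk over a hypothetical in-memory model. `ImageCollector` and `Container` are illustrative names, not PDFBox or iText classes, and images are represented by plain strings. XFA forms (item 4) are stored differently and are not covered here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory model of a PDF resource tree; not PDFBox or iText API.
class ImageCollector {

    // Anything that carries its own resources: a page, a form XObject, a
    // pattern, a Type 3 glyph procedure, or an annotation appearance stream.
    static class Container {
        final String label;
        final List<String> images = new ArrayList<>();     // image XObjects and inline images
        final List<Container> nested = new ArrayList<>();  // constructs with their own resources

        Container(String label) {
            this.label = label;
        }
    }

    // Depth-first walk collecting every image reachable from root.
    static List<String> collect(Container root) {
        List<String> found = new ArrayList<>();
        // Identity-based set: a container shared by several parents, or one
        // that references itself, is processed only once.
        Set<Container> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        walk(root, visited, found);
        return found;
    }

    private static void walk(Container c, Set<Container> visited, List<String> found) {
        if (!visited.add(c)) {
            return; // already processed
        }
        found.addAll(c.images);
        for (Container child : c.nested) {
            walk(child, visited, found);
        }
    }
}
```

The visited set matters in practice because resource structures can be shared between pages or can even be circular in broken files; without it the walk could report duplicates or never terminate.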

As soon as the OP provides a representative sample file, the type of images he is missing can be determined and a specific solution can be outlined.

EDIT

According to a comment by the OP, his image extraction problems have been resolved by making use of the information from this answer to his question "pdfbox and itext extracting image with incorrect dpi". In particular, the pointer to example code appropriate for PDFBox version 1.8.8, which the OP seems to use, appears to have been important.

Thus, any kind of wrong output may also occur as a result of software orchestration issues.
