Error: org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage


Problem description

I am trying to extract images from a PDF using pdfbox. I took help from this post. It worked for some PDFs, but for most of the others it did not. For example, I am not able to extract the figures in this file.

After some research I found that PDResources.getImages is deprecated, so I am using PDResources.getXObjects() instead. With this, I cannot extract any images from the PDF and instead get this message on the console:

org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage

Now I am stuck and unable to find a solution. Please assist if you can.

////// UPDATE IN REPLY TO COMMENTS //////

I am using pdfbox-1.8.10

Here is the code:

public void getimg() throws Exception {
    try {
        String sourceDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/inputs/Yavaa.pdf";
        String destinationDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/outputs/";
        File oldFile = new File(sourceDir);
        if (oldFile.exists()) {
            PDDocument document = PDDocument.load(sourceDir);
            List<PDPage> list = document.getDocumentCatalog().getAllPages();
            String fileName = oldFile.getName().replace(".pdf", "_cover");
            int totalImages = 1;
            for (PDPage page : list) {
                PDResources pdResources = page.getResources();
                Map pageImages = pdResources.getXObjects();
                if (pageImages != null) {
                    Iterator imageIter = pageImages.keySet().iterator();
                    while (imageIter.hasNext()) {
                        String key = (String) imageIter.next();
                        Object obj = pageImages.get(key);
                        if (obj instanceof PDXObjectImage) {
                            PDXObjectImage pdxObjectImage = (PDXObjectImage) obj;
                            pdxObjectImage.write2file(destinationDir + fileName + "_" + totalImages);
                            totalImages++;
                        }
                    }
                }
            }
        } else {
            System.err.println("File not exist");
        }
    } catch (Exception e) {
        System.err.println(e.getMessage());
    }
}

//// PARTIAL SOLUTION ////

I have solved the problem of the error message and have updated the post with the corrected code. However, the underlying problem remains: I still cannot extract the images from some of the files, such as the one mentioned in this post. Is there any solution for this?

Recommended answer

The first problem with the original code is that XObjects can be either PDXObjectImage or PDXObjectForm, so you need to check the instance type before casting. The second problem is that the code doesn't walk PDXObjectForm recursively; forms can have resources too. The third problem (only in 1.8) is that you used getResources() instead of findResources(); getResources() doesn't check higher levels of the page tree, from which a page can inherit its resources.
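To illustrate the first two points, here is a minimal sketch of the recursive walk. It deliberately uses small stand-in classes rather than PDFBox itself, so it runs without the library: Img, Form and Res are hypothetical names that only mirror the roles of PDXObjectImage, PDXObjectForm and PDResources. In real 1.8 code the starting resources would come from page.findResources(), and images would be written out instead of collected by name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in types for illustration only; these are NOT PDFBox classes.
// XObj ~ PDXObject, Img ~ PDXObjectImage, Form ~ PDXObjectForm, Res ~ PDResources.
abstract class XObj {}

class Img extends XObj {
    final String name;
    Img(String name) { this.name = name; }
}

class Res {
    // Insertion-ordered so the traversal order is predictable.
    final Map<String, XObj> xObjects = new LinkedHashMap<String, XObj>();
}

class Form extends XObj {
    final Res resources; // a form may have no resources of its own
    Form(Res resources) { this.resources = resources; }
}

class XObjectWalkSketch {
    // Collects image names, descending into forms instead of casting blindly.
    static void collectImages(Res res, List<String> found) {
        if (res == null) {
            return;
        }
        for (XObj x : res.xObjects.values()) {
            if (x instanceof Img) {
                found.add(((Img) x).name);
            } else if (x instanceof Form) {
                // Forms can carry their own resources, so recurse.
                collectImages(((Form) x).resources, found);
            }
        }
    }

    public static void main(String[] args) {
        Res inner = new Res();
        inner.xObjects.put("Im2", new Img("nested.png"));

        Res page = new Res();
        page.xObjects.put("Im1", new Img("top.png"));
        page.xObjects.put("Fm1", new Form(inner));

        List<String> found = new ArrayList<String>();
        collectImages(page, found);
        System.out.println(found); // prints [top.png, nested.png]
    }
}
```

The sketch does not guard against a form that references itself; a real implementation may want to track visited objects.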

Code for 1.8 can be found here: https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/ExtractImages.java?view=markup

Code for 2.0 can be found here: https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractImages.java?view=markup&sortby=date

(Even these are not always perfect, see this answer)

The fourth problem is that your file doesn't contain any XObjects at all. All of its "graphics" are really vector drawings, which can't be "extracted" like embedded images. All you can do is convert the PDF pages to images, and then mark and cut out what you need.
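The "cut" step can be sketched with plain java.awt once a page has been rendered. The example below is only an illustration under assumptions: the blank BufferedImage stands in for a rendered page (in PDFBox 1.8 that would typically come from PDPage.convertToImage(), in 2.0 from PDFRenderer), and the crop rectangle is a made-up value that you would have to determine for each figure yourself.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

class PageCropSketch {
    // Cuts a region out of a rendered page image, clamped to the image bounds.
    static BufferedImage crop(BufferedImage page, Rectangle region) {
        Rectangle bounds = new Rectangle(0, 0, page.getWidth(), page.getHeight());
        Rectangle clipped = region.intersection(bounds);
        if (clipped.isEmpty()) {
            throw new IllegalArgumentException("Region lies outside the page image");
        }
        return page.getSubimage(clipped.x, clipped.y, clipped.width, clipped.height);
    }

    public static void main(String[] args) {
        // Placeholder for a rendered page; a real one would come from PDFBox.
        BufferedImage page = new BufferedImage(612, 792, BufferedImage.TYPE_INT_RGB);

        // Hypothetical figure location, in pixels of the rendered image.
        BufferedImage figure = crop(page, new Rectangle(100, 200, 300, 150));
        System.out.println(figure.getWidth() + "x" + figure.getHeight()); // prints 300x150
    }
}
```

The cropped image can then be saved with javax.imageio.ImageIO.write. Note that getSubimage shares pixel data with the original image, so copy it first if you plan to modify the result.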
