Not able to extract images from a PDF/A-1a format document


Problem description

I am using the following code to extract images from a PDF in PDF/A-1a format, but I am not able to get the images.

List<PDPage> list = document.getDocumentCatalog().getAllPages();

String fileName = oldFile.getName().replace(".pdf", "_cover");
int totalImages = 1;
for (PDPage page : list) {

    PDResources pdResources = page.findResources();

    Map pageImages = pdResources.getImages();
    if (pageImages != null) {
        InputStream xmlInputStream = null;
        Iterator imageIter = pageImages.keySet().iterator();
        while (imageIter.hasNext()) {
            String key = (String) imageIter.next();
            PDXObjectImage pdxObjectImage = (PDXObjectImage) pageImages.get(key);

            System.out.println(convertStreamToString(xmlInputStream));
            System.out.println(pdxObjectImage.hashCode());
            System.out.println(pdxObjectImage.getColorSpace().getJavaColorSpace().isCS_sRGB());

            pdxObjectImage.write2file(destinationDir + fileName+ "_" + totalImages);
            totalImages++;

            break;
        }
    }
}

I am able to extract images from normal PDFs using the above code, but not from PDF/A-1a PDFs. It seems the following line

PDResources pdResources = page.findResources(); 

is not returning images. I have also tried page.getResources(), but I still get no images. I have even tried iText, but it does not give me any images either.

If I try to convert a page of the PDF to an image using the following code

BufferedImage bufferedImage = page.convertToImage();
File outputfile = new File(destinationDir+"image1.JPEG");
ImageIO.write(bufferedImage, "JPEG", outputfile);

the resulting images seem to have no metadata associated with them, so I still cannot determine their DPI or whether they are color or grayscale.

Currently I am using PDFBox for this. I have already spent 2 days searching on Google, but I still haven't found any code or documentation for doing this.

How can I do this in Java?

Is it possible to get the DPI, or to find out whether the PDF is color or black and white, without extracting the images?

Recommended answer

Your problem is a combination of two issues:

1) The "break;". Your file has two images. The first one is transparent or grey or whatever, and JPEG encoded, but it isn't the one you want. The second one is the one you want, but the break aborts the loop after the first image. So I just changed a code segment of yours to this:

while (imageIter.hasNext())
{
    String key = (String) imageIter.next();
    PDXObjectImage pdxObjectImage = (PDXObjectImage) pageImages.get(key);
    System.out.println(totalImages);
    pdxObjectImage.write2file("C:\\SOMEPATH\\" + fileName + "_" + totalImages);
    totalImages++;

    //break;
}

2) Your second image (the interesting one) is JBIG2 encoded. To decode it, you need to add the levigo plugin to your class path, as mentioned here. If you don't, you'll get this message in 1.8.8, unless you have disabled logging:

ERROR [main] org.apache.pdfbox.filter.JBIG2Filter:69 - Can't find an ImageIO plugin to decode the JBIG2 encoded datastream.

(You didn't get that error message because it is the second image that is JBIG2 encoded, and your loop stopped after the first one.)
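
Before running the extraction, it can help to confirm that a JBIG2 reader is actually visible to ImageIO. The sketch below uses only the standard javax.imageio API. The class name ImageIoPluginCheck and the helper hasReaderFor are made up for illustration, and it is an assumption that the plugin registers itself under the format name "JBIG2"; check the plugin's documentation if the lookup returns false unexpectedly.

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

// Hypothetical helper class, not part of PDFBox or the levigo plugin.
final class ImageIoPluginCheck {

    // Returns true if ImageIO can find at least one reader for the given format name.
    static boolean hasReaderFor(String formatName) {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        return readers.hasNext();
    }

    public static void main(String[] args) {
        // Assumption: the JBIG2 plugin registers under the name "JBIG2".
        System.out.println("JBIG2 reader available: " + hasReaderFor("JBIG2"));
        // PNG ships with the JDK, so this is a sanity check of the lookup itself.
        System.out.println("PNG reader available: " + hasReaderFor("png"));
    }
}
```

If the first line prints false, the plugin jar is probably missing from the class path that your application actually runs with.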

Three bonus hints:

3) If you created this image yourself, e.g. on a photocopy machine, find out how to get PDF images without JBIG2 compression; JBIG2 is somewhat risky.

4) Don't use pdResources.getImages(); the getImages call is deprecated. Instead, use getXObjects(), and then check the type of what you get when iterating.

// the map now comes from getXObjects() and may also contain non-image XObjects
Map pageImages = pdResources.getXObjects();
Iterator imageIter = pageImages.keySet().iterator();
while (imageIter.hasNext())
{
    String key = (String) imageIter.next();
    Object o = pageImages.get(key);
    if (o instanceof PDXObjectImage)
    {
        PDXObjectImage pdxObjectImage = (PDXObjectImage) o;

        // do stuff
    }
}

5) Use a foreach loop.
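
To illustrate hints 4 and 5 together, here is a minimal, self-contained sketch of a foreach loop over a map of XObjects with a type check. It does not use PDFBox: XObject, ImageXObject and FormXObject are hypothetical stand-ins for the library's types, and imageKeys is an invented helper, so treat it as a pattern rather than as PDFBox code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ForEachXObjects {

    // Hypothetical stand-ins for the library's XObject hierarchy.
    static class XObject {}
    static class ImageXObject extends XObject {}
    static class FormXObject extends XObject {}

    // Collects the resource names whose value is an image, skipping everything else.
    static List<String> imageKeys(Map<String, XObject> xObjects) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, XObject> entry : xObjects.entrySet()) {
            if (entry.getValue() instanceof ImageXObject) {
                keys.add(entry.getKey());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        Map<String, XObject> xObjects = new LinkedHashMap<>();
        xObjects.put("Im0", new ImageXObject());
        xObjects.put("Fm0", new FormXObject());
        xObjects.put("Im1", new ImageXObject());
        System.out.println(imageKeys(xObjects));
    }
}
```

Iterating over entrySet() avoids the extra get(key) lookup and the casts that the Iterator version needs.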

And if it wasn't already obvious: this has nothing to do with PDF/A :-)

6) I forgot you also asked how to see whether it is a b/w image. Here's some simple code (not optimized) that I mentioned in the comments:

BufferedImage bim = pdxObjectImage.getRGBImage();

boolean bwImage = true;

int w = bim.getWidth();
int h = bim.getHeight();
for (int y = 0; y < h; y++)
{
    for (int x = 0; x < w; x++)
    {
        Color c = new Color(bim.getRGB(x, y));
        int red = c.getRed();
        int green = c.getGreen();
        int blue = c.getBlue();
        if (red == 0 && green == 0 && blue == 0)
        {
            continue;
        }
        if (red == 255 && green == 255 && blue == 255)
        {
            continue;
        }
        bwImage = false;
        break;
    }
    if (!bwImage)
        break;
}
System.out.println(bwImage);
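
Since the question also asks about color versus grayscale, the same pixel scan can be extended to a three-way classification. This is a sketch under stated assumptions, not part of the original answer: the class ImageColorCheck, the enum Kind and the method classify are invented names, it works on any java.awt.image.BufferedImage (for example the one returned by getRGBImage() above), and it treats a pixel as gray only when red, green and blue are exactly equal.

```java
import java.awt.image.BufferedImage;

final class ImageColorCheck {

    enum Kind { BLACK_AND_WHITE, GRAYSCALE, COLOR }

    // Scans every pixel; returns COLOR as soon as one pixel has unequal channels.
    static Kind classify(BufferedImage image) {
        boolean onlyBlackOrWhite = true;
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                int rgb = image.getRGB(x, y);
                int red = (rgb >> 16) & 0xFF;
                int green = (rgb >> 8) & 0xFF;
                int blue = rgb & 0xFF;
                if (red != green || green != blue) {
                    return Kind.COLOR;
                }
                // Channels are equal here, so checking one of them is enough.
                if (red != 0 && red != 255) {
                    onlyBlackOrWhite = false;
                }
            }
        }
        return onlyBlackOrWhite ? Kind.BLACK_AND_WHITE : Kind.GRAYSCALE;
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        img.setRGB(0, 0, 0xFFFFFF);   // one white pixel, the rest stay black
        System.out.println(classify(img));
        img.setRGB(1, 1, 0x808080);   // add a mid-gray pixel
        System.out.println(classify(img));
        img.setRGB(1, 0, 0xFF0000);   // add a red pixel
        System.out.println(classify(img));
    }
}
```

For scanned pages with JPEG artifacts, an exact equality test may be too strict; a small tolerance between channels is one possible relaxation.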
