How to extract images from a PDF with iText in the correct order?


Question

I am trying to extract images from a PDF file. I found an example on the web that worked fine:

    // Needs java.io.File, java.io.FileOutputStream and iText's
    // PdfReader, PdfObject, PdfStream, PRStream and PdfName.
    File file = new File("example.pdf");
    PdfReader reader = new PdfReader(file.getAbsolutePath());
    // Walk every object in the cross-reference table, in object-number order.
    for (int i = 0; i < reader.getXrefSize(); i++) {
        PdfObject pdfobj = reader.getPdfObject(i);
        if (pdfobj == null || !pdfobj.isStream()) {
            continue;
        }
        PdfStream stream = (PdfStream) pdfobj;
        // Image XObjects carry /Subtype /Image.
        if (PdfName.IMAGE.equals(stream.get(PdfName.SUBTYPE))) {
            // Raw (still encoded) stream bytes, written out unchanged.
            byte[] img = PdfReader.getStreamBytesRaw((PRStream) stream);
            File target = new File(file.getParentFile(), String.format("%05d.jpg", i));
            try (FileOutputStream out = new FileOutputStream(target)) {
                out.write(img);
            }
        }
    }

That gave me all the images, but in the wrong order. My next attempt looked like this:

    // Page numbers in iText are 1-based, so the loop starts at 1.
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        PdfDictionary d = reader.getPageN(i);
        // Look up the page's /Contents entry and resolve it to an object.
        PdfIndirectReference ir = d.getAsIndirectObject(PdfName.CONTENTS);
        PdfObject o = reader.getPdfObject(ir.getNumber());
        PdfStream stream = (PdfStream) o;
        // rest from example above
    }

Although o.isStream() == true, I only get /Length and /Filter, and the stream is only about 100 bytes long. There is no image to be found at all.

My question: what is the correct way to get all the images from a PDF file in the correct order?

Recommended answer

I found an answer elsewhere, namely on the iText mailing list.

The following code worked for me. Note that it uses Apache PDFBox (the 1.x API) rather than iText:

    // Apache PDFBox 1.x classes: PDDocument, PDPage, PDResources, PDXObjectImage.
    // Also needs java.util.List and java.util.Map.
    PDDocument document = PDDocument.load(inFile);
    try {
        // Pages come back in document order.
        List<?> pages = document.getDocumentCatalog().getAllPages();
        for (Object p : pages) {
            PDPage page = (PDPage) p;
            PDResources resources = page.getResources();
            if (resources == null) {
                continue; // this page declares no resources of its own
            }
            Map<?, ?> pageImages = resources.getImages();
            if (pageImages != null) {
                for (Object value : pageImages.values()) {
                    PDXObjectImage image = (PDXObjectImage) value;
                    image.write2OutputStream(/* some output stream */);
                }
            }
        }
    } finally {
        document.close(); // release the file handle
    }
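
If you would rather stay with iText, as the question asks, something along these lines might work. The second attempt in the question finds no images because a page's /Contents stream normally only references image XObjects listed in the page resources; it does not hold the image data itself. This is only a sketch, and it assumes iText 5.x (the `com.itextpdf.text.pdf.parser` package). The class `OrderedImageExtractor` and its `fileName` helper are names made up for illustration, not part of iText. The parser walks each page's content in page order and reports every image as it is drawn:

```java
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfImageObject;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper (assumes iText 5.x): saves images in drawing order, page by page. */
public class OrderedImageExtractor implements RenderListener {
    private final File outputDir;
    private int page;   // page currently being parsed (1-based)
    private int index;  // running image counter within that page

    public OrderedImageExtractor(File outputDir) {
        this.outputDir = outputDir;
    }

    /** Builds names such as "p003_002.jpg" so that files sort in page order. */
    public static String fileName(int page, int index, String type) {
        return String.format("p%03d_%03d.%s", page, index, type);
    }

    public void extract(String pdfPath) throws IOException {
        PdfReader reader = new PdfReader(pdfPath);
        try {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            for (page = 1; page <= reader.getNumberOfPages(); page++) {
                index = 0;
                // Calls renderImage(...) for each image the page content draws.
                parser.processContent(page, this);
            }
        } finally {
            reader.close();
        }
    }

    @Override
    public void renderImage(ImageRenderInfo info) {
        try {
            PdfImageObject image = info.getImage();
            if (image == null) {
                return;
            }
            index++;
            File target = new File(outputDir, fileName(page, index, image.getFileType()));
            try (OutputStream out = new FileOutputStream(target)) {
                out.write(image.getImageAsBytes());
            }
        } catch (IOException e) {
            throw new UncheckedIOException("Could not save image on page " + page, e);
        }
    }

    // Text events are not needed for image extraction.
    @Override public void beginTextBlock() { }
    @Override public void endTextBlock() { }
    @Override public void renderText(TextRenderInfo info) { }
}
```

It could be called as `new OrderedImageExtractor(new File("out")).extract("example.pdf");`, provided the `out` directory already exists. The zero-padded page and index in each file name keep the files in page order when sorted by name.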
