Extract images from a PDF document

Question

I know similar questions have been asked before; however, they are hideously out of date (some going back to 2006).

I have a .NET 3.5 app (with iTextSharp 5) that I am converting to .NET Core (iText 7). It extracts signatures from FedEx tracking documents, which are sent as a byte[] array via a SOAP service. This code has worked very well for many years now with minor updates. There are a couple of images in the PDF document returned from FedEx, but the signature block is not the 110x46 image (that one is the FedEx logo in the PDF file, which is why I skip over it).

PdfReader pdf = new PdfReader(FedexData);

for(Int32 iPage = 1; iPage <= pdf.NumberOfPages; iPage++)
{
   PdfDictionary pg = pdf.GetPageN(iPage);
   PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
   PdfDictionary xobj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));

   foreach(PdfName name in xobj.Keys)
   {
      PdfObject obj = xobj.Get(name);

      if(obj.IsIndirect())
      {
          PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
          String width = tg.Get(PdfName.WIDTH).ToString();
          String height = tg.Get(PdfName.HEIGHT).ToString();
          String decode = tg.Contains(PdfName.DECODEPARMS) ? tg.Get(PdfName.DECODEPARMS).ToString() : "";
          String bitspercomponent = tg.Contains(PdfName.BITSPERCOMPONENT) ? tg.Get(PdfName.BITSPERCOMPONENT).ToString() : "";
          String colorspace = tg.Contains(PdfName.COLORSPACE) ? tg.Get(PdfName.COLORSPACE).ToString() : "";
          if(width != "110" && height != "46" && bitspercomponent != "1")
          {
                ImageRenderInfo imgRI = ImageRenderInfo.CreateForXObject(new GraphicsState(), (PRIndirectReference)obj, tg);
                PdfImageObject image = imgRI.GetImage();
                Image dotnetImg = image.GetDrawingImage();

                if(dotnetImg != null)
                {
                    // process image and update database
                }
          }
      }
   }
}

Suffice it to say, this code doesn't work with iText 7. I attempted to port some of it, but I do not seem to be getting the images, so I'm clearly doing something incorrect. It's my own ignorance of the iText 7 functions, which do not seem to offer backward compatibility with the older library.

Can someone point me to a tutorial for iText 7 which deals with extracting the images stored in a PDF file? I have found tutorials on how to extract a PDF as an image (not what I want) and how to store images in a PDF document (the opposite of what I want), and the answers to similar questions are based on older libraries which no longer function.

Thanks, Vin

Recommended answer

Here is a Java implementation of an IEventListener which you can use to access all images from a specific page:

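import com.itextpdf.kernel.pdf.PdfName;
import com.itextpdf.kernel.pdf.PdfObject;
import com.itextpdf.kernel.pdf.PdfString;
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.ImageRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.listener.IEventListener;
import com.itextpdf.kernel.pdf.xobject.PdfImageXObject;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Set;

public class MyImageRenderListener implements IEventListener {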

    protected String path;
    protected String extension;

    public MyImageRenderListener(String path) {
        this.path = path;
    }

    public void eventOccurred(IEventData data, EventType type) {
        switch (type) {
            case RENDER_IMAGE:
                try {
                    String filename;
                    FileOutputStream os;
                    ImageRenderInfo renderInfo = (ImageRenderInfo) data;
                    PdfImageXObject image = renderInfo.getImage();
                    if (image == null) {
                        return;
                    }

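                    // You can access various values from the image dictionary here.
                    // DecodeParms is normally a dictionary or an array rather than a string,
                    // so it is read as a generic PdfObject.
                    com.itextpdf.kernel.pdf.PdfObject decodeParamsObj = image.getPdfObject().get(PdfName.DecodeParms);
                    String decodeParams = decodeParamsObj != null ? decodeParamsObj.toString() : null;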

                    byte[] imageByte = image.getImageBytes(true);
                    extension = image.identifyImageFileExtension();
                    // You can use raw image bytes directly, or write image to disk
                    filename = String.format(path, image.getPdfObject().getIndirectReference().getObjNumber(), extension);
                    os = new FileOutputStream(filename);
                    os.write(imageByte);
                    os.flush();
                    os.close();
                } catch (com.itextpdf.io.IOException | IOException e) {
                    System.out.println(e.getMessage());
                }
                break;

            default:
                break;
        }
    }

    public Set<EventType> getSupportedEvents() {
        return null;
    }
}

I've commented some of the parts that may be of interest to you.
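
To carry over the size filter from the question, one option is a small predicate called inside the RENDER_IMAGE case before the bytes are written. The sketch below is a hypothetical helper, not part of iText: the class name `LogoFilter` and the method `isLogo` are made up for illustration, and the 110x46 dimensions come from the question. It assumes the width and height are read from the image via `getWidth()` and `getHeight()` on `PdfImageXObject`, which return floats. Unlike the original C# condition, which rejects an image when any single value matches, this version skips an image only when both dimensions match the logo.

```java
// Hypothetical helper (not an iText class): decides whether an image is the logo to skip.
class LogoFilter {

    // Logo dimensions as stated in the question.
    static final int LOGO_WIDTH = 110;
    static final int LOGO_HEIGHT = 46;

    // Returns true only when BOTH dimensions match the logo size.
    static boolean isLogo(float width, float height) {
        return Math.round(width) == LOGO_WIDTH && Math.round(height) == LOGO_HEIGHT;
    }

    public static void main(String[] args) {
        System.out.println(isLogo(110f, 46f));  // logo-sized image
        System.out.println(isLogo(110f, 200f)); // same width, different height
    }
}
```

In the listener this could be used as `if (LogoFilter.isLogo(image.getWidth(), image.getHeight())) return;` right after the null check.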

And here is the code that actually invokes the processor for all pages, or for any pages of interest:

PdfDocument pdfDoc = new PdfDocument(new PdfReader(src));
IEventListener listener = new MyImageRenderListener(outPath);
PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
for (int i = 1; i <= pdfDoc.getNumberOfPages(); i++) {
    parser.processPageContent(pdfDoc.getPage(i));
}
pdfDoc.close();
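
Note that the listener passes its `path` argument to `String.format` together with the object number and the file extension, so `outPath` has to be a format pattern with two placeholders rather than a plain directory name. The sketch below illustrates this with the standard library only; the class name `ImagePathPattern`, the method `buildFilename`, and the pattern `extracted/img_%s.%s` are assumptions chosen for the example.

```java
// Hypothetical helper; only the String.format call mirrors the listener above.
class ImagePathPattern {

    // First placeholder receives the PDF object number, second the file extension.
    static String buildFilename(String pattern, int objNumber, String extension) {
        return String.format(pattern, objNumber, extension);
    }

    public static void main(String[] args) {
        String outPath = "extracted/img_%s.%s"; // assumed pattern, adjust as needed
        System.out.println(buildFilename(outPath, 12, "png")); // prints "extracted/img_12.png"
    }
}
```

The target directory (here `extracted`) must already exist, since `FileOutputStream` does not create missing parent directories.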
