使用iTextSharp提取图像 [英] Extract images using iTextSharp

查看:133
本文介绍了使用iTextSharp提取图像的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我一直在使用此代码取得巨大成功,以便在PDF的每个页面中找到第一个图像。但是,由于未知原因,它现在无法使用某些新PDF。我已经使用了其他工具(Datalogics等),这些工具可以很好地利用这些新的PDF来提取图像。但是,如果我可以使用iTextSharp,我不想购买Datalogics或任何工具。有人能告诉我为什么这段代码没有找到PDF中的图片吗?

I have been using this code with great success to pull out the first image found in each page of a PDF. However, it is now not working with some new PDFs for an uknown reason. I have used other tools (Datalogics, etc) that do pull out the images fine with these new PDFs. However, I do not want to buy Datalogics or any tool if I can use iTextSharp. Can anybody tell me why this code is not finding the images in the PDF?

知道:我的PDF每页只有1张图片,没有别的。

Knowns: my PDFs only have 1 image per page and nothing else.

using iTextSharp.text;
using iTextSharp.text.pdf;
...
public static void ExtractImagesFromPDF(string sourcePdf, string outputPath)
{
    // NOTE:  This will only get the first image it finds per page.
    PdfReader pdf = new PdfReader(sourcePdf);
    RandomAccessFileOrArray raf = new iTextSharp.text.pdf.RandomAccessFileOrArray(sourcePdf);

    try
    {
        for (int pageNumber = 1; pageNumber <= pdf.NumberOfPages; pageNumber++)
        {
            PdfDictionary pg = pdf.GetPageN(pageNumber);
            PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));

            PdfDictionary xobj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
            if (xobj != null)
            {
                foreach (PdfName name in xobj.Keys)
                {
                    PdfObject obj = xobj.Get(name);
                    if (obj.IsIndirect())
                    {
                        PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
                        PdfName type = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
                        if (PdfName.IMAGE.Equals(type))
                        {
                            int XrefIndex = Convert.ToInt32(((PRIndirectReference)obj).Number.ToString(System.Globalization.CultureInfo.InvariantCulture));
                            PdfObject pdfObj = pdf.GetPdfObject(XrefIndex);
                            PdfStream pdfStrem = (PdfStream)pdfObj;
                            byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)pdfStrem);
                            if ((bytes != null))
                            {
                                using (System.IO.MemoryStream memStream = new System.IO.MemoryStream(bytes))
                                {
                                    memStream.Position = 0;
                                    System.Drawing.Image img = System.Drawing.Image.FromStream(memStream);
                                    // must save the file while stream is open.
                                    if (!Directory.Exists(outputPath))
                                        Directory.CreateDirectory(outputPath);

                                    string path = Path.Combine(outputPath, String.Format(@"{0}.jpg", pageNumber));
                                    System.Drawing.Imaging.EncoderParameters parms = new System.Drawing.Imaging.EncoderParameters(1);
                                    parms.Param[0] = new System.Drawing.Imaging.EncoderParameter(System.Drawing.Imaging.Encoder.Compression, 0);
                                    System.Drawing.Imaging.ImageCodecInfo jpegEncoder = Utilities.GetImageEncoder("JPEG");
                                    img.Save(path, jpegEncoder, parms);
                                    break;
                                }
                            }
                        }
                    }
                }
            }
        }
    }

    catch
    {
        throw;
    }
    finally
    {
        pdf.Close();
        raf.Close();
    }
}


推荐答案

我发现我的问题是我没有递归地搜索表单和组中的图像。基本上,原始代码只能找到嵌入在pdf文档根目录下的图像。以下是修订后的方法以及递归搜索页面中图像的新方法(FindImageInPDFDictionary)。注意:仅支持JPEG和非压缩图像的缺陷仍然适用。有关修复这些缺陷的选项,请参阅R Ubben的代码。 HTH某人。

I found that my problem was that I was not recursively searching inside of forms and groups for images. Basically, the original code would only find images that were embedded at the root of the pdf document. Here is the revised method plus a new method (FindImageInPDFDictionary) that recursively searches for images in the page. NOTE: the flaws of only supporting JPEG and non-compressed images still applies. See R Ubben's code for options to fix those flaws. HTH someone.

    public static void ExtractImagesFromPDF(string sourcePdf, string outputPath)
    {
        // NOTE:  This will only get the first image it finds per page.
        PdfReader pdf = new PdfReader(sourcePdf);
        RandomAccessFileOrArray raf = new iTextSharp.text.pdf.RandomAccessFileOrArray(sourcePdf);

        try
        {
            for (int pageNumber = 1; pageNumber <= pdf.NumberOfPages; pageNumber++)
            {
                PdfDictionary pg = pdf.GetPageN(pageNumber);

                // recursively search pages, forms and groups for images.
                PdfObject obj = FindImageInPDFDictionary(pg);
                if (obj != null)
                {

                    int XrefIndex = Convert.ToInt32(((PRIndirectReference)obj).Number.ToString(System.Globalization.CultureInfo.InvariantCulture));
                    PdfObject pdfObj = pdf.GetPdfObject(XrefIndex);
                    PdfStream pdfStrem = (PdfStream)pdfObj;
                    byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)pdfStrem);
                    if ((bytes != null))
                    {
                        using (System.IO.MemoryStream memStream = new System.IO.MemoryStream(bytes))
                        {
                            memStream.Position = 0;
                            System.Drawing.Image img = System.Drawing.Image.FromStream(memStream);
                            // must save the file while stream is open.
                            if (!Directory.Exists(outputPath))
                                Directory.CreateDirectory(outputPath);

                            string path = Path.Combine(outputPath, String.Format(@"{0}.jpg", pageNumber));
                            System.Drawing.Imaging.EncoderParameters parms = new System.Drawing.Imaging.EncoderParameters(1);
                            parms.Param[0] = new System.Drawing.Imaging.EncoderParameter(System.Drawing.Imaging.Encoder.Compression, 0);
                            System.Drawing.Imaging.ImageCodecInfo jpegEncoder = Utilities.GetImageEncoder("JPEG");
                            img.Save(path, jpegEncoder, parms);
                        }
                    }
                }
            }
        }
        catch
        {
            throw;
        }
        finally
        {
            pdf.Close();
            raf.Close();
        }


    }

     private static PdfObject FindImageInPDFDictionary(PdfDictionary pg)
    {
        PdfDictionary res =
            (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));


        PdfDictionary xobj =
          (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
        if (xobj != null)
        {
            foreach (PdfName name in xobj.Keys)
            {

                PdfObject obj = xobj.Get(name);
                if (obj.IsIndirect())
                {
                    PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);

                    PdfName type =
                      (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));

                    //image at the root of the pdf
                    if (PdfName.IMAGE.Equals(type))
                    {
                        return obj;
                    }// image inside a form
                    else if (PdfName.FORM.Equals(type))
                    {
                        return FindImageInPDFDictionary(tg);
                    } //image inside a group
                    else if (PdfName.GROUP.Equals(type))
                    {
                        return FindImageInPDFDictionary(tg);
                    }

                }
            }
        }

        return null;

    }

这篇关于使用iTextSharp提取图像的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆