Extract images and their names from a PDF using iTextSharp

Question

I am using iTextSharp (C#) to extract images and their names from a catalog PDF. I am able to extract the images, but I am struggling to extract the corresponding name for each image, as shown in the attached screenshot, and to save the file under that name. Please find the code below and let me know your suggestions. Sample PDF: https://docdro.id/PwBsNR9

Code:

private static List<System.Drawing.Image> ExtractImages(String PDFSourcePath)
{
    List<System.Drawing.Image> ImgList = new List<System.Drawing.Image>();

    iTextSharp.text.pdf.RandomAccessFileOrArray RAFObj = null;
    iTextSharp.text.pdf.PdfReader PDFReaderObj = null;
    iTextSharp.text.pdf.PdfObject PDFObj = null;
    iTextSharp.text.pdf.PdfStream PDFStremObj = null;

    try
    {
        RAFObj = new iTextSharp.text.pdf.RandomAccessFileOrArray(PDFSourcePath);
        PDFReaderObj = new iTextSharp.text.pdf.PdfReader(RAFObj, null);

        for (int i = 0; i <= PDFReaderObj.XrefSize - 1; i++)
        {
            PDFObj = PDFReaderObj.GetPdfObject(i);

            if ((PDFObj != null) && PDFObj.IsStream())
            {
                PDFStremObj = (iTextSharp.text.pdf.PdfStream)PDFObj;
                iTextSharp.text.pdf.PdfObject subtype = PDFStremObj.Get(iTextSharp.text.pdf.PdfName.SUBTYPE);
                // Only streams whose /Subtype is /Image are image XObjects.
                if ((subtype != null) && subtype.ToString() == iTextSharp.text.pdf.PdfName.IMAGE.ToString())
                {
                    try
                    {
                        iTextSharp.text.pdf.parser.PdfImageObject PdfImageObj =
                            new iTextSharp.text.pdf.parser.PdfImageObject((iTextSharp.text.pdf.PRStream)PDFStremObj);

                        System.Drawing.Image ImgPDF = PdfImageObj.GetDrawingImage();
                        ImgList.Add(ImgPDF);
                    }
                    catch (Exception ex)
                    {
                        // Skip images that cannot be decoded, but report them instead of failing silently.
                        Console.WriteLine("Image in object {0} could not be read: {1}", i, ex.Message);
                    }
                }
            }
        }
    }
    finally
    {
        // Close the reader even when parsing fails; exceptions propagate with their original stack trace.
        if (PDFReaderObj != null)
        {
            PDFReaderObj.Close();
        }
    }
    return ImgList;
}

Recommended answer

Unfortunately, the example PDF is not tagged. Title text and image therefore have to be associated some other way, either by analyzing their positions relative to each other or by exploiting a pattern in the content streams.

In the case at hand, analyzing the positions relative to each other is feasible, because the title is always either drawn (at least partially) on the matching image or is the text right beneath it. One could therefore extract the text with its position from a page in a first pass and the images in a second pass, looking for each image's title among the previously extracted text that lies in the image area or right beneath it. Alternatively, one could first extract the images with position and size and then extract the text in those areas.
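The matching step of the position-based approach could be sketched roughly as follows. This is only an illustration under stated assumptions: `ImageBox`, `TextChunk`, `TitleMatcher` and the `margin` parameter are hypothetical names that do not come from iTextSharp, the coordinates are assumed to be in default PDF page space (y grows upward), and picking the chunk closest to the bottom edge of the image is one possible tie-break, not something the sample PDF has been verified against.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical value types; in a real parser these values would be
// collected while processing the page content.
public sealed class ImageBox
{
    public double Left, Bottom, Right, Top;
}

public sealed class TextChunk
{
    public double X, Y;   // start of the text baseline in page coordinates
    public string Text;
}

public static class TitleMatcher
{
    // Returns the text of the chunk that starts inside the image area, or at
    // most `margin` units beneath it, and lies closest to the bottom edge of
    // the image. Returns null when no chunk qualifies.
    public static string FindTitle(ImageBox image, IEnumerable<TextChunk> chunks, double margin)
    {
        string best = null;
        double bestDistance = double.MaxValue;

        foreach (TextChunk chunk in chunks)
        {
            bool insideHorizontally = chunk.X >= image.Left && chunk.X <= image.Right;
            bool insideVertically = chunk.Y >= image.Bottom - margin && chunk.Y <= image.Top;
            if (!insideHorizontally || !insideVertically)
            {
                continue;
            }

            double distance = Math.Abs(chunk.Y - image.Bottom);
            if (distance < bestDistance)
            {
                bestDistance = distance;
                best = chunk.Text;
            }
        }

        return best;
    }
}
```

A suitable `margin` depends on the layout of the catalog pages and would need to be tuned against the actual document.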

But there is also a pattern in the content streams: the title is always drawn in a single text drawing instruction right after the corresponding image is drawn. One can therefore extract, in a single pass, each image together with the next drawn text as its associated title.

Either approach can be implemented using the iText parser API. For the latter approach, for example, one first implements a render listener that behaves as described, i.e. one that saves each image and the text that follows it:

using System;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

internal class ImageWithTitleRenderListener : IRenderListener
{
    int imageNumber = 0;
    String format;
    bool expectingTitle = false;

    public ImageWithTitleRenderListener(String format)
    {
        this.format = format;
    }

    public void BeginTextBlock()
    { }

    public void EndTextBlock()
    { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        if (expectingTitle)
        {
            expectingTitle = false;
            File.WriteAllText(string.Format(format, imageNumber, "txt"), renderInfo.GetText());
        }
    }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        imageNumber++;
        expectingTitle = true;

        PdfImageObject imageObject = renderInfo.GetImage();

        if (imageObject == null)
        {
            Console.WriteLine("Image {0} could not be read.", imageNumber);
        }
        else
        {
            File.WriteAllBytes(string.Format(format, imageNumber, imageObject.GetFileType()), imageObject.GetImageAsBytes());
        }
    }
}

Then one parses the document pages using that render listener:

using (PdfReader reader = new PdfReader(@"EVERMOTION ARCHMODELS VOL.78.pdf"))
{
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    ImageWithTitleRenderListener listener = new ImageWithTitleRenderListener(@"EVERMOTION ARCHMODELS VOL.78-{0:D3}.{1}");
    for (var i = 1; i <= reader.NumberOfPages; i++)
    {
        parser.ProcessContent(i, listener);
    }
}
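The listener above stores each title in a separate text file next to the numbered image. If the image itself should be saved under its title, as the question asks, the title first has to be turned into a valid file name. A minimal sketch of such a helper follows; `TitleFileNames`, `ToSafeFileName` and the `fallback` parameter are hypothetical names introduced here for illustration, and replacing invalid characters with an underscore is just one possible policy.

```csharp
using System;
using System.IO;
using System.Text;

public static class TitleFileNames
{
    // Replaces characters that are not allowed in file names on the current
    // platform with '_' and trims surrounding whitespace. Returns `fallback`
    // when the title is null, empty, or whitespace only.
    public static string ToSafeFileName(string title, string fallback)
    {
        if (string.IsNullOrWhiteSpace(title))
        {
            return fallback;
        }

        char[] invalid = Path.GetInvalidFileNameChars();
        StringBuilder builder = new StringBuilder(title.Length);
        foreach (char c in title.Trim())
        {
            builder.Append(Array.IndexOf(invalid, c) >= 0 ? '_' : c);
        }

        return builder.ToString();
    }
}
```

Note that the listener writes the image before its title has been seen. Using the title as the file name would therefore mean either holding the image bytes until `RenderText` is called or renaming the already written file afterwards, for example with `File.Move`. Duplicate titles would also need to be handled, for instance by appending the image number.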
