Extracting PDF images in the correct order with iTextSharp


Question

I'm trying to extract images from a PDF file, but I need to retrieve them in the correct order so that I can pick out the right image.

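using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;
using iTextSharp.text.pdf;

class Program
{
    static void Main(string[] args)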
    {
        string filename = "D:\\910723575_marca_coletiva.pdf";

        PdfReader pdfReader = new PdfReader(filename);

        var imagemList = ExtraiImagens(pdfReader);

        // convert each byte[] to a bitmap
        List<Bitmap> bmpSrcList = new List<Bitmap>();
        IList<byte[]> imagensProcessadas = new List<byte[]>();

        foreach (var imagem in imagemList)
        {

            System.Drawing.ImageConverter converter = new System.Drawing.ImageConverter();
            try
            {
                Image img = (Image)converter.ConvertFrom(imagem);
                ConsoleWriteImage(img);
                imagensProcessadas.Add(imagem);
            }
            catch (Exception)
            {
                continue;
            }

        }

        System.Console.ReadLine();
    }

    public static void ConsoleWriteImage(Image img)
    {
        int sMax = 39;
        decimal percent = Math.Min(decimal.Divide(sMax, img.Width), decimal.Divide(sMax, img.Height));
        Size resSize = new Size((int)(img.Width * percent), (int)(img.Height * percent));
        Func<System.Drawing.Color, int> ToConsoleColor = c =>
        {
            int index = (c.R > 128 | c.G > 128 | c.B > 128) ? 8 : 0;
            index |= (c.R > 64) ? 4 : 0;
            index |= (c.G > 64) ? 2 : 0;
            index |= (c.B > 64) ? 1 : 0;
            return index;
        };
        Bitmap bmpMin = new Bitmap(img, resSize.Width, resSize.Height);
        Bitmap bmpMax = new Bitmap(img, resSize.Width * 2, resSize.Height * 2);
        for (int i = 0; i < resSize.Height; i++)
        {
            for (int j = 0; j < resSize.Width; j++)
            {
                Console.ForegroundColor = (ConsoleColor)ToConsoleColor(bmpMin.GetPixel(j, i));
                Console.Write("██");
            }

            Console.BackgroundColor = ConsoleColor.Black;
            Console.Write("    ");

            for (int j = 0; j < resSize.Width; j++)
            {
                Console.ForegroundColor = (ConsoleColor)ToConsoleColor(bmpMax.GetPixel(j * 2, i * 2));
                Console.BackgroundColor = (ConsoleColor)ToConsoleColor(bmpMax.GetPixel(j * 2, i * 2 + 1));
                Console.Write("▀");

                Console.ForegroundColor = (ConsoleColor)ToConsoleColor(bmpMax.GetPixel(j * 2 + 1, i * 2));
                Console.BackgroundColor = (ConsoleColor)ToConsoleColor(bmpMax.GetPixel(j * 2 + 1, i * 2 + 1));
                Console.Write("▀");
            }
            System.Console.WriteLine();
        }
    }

    public static IList<byte[]> ExtraiImagens(PdfReader pdfReader) 
    {
        var data = new byte[] { };

        IList<byte[]> imagensList = new List<byte[]>();

        for (int numPag = 1; numPag <= 3; numPag++)
        //for (int numPag = 1; numPag <= pdfReader.NumberOfPages; numPag++)
        {
            var pg = pdfReader.GetPageN(numPag);
            var res = PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES)) as PdfDictionary;
            var xobj = PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT)) as PdfDictionary;
            if (xobj == null) continue;

            var keys = xobj.Keys;
            if (keys == null) continue;

            PdfObject obj = null;
            PdfDictionary tg = null;

            for (int key = 0; key < keys.Count; key++)
            {
                obj = xobj.Get(keys.ElementAt(key));

                if (!obj.IsIndirect()) continue;

                tg = PdfReader.GetPdfObject(obj) as PdfDictionary;

                var type = PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE)) as PdfName;
                if (!PdfName.IMAGE.Equals(type)) continue;

                int XrefIndex = (obj as PRIndirectReference).Number;
                var pdfStream = pdfReader.GetPdfObject(XrefIndex) as PRStream;

                data = PdfReader.GetStreamBytesRaw(pdfStream);

                imagensList.Add(PdfReader.GetStreamBytesRaw(pdfStream));
            }
        }

        return imagensList;
    }
}

The ConsoleWriteImage method only prints an image to the console; I used it to study the order in which iTextSharp returns the images to my code.

Any help?

Recommended answer

Unfortunately, the OP has not explained what the correct order is. This is not self-evident, because some aspects of a PDF are not apparent to a program, only to a human reader viewing the rendered PDF.

At the very least, though, the OP probably wants to get the images page by page. This is not what his current code provides: it scans the whole base of objects inside the PDF for image objects, so he will get image objects, but the order may be completely random; in particular, he may even get images that are contained in the PDF but not used on any of its pages.

To retrieve images page by page (and only images actually used), one should use the parser framework, e.g.

PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
MyImageRenderListener listener = new MyImageRenderListener();
for (int i = 1; i <= reader.NumberOfPages; i++) {
  parser.ProcessContent(i, listener);
} 
// Process images in the List listener.MyImages
// with names in listener.ImageNames

(Excerpt from the ExtractImages.cs iTextSharp example)

where MyImageRenderListener is defined to collect the images:

public class MyImageRenderListener : IRenderListener {
    /** the byte array of the extracted images */
    private List<byte[]> _myImages;
    public List<byte[]> MyImages {
      get { return _myImages; }
    }
    /** the file names of the extracted images */
    private List<string> _imageNames;
    public List<string> ImageNames { 
      get { return _imageNames; }
    } 

    public MyImageRenderListener() {
      _myImages = new List<byte[]>();
      _imageNames = new List<string>();
    }

    [...]

    public void RenderImage(ImageRenderInfo renderInfo) {
      try {
        PdfImageObject image = renderInfo.GetImage();
        if (image == null || image.GetImageBytesType() == PdfImageObject.ImageBytesType.JBIG2) 
          return;

        _imageNames.Add(string.Format("Image{0}.{1}", renderInfo.GetRef().Number, image.GetFileType() ) );
        _myImages.Add(image.GetImageAsBytes());
      }
      catch
      {
      }
    }

    [...]      
}

(Excerpt from the MyImageRenderListener.cs iTextSharp example)
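The "process images" step left as a comment in the loop above could look roughly like the following sketch. It is not part of the iTextSharp examples: ImageSaver and SaveInOrder are hypothetical names, and the index prefix on each file name is an assumption added here so that the files sort in extraction order and an image drawn more than once (which would yield the same "Image{number}.{type}" name) does not overwrite an earlier file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration; not part of iTextSharp.
public static class ImageSaver
{
    // Writes images[i] to outputDir as "<index>_<names[i]>" and returns the
    // written paths in the same order as the input lists.
    public static List<string> SaveInOrder(IList<byte[]> images, IList<string> names, string outputDir)
    {
        if (images.Count != names.Count)
            throw new ArgumentException("images and names must have the same length");

        Directory.CreateDirectory(outputDir);
        var paths = new List<string>();
        for (int i = 0; i < images.Count; i++)
        {
            // The zero-padded index keeps name-sorted listings in extraction order.
            string fileName = string.Format("{0:D4}_{1}", i, names[i]);
            string path = Path.Combine(outputDir, fileName);
            File.WriteAllBytes(path, images[i]);
            paths.Add(path);
        }
        return paths;
    }
}
```

With the listener above, a call might be ImageSaver.SaveInOrder(listener.MyImages, listener.ImageNames, "extracted"), where "extracted" is an arbitrary output folder.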

The ImageRenderInfo renderInfo also contains information on the location and orientation of the image on the page in question, which might help to deduce the correct order the OP is after.
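As a sketch of how such position information could be used, suppose the listener also recorded, for each image, the page number and the translation components of the image matrix (in iTextSharp 5 these should be available as renderInfo.GetImageCTM()[Matrix.I31] and [Matrix.I32]). The PlacedImage type and the InReadingOrder method below are hypothetical, and "page by page, top to bottom, then left to right" is only one assumed meaning of "correct order", since the OP never defined it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record for illustration; not part of iTextSharp.
public sealed class PlacedImage
{
    public int Page { get; set; }
    public float X { get; set; }      // x of the image origin on the page
    public float Y { get; set; }      // y of the image origin on the page
    public byte[] Bytes { get; set; }
}

public static class ImageOrdering
{
    // Assumed order: by page, then top to bottom, then left to right.
    // In default PDF user space the origin is at the bottom-left,
    // so a larger Y means the image sits higher on the page.
    public static List<PlacedImage> InReadingOrder(IEnumerable<PlacedImage> images)
    {
        return images
            .OrderBy(i => i.Page)
            .ThenByDescending(i => i.Y)
            .ThenBy(i => i.X)
            .ToList();
    }
}
```

This ignores page rotation and overlapping images; if the document uses either, the comparison would need to be adapted.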
