iTextSharp in .NET, extract image code, I can't get this to work

Question

I am looking to simply extract all images from a PDF. I found some code that looks like exactly what I need:

Private Sub getAllImages(ByVal dict As pdf.PdfDictionary, ByVal images As List(Of Byte()), ByVal doc As pdf.PdfReader)
Dim res As pdf.PdfDictionary = CType(pdf.PdfReader.GetPdfObject(dict.Get(pdf.PdfName.RESOURCES)), pdf.PdfDictionary)
Dim xobj As pdf.PdfDictionary = CType(pdf.PdfReader.GetPdfObject(res.Get(pdf.PdfName.XOBJECT)), pdf.PdfDictionary)

If xobj IsNot Nothing Then
    For Each name As pdf.PdfName In xobj.Keys
        Dim obj As pdf.PdfObject = xobj.Get(name)
        If (obj.IsIndirect) Then
            Dim tg As pdf.PdfDictionary = CType(pdf.PdfReader.GetPdfObject(obj), pdf.PdfDictionary)
            Dim subtype As pdf.PdfName = CType(pdf.PdfReader.GetPdfObject(tg.Get(pdf.PdfName.SUBTYPE)), pdf.PdfName)
            If pdf.PdfName.IMAGE.Equals(subtype) Then
                Dim xrefIdx As Integer = CType(obj, pdf.PRIndirectReference).Number
                Dim pdfObj As pdf.PdfObject = doc.GetPdfObject(xrefIdx)
                Dim str As pdf.PdfStream = CType(pdfObj, pdf.PdfStream)
                Dim bytes As Byte() = pdf.PdfReader.GetStreamBytesRaw(CType(str, pdf.PRStream))

                Dim filter As String = tg.Get(pdf.PdfName.FILTER).ToString
                Dim width As String = tg.Get(pdf.PdfName.WIDTH).ToString
                Dim height As String = tg.Get(pdf.PdfName.HEIGHT).ToString
                Dim bpp As String = tg.Get(pdf.PdfName.BITSPERCOMPONENT).ToString

                If filter = "/FlateDecode" Then
                    bytes = pdf.PdfReader.FlateDecode(bytes, True)
                    Dim pixelFormat As System.Drawing.Imaging.PixelFormat
                    Select Case Integer.Parse(bpp)
                        Case 1
                            pixelFormat = Drawing.Imaging.PixelFormat.Format1bppIndexed
                        Case 24
                            pixelFormat = Drawing.Imaging.PixelFormat.Format24bppRgb
                        Case Else
                            Throw New Exception("Unknown pixel format " + bpp)
                    End Select
                    Dim bmp As New System.Drawing.Bitmap(Int32.Parse(width), Int32.Parse(height), pixelFormat)
                    Dim bmd As System.Drawing.Imaging.BitmapData = bmp.LockBits(New System.Drawing.Rectangle(0, 0, Int32.Parse(width), Int32.Parse(height)), System.Drawing.Imaging.ImageLockMode.WriteOnly, pixelFormat)
                    Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length)
                    bmp.UnlockBits(bmd)
                    Using ms As New MemoryStream
                        bmp.Save(ms, System.Drawing.Imaging.ImageFormat.Png)
                        ''//ToArray returns only the bytes written; GetBuffer can include unused trailing capacity
                        bytes = ms.ToArray()
                    End Using
                End If
                images.Add(bytes)
            ElseIf pdf.PdfName.FORM.Equals(subtype) Or pdf.PdfName.GROUP.Equals(subtype) Then
                getAllImages(tg, images, doc)
            End If
        End If
    Next
End If
End Sub

My question is simply how to call this. I do not know what to pass for the dict parameter or for the images list.

So, in essence, if I have a PDF located at C:\temp\test.pdf that contains images, how do I call this?

    Dim x As New FileStream("C:\image\test.pdf", FileMode.Open)
    Dim reader As New iTextSharp.text.pdf.PdfReader(x)
    getAllImages(?????, ?????? ,reader)

Answer

The way this person wrote this method can seem weird if you don't understand the internals of PDFs and/or iTextSharp. The method takes three parameters. The first is a PdfDictionary, which you obtain by calling GetPageN(Integer) on the PdfReader for each page. The second is a generic list that you need to initialize yourself before the first call. The method is intended to be called in a loop, once per page of the PDF, and each call appends that page's images to the list. The last parameter, the PdfReader, you understand already.

So here's the code to call this method:

''//Source file to read images from
Dim InputFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "FileWithImages.pdf")

''//List to dump images into
Dim Images As New List(Of Byte())

''//Main PDF reader
Dim Reader As New PdfReader(InputFile)

''//Total number of pages in the PDF
Dim PageCount = Reader.NumberOfPages

''//Loop through each page (first page is one, not zero)
For I = 1 To PageCount
    getAllImages(Reader.GetPageN(I), Images, Reader)
Next
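
Once the loop has finished, Images holds one byte array per extracted image. As a minimal sketch of what to do next, the helper below writes each array to its own file. SaveAll, the module name and the ".bin" extension are all hypothetical choices for illustration, not part of iTextSharp; the real file format depends on how each image stream was encoded.

```vbnet
Imports System.Collections.Generic
Imports System.IO

Module SaveImagesSketch
    ' Hypothetical helper (not part of iTextSharp): writes each byte array to its own file.
    ' ".bin" is a placeholder extension because the actual format depends on the stream's filter.
    ' Returns the number of files written.
    Function SaveAll(ByVal images As List(Of Byte()), ByVal outputFolder As String) As Integer
        ' CreateDirectory does nothing if the folder already exists.
        Directory.CreateDirectory(outputFolder)
        For i As Integer = 0 To images.Count - 1
            Dim target As String = Path.Combine(outputFolder, String.Format("image_{0}.bin", i + 1))
            File.WriteAllBytes(target, images(i))
        Next
        Return images.Count
    End Function
End Module
```

With the calling code above, this would be invoked as SaveAll(Images, someFolder) after the For loop, where someFolder is any writable directory you choose.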

VERY, VERY IMPORTANT: iTextSharp is NOT a PDF renderer, it is a PDF composer. What this means is that it knows it has image-like objects but it doesn't necessarily know much about them. To say it another way, iTextSharp knows that a given byte array represents something that the PDF standard says is an image, but it doesn't know or care if it's a JPEG, TIFF, BMP or something else. All iTextSharp cares about is that this object has a few standard properties it can manipulate, like X, Y and effective width and height. PDF renderers handle the job of converting the bytes to an actual image. In this case, you are the PDF renderer, so it's your job to figure out how to process the byte array as an image.
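
One rough way to start "being the renderer" is to look at the first few bytes of each array, since common file formats begin with a fixed signature. The sketch below is only an illustration: GuessExtension is a hypothetical helper, it recognizes just the JPEG and PNG signatures, and anything else (including raw, undecoded sample data) falls through to ".bin".

```vbnet
Imports System

Module ImageSniffSketch
    ' Hypothetical helper: guesses a file extension from well-known leading signature bytes.
    Function GuessExtension(ByVal data As Byte()) As String
        If data Is Nothing Then Return ".bin"

        ' JPEG data starts with the SOI marker FF D8, followed by another marker beginning with FF.
        If data.Length >= 3 AndAlso data(0) = &HFF AndAlso data(1) = &HD8 AndAlso data(2) = &HFF Then
            Return ".jpg"
        End If

        ' PNG data starts with the bytes 89 50 4E 47 (0x89 followed by "PNG").
        If data.Length >= 4 AndAlso data(0) = &H89 AndAlso data(1) = &H50 AndAlso data(2) = &H4E AndAlso data(3) = &H47 Then
            Return ".png"
        End If

        ' Unknown or raw sample data: leave it to the caller to decide.
        Return ".bin"
    End Function
End Module
```

A signature check like this only tells you what the bytes already are; it does not decode anything, so streams that are still compressed with a PDF filter will usually come back as ".bin".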

Specifically, you'll see in that method that there's a line that reads:

If filter = "/FlateDecode" Then

This is often written as a Select Case or switch statement that handles the various values of filter. The method you are referencing only handles FlateDecode, which is pretty common, but there are actually 10 standard filters, including CCITTFaxDecode, JBIG2Decode and DCTDecode (PDF Spec 7.4 - Filters). You should modify the method to include a catch-all of some sort (an Else or Case Else branch) so that you are at least aware of images you aren't set up to process.

Additionally, within the /FlateDecode section you'll see this line:
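
As a minimal sketch of that Select Case shape, the function below maps a filter name to a description of what would need to happen next. DescribeFilter and the module name are hypothetical, and only three filters are listed; it assumes the filter arrives as a single name string, as in the method above, whereas a PDF may also store /Filter as an array of names.

```vbnet
Imports System

Module FilterDispatchSketch
    ' Hypothetical helper: shows the dispatch structure only, it does not decode anything.
    Function DescribeFilter(ByVal filter As String) As String
        Select Case filter
            Case "/FlateDecode"
                Return "inflate the stream, then build a bitmap from the raw samples"
            Case "/DCTDecode"
                Return "stream bytes are JPEG data"
            Case "/JPXDecode"
                Return "stream bytes are JPEG 2000 data"
            Case Else
                ' The catch-all: report what was skipped instead of silently ignoring it.
                Return "unhandled filter: " & filter
        End Select
    End Function
End Module
```

In getAllImages the Case Else branch could log the filter name or add it to a list of skipped images, so you know which images were not processed.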

Select Case Integer.Parse(bpp)

This is reading an attribute associated with the image object (/BitsPerComponent) that tells the renderer how many bits are used for each color component when parsing. Once again, you are the PDF renderer in this case, so it's up to you to figure out what to do. The code that you referenced only accounts for a value of 1 (monochrome) or 24 (which it treats as truecolor), but others should definitely be accounted for, especially 8. Because the value counts bits per color component rather than per pixel, an RGB image normally reports 8 here, for 24 bits per pixel in total.
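
The arithmetic behind those numbers can be sketched as follows. This is an illustration under stated assumptions rather than a complete decoder: the helper names are hypothetical, and the component count (1 for DeviceGray, 3 for DeviceRGB, 4 for DeviceCMYK) is assumed to have been worked out separately from the image's /ColorSpace entry, which the method above does not read.

```vbnet
Imports System

Module BitDepthSketch
    ' Bits per pixel = bits per component * number of color components.
    Function BitsPerPixel(ByVal bitsPerComponent As Integer, ByVal components As Integer) As Integer
        Return bitsPerComponent * components
    End Function

    ' Bytes in one row of PDF image samples; each row starts on a byte boundary,
    ' so the bit count is rounded up to a whole number of bytes.
    Function PdfRowBytes(ByVal width As Integer, ByVal bitsPerComponent As Integer, ByVal components As Integer) As Integer
        Return (width * BitsPerPixel(bitsPerComponent, components) + 7) \ 8
    End Function

    ' Suggests a System.Drawing pixel format name for a few common combinations.
    Function SuggestPixelFormat(ByVal bitsPerComponent As Integer, ByVal components As Integer) As String
        Select Case BitsPerPixel(bitsPerComponent, components)
            Case 1
                Return "Format1bppIndexed"
            Case 8
                Return "Format8bppIndexed"
            Case 24
                Return "Format24bppRgb"
            Case Else
                Return "unhandled"
        End Select
    End Function
End Module
```

Comparing PdfRowBytes with the Stride of the locked BitmapData is a useful sanity check before copying, because a bitmap row in memory can be longer than the PDF row it receives.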

So summing this up, hopefully the code works for you as is, but don't be surprised if it complains a lot and/or misses images. Extracting images can actually be very frustrating at times. If you do run into problems start a new question here referencing this one and hopefully we can help you more!
