Extract Images from Pdf via iTextSharp 4.1.6.0

Problem description

Hello all (and you too, Bruno :) ).
I'm using iTextSharp 4.1.6.0, ported to Xamarin.Android.
For various reasons I need to extract images from a PDF.
I found many examples, but they don't seem usable in my case because some classes (such as
ImageCodeInfo, ImageRenderInfo, System.Drawing.Imaging.EncoderParameters, PdfImageObject, etc.) don't exist.

One example looks fine, though. Here it is:

void ExtractJpeg(string file)
{
    var dir1 = Path.GetDirectoryName(file);
    var fn = Path.GetFileNameWithoutExtension(file);
    var dir2 = Path.Combine(dir1, fn);
    if (!Directory.Exists(dir2)) Directory.CreateDirectory(dir2);

    var pdf = new PdfReader(file);
    int n = pdf.NumberOfPages;
    for (int i = 1; i <= n; i++)
    {
        var pg = pdf.GetPageN(i);
        var res = PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES)) as PdfDictionary;
        var xobj = PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT)) as PdfDictionary;
        if (xobj == null) continue;

        var keys = xobj.Keys;
        if (keys.Count == 0) continue;

        var obj = xobj.Get(keys.ElementAt(0));
        if (!obj.IsIndirect()) continue;

        var tg = PdfReader.GetPdfObject(obj) as PdfDictionary;
        var type = PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE)) as PdfName;
        if (!PdfName.IMAGE.Equals(type)) continue;

        int XrefIndex = (obj as PRIndirectReference).Number;
        var pdfStream = pdf.GetPdfObject(XrefIndex) as PRStream;
        var data = PdfReader.GetStreamBytesRaw(pdfStream);
        var jpeg = Path.Combine(dir2, string.Format("{0:0000}.jpg", i));
        File.WriteAllBytes(jpeg, data);
    }
}    

The problem is in this line:

var obj = xobj.Get(keys.ElementAt(0));  

Error log:

The type arguments for method `System.Linq.ParallelEnumerable.ElementAt(this System.Linq.ParallelQuery, int)' cannot be inferred from the usage. Try specifying the type arguments explicitly

I have no idea how to work around this. Can someone explain it to me?

Also, I would like to know whether there is another way to extract images from a PDF.
Thanks!!

Solution

First, the obligatory speech about upgrading from old, obsolete and no longer officially supported software:

Please upgrade to the most recent version of iTextSharp. I know that you're going to say that you can't use iText's new license but please read their sales FAQ, specifically the "Why shouldn't I use..." section which addresses 4.1.6. Please remember that in most countries, accepting the license actually enters you into a legal contract so I would also have someone with legal experience read that, too. Since you say that you are using Xamarin I'm thinking that you are submitting this to a store, too, so this is even more important because the problems can multiply very fast.

Also, there's a new version of PDF coming out pretty soon and you'll probably want to be on track to support that, too.

Second, your code makes a giant and incorrect assumption that all images in a PDF are JPEGs. See this post and this post for a bit of a discussion on it. Maybe your PDFs are all JPEGs so this works for you but there's a good chance that this will break "tomorrow".
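To make the JPEG assumption concrete: `GetStreamBytesRaw` returns the stream bytes without decoding them, so the result is only a usable .jpg file when the image was stored as JPEG data in the first place. Streams stored with other filters yield compressed pixel data that no image viewer will open. A minimal sketch of a guard, assuming it is enough to look at the first bytes of the raw data; `ImageSniffer`, `LooksLikeJpeg` and `ExtensionFor` are hypothetical names, not part of iTextSharp:

```csharp
using System;

static class ImageSniffer
{
    // Hypothetical helper: returns true when the bytes begin with the JPEG
    // start-of-image marker (FF D8) followed by another marker prefix (FF).
    public static bool LooksLikeJpeg(byte[] data)
    {
        return data != null
            && data.Length >= 3
            && data[0] == 0xFF
            && data[1] == 0xD8
            && data[2] == 0xFF;
    }

    // Picks a file extension for the raw stream bytes; ".bin" flags data
    // that would need decoding before it can be opened as an image.
    public static string ExtensionFor(byte[] data)
    {
        return LooksLikeJpeg(data) ? ".jpg" : ".bin";
    }
}
```

In the question's code, the check would sit between the `GetStreamBytesRaw` call and `File.WriteAllBytes`, so that non-JPEG streams are skipped or saved under a different extension instead of being written as broken .jpg files.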

Third, I can't get ElementAt to work with an ICollection. I don't know if I'm missing an extension or a using somewhere but it appears that you copied the code from a five year old post here that came from a six year old post here. I'm also not sure why the "first" element is needed anyway, that's weird. The solution is to just loop over the keys instead of trying to just explicitly grab one. Instead of:

var obj = xobj.Get(keys.ElementAt(0));
//...
File.WriteAllBytes(jpeg, data);

Loop over each key:

foreach (PdfName k in keys) {
    var obj = xobj.Get(k);
    //...
    File.WriteAllBytes(jpeg, data);
}

This small change will make us all cry but it should make extraction of images at least work.
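A likely explanation for the compiler error, assuming that `xobj.Keys` in 4.1.6 is a non-generic `System.Collections.ICollection`: LINQ's `ElementAt<T>` is defined for `IEnumerable<T>`, so the compiler has nothing to infer `T` from. The sketch below shows the standard `Cast<T>()` workaround on a plain `Hashtable` as a stand-in for the PDF dictionary. `KeyAccess`, `FirstKey` and `ImageFileName` are hypothetical names used only for illustration:

```csharp
using System;
using System.Collections;
using System.Linq;

static class KeyAccess
{
    // Returns the first key of a non-generic collection, or null if empty.
    // Cast<string>() turns the untyped ICollection into IEnumerable<string>,
    // which is what generic LINQ operators such as ElementAt require.
    public static string FirstKey(ICollection keys)
    {
        return keys.Count == 0 ? null : keys.Cast<string>().ElementAt(0);
    }

    // Hypothetical naming scheme: page number plus a per-page image counter,
    // so that several images on one page do not overwrite each other.
    public static string ImageFileName(int page, int index)
    {
        return string.Format("{0:0000}-{1:00}.jpg", page, index);
    }
}
```

If the assumption about `Keys` holds, the equivalent call in the question's code would be `keys.Cast<PdfName>().ElementAt(0)`; this has not been verified against 4.1.6. The `foreach` loop above needs no LINQ at all, which is why it compiles. Note that the original file name depends only on the page number `i`, so once every key is visited, a page with several images would write them all to the same file; a per-image counter as in `ImageFileName` avoids that.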
