iTextSharp: Split pages size equals file size

Problem description

Here is how I split a large PDF (144 MB):

using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public int SplitAndSave(string inputPath, string outputPath)
{
    using (PdfReader reader = new PdfReader(inputPath))
    {
        for (int pagenumber = 1; pagenumber <= reader.NumberOfPages; pagenumber++)
        {
            // One output file per page: 1.pdf, 2.pdf, ...
            string filename = pagenumber.ToString() + ".pdf";

            Document document = new Document();
            PdfCopy copy = new PdfCopy(document, new FileStream(Path.Combine(outputPath, filename), FileMode.Create));

            document.Open();

            // Copies the page together with everything its /Resources dictionary references.
            copy.AddPage(copy.GetImportedPage(reader, pagenumber));

            // Closing the document also closes the PdfCopy and its output stream.
            document.Close();
        }
        return reader.NumberOfPages;
    }
}

For most PDFs (small ones, and I guess in an older format), everything works fine. But for bigger ones (which perhaps use something like refstreams for better compression), each split file opens as a single page, yet its size equals the size of the original PDF. What can I do?

Solution

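In the case of your document Top_Gear_Magazine_2012_09.pdf, the reason is indeed the one I mentioned: all pages refer to object 2 0 R as their /Resources, and the dictionary in 2 0 obj in turn references all images in the PDF. Because PdfCopy copies a page together with everything reachable from its /Resources, every single-page file ends up carrying all images of the original.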

To split that document into partial documents containing only the images required, you should preprocess the document by first finding out which images belong to which pages and then creating individual /Resources dictionaries for all pages.

As you already use iText in this context, you can also use it to find out which images are required for which pages. Use the iText parser package to initially parse the PDF page by page using a RenderListener implementation whose RenderImage method simply remembers which image objects are used on the current page. (As a special twist, iText hides the name of the image XObject in question; you get the indirect object, though, and can query its object and generation number which suffices for the next step.)

In a second step, you open the document in a PdfStamper and iterate over the pages. For each page you retrieve the /Resources dictionary and copy it, but only copy those XObject references that point to one of the image objects whose object number and generation you remembered for that page during the first step. Finally, you set the reduced copy as the /Resources dictionary of the page in question.
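The filtering in this second step can be sketched without any PDF library. The following is only a model, not iText code: a plain map stands in for the shared /XObject dictionary, and the names ObjRef and ResourcePruner are hypothetical helpers invented for this illustration. In a real implementation the map would be a PdfDictionary and the references would be indirect references read from the document.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Hypothetical stand-in for an indirect reference such as "11 0 R". */
final class ObjRef {
    final int number;
    final int generation;

    ObjRef(int number, int generation) {
        this.number = number;
        this.generation = generation;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ObjRef)) {
            return false;
        }
        ObjRef other = (ObjRef) o;
        return number == other.number && generation == other.generation;
    }

    @Override
    public int hashCode() {
        return Objects.hash(number, generation);
    }
}

/** Hypothetical helper modelling the per-page reduction of a shared /XObject dictionary. */
final class ResourcePruner {
    private ResourcePruner() {
    }

    /**
     * Returns a new map holding only those entries of the shared dictionary whose
     * reference was recorded for the page in step one. The shared map is not
     * modified, because the other pages still need their own reduced copies of it.
     */
    static Map<String, ObjRef> pruneXObjects(Map<String, ObjRef> sharedXObjects, Set<ObjRef> usedOnPage) {
        Map<String, ObjRef> pruned = new LinkedHashMap<>();
        for (Map.Entry<String, ObjRef> entry : sharedXObjects.entrySet()) {
            if (usedOnPage.contains(entry.getValue())) {
                pruned.put(entry.getKey(), entry.getValue());
            }
        }
        return pruned;
    }
}
```

The important design point is that each page gets its own copy. Editing the shared dictionary in place would remove images that other pages still use.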

The resulting PDF should split just fine.

PS: A very similar issue recently came up on the iText mailing list. In that thread the solution recipe given here has been improved. To get around the difficulties caused by iText hiding the xobject name, I would now propose to intervene before the name is lost by using a different ContentOperator for "Do". Here is the Java version:

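import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import com.itextpdf.text.pdf.PdfLiteral;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.parser.ContentOperator;
import com.itextpdf.text.pdf.parser.PdfContentStreamProcessor;

class Do implements ContentOperator
{
    public void invoke(PdfContentStreamProcessor processor, PdfLiteral operator, ArrayList<PdfObject> operands) throws IOException
    {
        // The first operand of "Do" is the resource name of the xobject being drawn.
        PdfName xobjectName = (PdfName)operands.get(0);
        names.add(xobjectName);
    }

    final List<PdfName> names = new ArrayList<PdfName>();
}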

This content operator simply collects the names of the used xobjects, i.e. the xobject resources to keep for the given page.
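To make concrete what such an operator sees, here is a deliberately naive, library-free sketch that pulls the names in front of each Do out of a content stream given as text. The class XObjectNameScan is a hypothetical helper for illustration only. It splits on whitespace, so it ignores string literals, inline images and comments, and it assumes every token is separated by whitespace. A real implementation should rely on iText's content stream parsing, as the operator above does.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: naive scan of a decoded content stream for "/Name Do" pairs. */
final class XObjectNameScan {
    private XObjectNameScan() {
    }

    /** Returns the distinct xobject names used with "Do", in order of first use, without the leading slash. */
    static List<String> usedXObjectNames(String contentStream) {
        List<String> names = new ArrayList<>();
        String[] tokens = contentStream.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            // "Do" takes exactly one operand: the name token directly before it.
            if (tokens[i].equals("Do") && tokens[i - 1].startsWith("/")) {
                String name = tokens[i - 1].substring(1);
                if (!names.contains(name)) {
                    names.add(name);
                }
            }
        }
        return names;
    }
}
```

The names collected this way are the keys to keep when building the reduced /XObject dictionary for the page.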
