iTextSharp: size of each split page equals the size of the whole file

Question

Here is how I split a large PDF (144 MB):

    public int SplitAndSave(string inputPath, string outputPath)
    {
        FileInfo file = new FileInfo(inputPath);
        string name = file.Name.Substring(0, file.Name.LastIndexOf("."));

        using (PdfReader reader = new PdfReader(inputPath))
        {

            for (int pagenumber = 1; pagenumber <= reader.NumberOfPages; pagenumber++)
            {
                string filename = pagenumber.ToString() + ".pdf";

                Document document = new Document();
                PdfCopy copy = new PdfCopy(document, new FileStream(outputPath + "\\" + filename, FileMode.Create));

                document.Open();

                copy.AddPage(copy.GetImportedPage(reader, pagenumber));

                document.Close();
            }
            return reader.NumberOfPages;
        }

    }

For most PDFs (small ones, and I guess in an older format) everything works fine. But for a big one (which perhaps uses something like reference streams for better compression), each split file opens as a single page, yet its file size equals the size of the whole PDF. What can I do?

Recommended answer

In the case of your document Top_Gear_Magazine_2012_09.pdf, the reason is indeed the one I mentioned: all pages refer to object 2 0 R as their /Resources, and the dictionary in 2 0 obj in turn references all images in the PDF.
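
The effect of such a shared /Resources dictionary can be illustrated with a small library-free model. This is only a sketch, assuming that copying a page also copies every object reachable from it. Plain integers stand in for indirect objects, and all object numbers other than 2 are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Library-free model: object numbers stand in for PDF indirect objects.
class SharedResourcesModel {
    // Collects every object reachable from 'start' by following references.
    static Set<Integer> reachable(int start, Map<Integer, List<Integer>> refs) {
        Set<Integer> seen = new TreeSet<>();
        Deque<Integer> todo = new ArrayDeque<>();
        todo.push(start);
        while (!todo.isEmpty()) {
            int obj = todo.pop();
            if (seen.add(obj)) {
                for (int next : refs.getOrDefault(obj, Collections.emptyList())) {
                    todo.push(next);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> refs = new HashMap<>();
        refs.put(10, Arrays.asList(2));      // hypothetical page 1 -> shared /Resources (2 0 obj)
        refs.put(11, Arrays.asList(2));      // hypothetical page 2 -> the same /Resources
        refs.put(2, Arrays.asList(20, 21));  // /Resources -> images of both pages
        // Page 1 alone already reaches both images: prints [2, 10, 20, 21]
        System.out.println(reachable(10, refs));
    }
}
```

In this model every page reaches every image through object 2, which matches the observation that each single-page file is about as large as the original.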

To split that document into partial documents containing only the images required, you should preprocess the document by first finding out which images belong to which pages and then creating individual /Resources dictionaries for all pages.

As you already use iText in this context, you can also use it to find out which images are required for which pages. Use the iText parser package to initially parse the PDF page by page using a RenderListener implementation whose RenderImage method simply remembers which image objects are used on the current page. (As a special twist, iText hides the name of the image XObject in question; you get the indirect object, though, and can query its object and generation number which suffices for the next step.)

In a second step, you open the document in a PdfStamper and iterate over the pages. For each page you retrieve the /Resources dictionary and copy it, but you copy only those XObject references that point to one of the image objects whose object number and generation you remembered for the respective page during the first step. Finally, you set the reduced copy as the /Resources dictionary of the page in question.
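
The filtering in this second step can be sketched without iText. In the model below, a plain map from XObject name to a reference string such as "20 0 R" stands in for the shared XObject dictionary, and a set of reference strings stands in for what was remembered for one page. The class name, method name and sample values are hypothetical, and real code would work on iText's dictionary and reference objects instead.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Library-free model of reducing a shared XObject dictionary for one page.
class ResourceFilterModel {
    // Copies only the entries whose reference was recorded as used on the page.
    static Map<String, String> reduceXObjects(Map<String, String> shared, Set<String> usedRefs) {
        Map<String, String> reduced = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : shared.entrySet()) {
            if (usedRefs.contains(entry.getValue())) {
                reduced.put(entry.getKey(), entry.getValue());
            }
        }
        return reduced;
    }

    public static void main(String[] args) {
        // Shared dictionary: XObject name -> "object generation R" (hypothetical values).
        Map<String, String> shared = new LinkedHashMap<>();
        shared.put("Im1", "20 0 R");
        shared.put("Im2", "21 0 R");
        shared.put("Im3", "22 0 R");
        // References remembered for page 1 during the first step.
        Set<String> usedOnPage1 = new HashSet<>(Arrays.asList("20 0 R"));
        // Prints {Im1=20 0 R}
        System.out.println(reduceXObjects(shared, usedOnPage1));
    }
}
```

The shared map is left untouched and a fresh reduced map is built per page, mirroring the advice to copy the dictionary rather than edit the one all pages still point to.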

The resulting PDF should split just fine.

PS: A very similar issue recently came up on the iText mailing list. In that thread the solution recipe given here has been improved. To get around the difficulties caused by iText hiding the XObject name, I would now propose intervening before the name is lost by using a different ContentOperator for "Do". Here is the Java version:

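import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import com.itextpdf.text.pdf.PdfLiteral;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.parser.ContentOperator;
import com.itextpdf.text.pdf.parser.PdfContentStreamProcessor;

// Replacement for the "Do" operator: records each XObject name instead of rendering it.
class Do implements ContentOperator
{
    public void invoke(PdfContentStreamProcessor processor, PdfLiteral operator, ArrayList<PdfObject> operands) throws IOException
    {
        PdfName xobjectName = (PdfName)operands.get(0);
        names.add(xobjectName);
    }

    final List<PdfName> names = new ArrayList<PdfName>();
}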

This content operator simply collects the names of the used xobjects, i.e. the xobject resources to keep for the given page.
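
To make concrete what such an operator sees, here is a deliberately naive, library-free sketch that scans a page content stream given as text and collects the name operand in front of each Do operator. It is not how iText parses content: it splits on whitespace only and would be fooled by string literals or inline image data that happen to contain "Do". The class name, method name and the sample stream are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Naive stand-in for a content stream processor: finds "/Name Do" pairs.
class DoOperatorScan {
    // Returns the XObject names (without the leading slash) invoked via Do.
    static List<String> xobjectNames(String content) {
        List<String> names = new ArrayList<>();
        String[] tokens = content.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            // In a content stream the operand precedes its operator.
            if (tokens[i].equals("Do") && tokens[i - 1].startsWith("/")) {
                names.add(tokens[i - 1].substring(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String content = "q 200 0 0 300 50 400 cm /Im1 Do Q q 100 0 0 100 0 0 cm /Im7 Do Q";
        // Prints [Im1, Im7]
        System.out.println(xobjectNames(content));
    }
}
```

The collected names are the keys to keep in the page's XObject dictionary. Working with names avoids the detour through object and generation numbers.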
