Issue when trying to remove inline images from PDF with iTextSharp

Question

I recently discovered iTextSharp.

I was investigating a performance issue with the rendering of PDF documents, and Bruno Lowagie (author of iText) kindly explained why I was running into it: the cause was the number of inline images in my PDF documents. He also explained the basics of removing those inline images. (My aim is to possibly show a preview of the document, with a clear notice that it is not the actual document and that the actual one could be very slow to open. I fully understand that what I am trying to do is far from robust or safe. The problem must be solved at another level, e.g. when the documents are generated.)

Unfortunately, I have not succeeded in implementing the clean-up on my own. Here is the code I currently have (inspired by various samples found on Stack Overflow):

PdfReader pdfReader = new PdfReader(filename);
try
{  
    //pdfReader.RemoveUnusedObjects();

    var cleanfilename = filename.Replace(".pdf", ".clean.pdf");
    if (File.Exists(cleanfilename))
        File.Delete(cleanfilename);

    using (var file = new FileStream(cleanfilename, FileMode.Create))
    {
        var pdfstamper = new PdfStamper(pdfReader, file);

        for (var page = 1; page <= pdfReader.NumberOfPages; page++)
        {    
            PdfDictionary pageDict = pdfReader.GetPageN(page);
            PdfObject pageObj = pageDict.GetDirectObject(PdfName.CONTENTS);
            if (pageObj.IsStream())
            {
                CleanStream(pageObj);
            }
            else if (pageObj.IsArray())
            {
                PdfArray pageArray = pageDict.GetAsArray(PdfName.CONTENTS);

                for (int j = 0; j < pageArray.Size; j++)
                {
                    PdfIndirectReference arrayElement = (PdfIndirectReference)pageArray[j];
                    pageObj = pdfReader.GetPdfObject(arrayElement.Number);
                    if (pageObj.IsStream())
                    {
                        CleanStream(pageObj);
                    }
                }
            }
        }

        pdfstamper.Close();
    }
}
catch (Exception ex)
{
    MessageBox.Show("Error: " + ex.Message, "Error");
}
finally
{
    pdfReader.Close();
}

Regex regEx = new Regex("\\nBI.*?\\nEI", RegexOptions.Compiled);

private void CleanStream(PdfObject obj)
{
    var stream = (PRStream)obj;
    var data = PdfReader.GetStreamBytes(stream);

    var currentContent = Encoding.ASCII.GetString(data);    
    var newContent = regEx.Replace(currentContent, "");
    var newData = Encoding.ASCII.GetBytes(newContent);

    stream.SetData(newData);
}

It works fine on PDFs without inline images, but text disappears from pages that do contain inline images.

I thought the problem was with the replacement, but as far as I can tell that is not the case. With the following code (a kind of passthrough), the output document is fine:

private void CleanStream(PdfObject obj)
{
    var stream = (PRStream)obj;
    var data = PdfReader.GetStreamBytes(stream);

    stream.SetData(data);
}

With the following code, however, which in theory does not change any byte (or does it?), the output document no longer displays correctly (some content seems not to be rendered):

private void CleanStream(PdfObject obj)
{
    var stream = (PRStream)obj;
    var data = PdfReader.GetStreamBytes(stream);

    var currentContent = Encoding.ASCII.GetString(data);    
    var newData = Encoding.ASCII.GetBytes(currentContent);

    stream.SetData(newData);
}

It looks like converting the byte array into a string and back into an array is not a "transparent" operation.

I really don't get it. On the other hand, I know I am a real beginner when it comes to PDF. What am I missing?

This is not at all critical (I don't really care if I can't succeed in removing those inline images). But I am now really curious about understanding what's happening :D

Here is a PDF sample: https://drive.google.com/file/d/0Byqch0ZyIb5DWDdmSTJ3SDMxMW8/edit?usp=sharing

Recommended answer

As you've found out and as mkl and I pointed out in the comments, it's not a good idea to manipulate a content stream without taking a look at every operator in the stream. You really need to parse the syntax and interpret every single operator and every single operand.
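One likely reason the string round trip in the question alters the stream (offered as an explanation of the general mechanism, not a confirmed diagnosis of the sample file): a 7-bit ASCII charset cannot represent bytes above 0x7F, and content streams routinely contain such bytes in inline image data and encoded text strings. The sketch below is written in Java to match the iText code in this answer; the class name `RoundTripDemo` and the sample bytes are made up for illustration. .NET's `Encoding.ASCII` behaves comparably, substituting `?` for bytes it cannot decode.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripDemo {
    // Decode bytes to a String and encode them back with the same charset.
    public static byte[] roundTrip(byte[] data, Charset charset) {
        return new String(data, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // Hypothetical content-stream fragment: ASCII operators around two
        // bytes above 0x7F, as binary image data or encoded strings may contain.
        byte[] data = {'B', 'I', '\n', (byte) 0x89, (byte) 0xFF, '\n', 'E', 'I'};

        byte[] viaAscii = roundTrip(data, StandardCharsets.US_ASCII);
        byte[] viaLatin1 = roundTrip(data, StandardCharsets.ISO_8859_1);

        // US-ASCII replaces the two high bytes, so the arrays differ.
        System.out.println("US-ASCII preserved bytes:   " + Arrays.equals(data, viaAscii));
        // ISO-8859-1 maps every byte value to one character and back.
        System.out.println("ISO-8859-1 preserved bytes: " + Arrays.equals(data, viaLatin1));
    }
}
```

A single-byte charset such as ISO-8859-1 keeps the round trip lossless, but that only removes the byte corruption; it does not make regex-based editing of a content stream safe.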

Please take a look at the OCG removing functionality in the extra jar that is provided with iText in the com.itextpdf.text.pdf.ocg/ package.

In the OCGParser class, we define all possible operators:

protected void populateOperators() {
    if (operators != null)
        return;
    operators = new HashMap<String, PdfOperator>();
    operators.put(DEFAULTOPERATOR, new CopyContentOperator());
    PathConstructionOrPaintingOperator opConstructionPainting = new PathConstructionOrPaintingOperator();
    operators.put("m", opConstructionPainting);
    operators.put("l", opConstructionPainting);
    operators.put("c", opConstructionPainting);
    operators.put("v", opConstructionPainting);
    operators.put("y", opConstructionPainting);
    operators.put("h", opConstructionPainting);
    operators.put("re", opConstructionPainting);
    operators.put("S", opConstructionPainting);
    operators.put("s", opConstructionPainting);
    operators.put("f", opConstructionPainting);
    operators.put("F", opConstructionPainting);
    operators.put("f*", opConstructionPainting);
    operators.put("B", opConstructionPainting);
    operators.put("B*", opConstructionPainting);
    operators.put("b", opConstructionPainting);
    operators.put("b*", opConstructionPainting);
    operators.put("n", opConstructionPainting);
    operators.put("W", opConstructionPainting);
    operators.put("W*", opConstructionPainting);
    GraphicsOperator graphics = new GraphicsOperator();
    operators.put("q", graphics);
    operators.put("Q", graphics);
    operators.put("w", graphics);
    operators.put("J", graphics);
    operators.put("j", graphics);
    operators.put("M", graphics);
    operators.put("d", graphics);
    operators.put("ri", graphics);
    operators.put("i", graphics);
    operators.put("gs", graphics);
    operators.put("cm", graphics);
    operators.put("g", graphics);
    operators.put("G", graphics);
    operators.put("rg", graphics);
    operators.put("RG", graphics);
    operators.put("k", graphics);
    operators.put("K", graphics);
    operators.put("cs", graphics);
    operators.put("CS", graphics);
    operators.put("sc", graphics);
    operators.put("SC", graphics);
    operators.put("scn", graphics);
    operators.put("SCN", graphics);
    operators.put("sh", graphics);
    XObjectOperator xObject = new XObjectOperator();
    operators.put("Do", xObject);
    InlineImageOperator inlineImage = new InlineImageOperator();
    operators.put("BI", inlineImage);
    operators.put("EI", inlineImage);
    TextOperator text = new TextOperator();
    operators.put("BT", text);
    operators.put("ID", text);
    operators.put("ET", text);
    operators.put("Tc", text);
    operators.put("Tw", text);
    operators.put("Tz", text);
    operators.put("TL", text);
    operators.put("Tf", text);
    operators.put("Tr", text);
    operators.put("Ts", text);
    operators.put("Td", text);
    operators.put("TD", text);
    operators.put("Tm", text);
    operators.put("T*", text);
    operators.put("Tj", text);
    operators.put("'", text);
    operators.put("\"", text);
    operators.put("TJ", text);
    MarkedContentOperator markedContent = new MarkedContentOperator();
    operators.put("BMC", markedContent);
    operators.put("BDC", markedContent);
    operators.put("EMC", markedContent);
}

The parse() method will look at all the content streams, including the content streams of Form XObjects (which you are overlooking if I understand your code correctly).

In the process() method, we make copies of every operator and all its operands, unless some condition tells us that part of the syntax needs to be removed.

You should adapt this code so that all operators are copied except those that involve inline images. Your approach was a brute-force approach that was bound to damage more PDFs than it would ever fix.
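The "copy everything except inline images" idea can be sketched without iText as a handler map with a default copy handler, mirroring the shape of `populateOperators()`. This is only an illustration under loud assumptions: `OperatorFilterSketch`, `Handler`, and `filter` are hypothetical names, instructions are modelled as lists of strings with the operator last, and the whole inline image (everything from `BI` to `EI`) is assumed to have been read by a real parser and delivered as one instruction with operator `BI`. It is not the OCGParser implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OperatorFilterSketch {
    // Hypothetical handler: decides what to write to the output for one instruction.
    interface Handler {
        void handle(List<String> instruction, List<List<String>> out);
    }

    static final String DEFAULT = "DefaultOperator";
    static final Map<String, Handler> HANDLERS = new HashMap<>();
    static {
        // Default: copy the operator and all its operands unchanged.
        HANDLERS.put(DEFAULT, (instruction, out) -> out.add(instruction));
        // Inline image: write nothing, which removes it from the output.
        HANDLERS.put("BI", (instruction, out) -> { });
    }

    // Each instruction is its operands followed by the operator, e.g. ["(Hi)", "Tj"].
    public static List<List<String>> filter(List<List<String>> instructions) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> instruction : instructions) {
            String operator = instruction.get(instruction.size() - 1);
            Handler handler = HANDLERS.getOrDefault(operator, HANDLERS.get(DEFAULT));
            handler.handle(instruction, out);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> page = Arrays.asList(
            Arrays.asList("BT"),
            Arrays.asList("/F1", "12", "Tf"),
            Arrays.asList("(Hello)", "Tj"),
            Arrays.asList("ET"),
            Arrays.asList("<inline image>", "BI"), // placeholder for dictionary + data
            Arrays.asList("Q"));
        for (List<String> instruction : filter(page)) {
            System.out.println(String.join(" ", instruction));
        }
    }
}
```

The point of the design is that the decision to drop something is made per parsed instruction, after the parser has already determined where the binary image data ends, instead of by searching raw bytes for `BI` and `EI`.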
