Can I remove text objects from an existing PDF and output to a new PDF using iTextSharp?

Question

This question is another version of my old question: I want to get all objects except text object as an image from PDF using iTextSharp

I am developing a program that converts PDF to PPTX, for specific reasons using iTextSharp. So far I can extract all text objects and image objects together with their locations, but I am finding it difficult to get the vector drawings without the text (tables, for example). Actually, it would be better if I could get them as images. My plan is to merge all objects except text objects into a background image and then place the text objects at their proper locations. I tried to find similar questions here but have had no luck so far. If anyone knows how to do this particular job, please answer. Thanks.

I have been reading many related questions and discussions and decided to ask another version here. I have two plans left as follows. I would really appreciate if iText developers/experts could guide me.

public class MyLocationTextExtractionStrategy: IExtRenderListener, ITextExtractionStrategy,IElementListener
{
    //Text 
    public List<RectAndText> myPoints_txt = new List<RectAndText>();
    public List<RectAndImage> myPoints_img = new List<RectAndImage>();
    public FieldInfo GsField = typeof(TextRenderInfo).GetField("gs", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
    public FieldInfo MarkedContentInfosField = typeof(TextRenderInfo).GetField("markedContentInfos", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
    public FieldInfo MarkedContentInfoTagField = typeof(MarkedContentInfo).GetField("tag", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
    PdfName EMBEDDED_DOCUMENT = new PdfName("EmbeddedDocument");

    //Image 
    public List<byte[]> Images = new List<byte[]>();
    public List<string> ImageNames = new List<string>();

    public bool Add(IElement element)
    {
        // Nothing to do here; the method only exists to satisfy IElementListener.
        return true;
    }

    public void BeginTextBlock()
    {

    }

    public void ClipPath(int rule)
    {

    }

    public void EndTextBlock()
    {

    }

    public string GetResultantText()
    {
        return "";
    }

    public void ModifyPath(PathConstructionRenderInfo renderInfo)
    {
        // ****************************************
        // I think this point I can get info on Path
        // ****************************************
    }

    public void RenderImage(ImageRenderInfo renderInfo)
    {

        try
        {
            // GetImage() may throw for unsupported image formats, so call it inside the try block.
            PdfImageObject image = renderInfo.GetImage();
            if (image == null) return;

            ImageNames.Add(string.Format(
              "Image{0}.{1}", renderInfo.GetRef().Number, image.GetFileType()
            ));

            // Store the image bytes.
            Images.Add(image.GetImageAsBytes());
            Matrix matrix = renderInfo.GetImageCTM();

            this.myPoints_img.Add(new RectAndImage(matrix[Matrix.I31], matrix[Matrix.I32], matrix[Matrix.I11], matrix[Matrix.I12], Images));
        }
        catch (Exception)
        {
            // Skip images that cannot be read or decoded.
        }
    }



    public iTextSharp.text.pdf.parser.Path RenderPath(PathPaintingRenderInfo renderInfo)
    {
        // ****************************************
        // I think this point I can get info on Path
        // ****************************************
        return null;
    }

    public  void RenderText(TextRenderInfo renderInfo)
    {

        DocumentFont _font = renderInfo.GetFont();

        LineSegment descentLine = renderInfo.GetDescentLine();
        LineSegment ascentLine = renderInfo.GetAscentLine();
        float x0 = descentLine.GetStartPoint()[0];
        float x1 = ascentLine.GetEndPoint()[0];
        float y0 = descentLine.GetStartPoint()[1];
        float y1 = ascentLine.GetEndPoint()[1];

        Rectangle rect = new Rectangle(x0,y0,x1,y1);
        GraphicsState gs = (GraphicsState)GsField.GetValue(renderInfo);
        float fontSize = gs.FontSize;
        String font_color = gs.FillColor.ToString().Substring(14,6);

        IList<MarkedContentInfo> markedContentInfos = (IList<MarkedContentInfo>)MarkedContentInfosField.GetValue(renderInfo);

        if (markedContentInfos != null && markedContentInfos.Count > 0)
        {
            foreach (MarkedContentInfo info in markedContentInfos)
            {
                if (EMBEDDED_DOCUMENT.Equals(MarkedContentInfoTagField.GetValue(info)))
                    return;
            }
        }

        this.myPoints_txt.Add(new RectAndText(rect, renderInfo.GetText(), fontSize,renderInfo.GetFont().PostscriptFontName, font_color));
    } 
}

New question

1) Can I remove all text objects from a PDF and output it to a new one? If yes, I can get all pages of the output as images and use them as backgrounds of a PPTX. Then I can finally write texts (already retrieved using ITextExtractionStrategy using the above code)

2) If 1) is not possible, I am going to retrieve all path information from the original PDF (using IExtRenderListener) and draw it on a new Bitmap. Finally I can use that as a background and put the texts/images on top of it. In this case, is using ModifyPath and RenderPath the right way to retrieve the path information?

I know this might seem to have multiple questions, but I think it's better to write all in a single thread to help understanding. I would really appreciate any tips or comments on my thoughts.

I believe @mkl, @Amine, @Bruno Lowagie could help me. Thanks in advance.

Recommended answer

In my answer to your old question I explained the meanings of those IExtRenderListener callback methods, so essentially the remaining question here is

1) Can I remove all text objects from a PDF and output it to a new one?

You can, by making use of the generic content stream editor class PdfContentStreamEditor from this answer. Simply derive from it like this

class TextRemover : PdfContentStreamEditor
{
    protected override void Write(PdfContentStreamProcessor processor, PdfLiteral operatorLit, List<PdfObject> operands)
    {
        if (!TEXT_SHOWING_OPERATORS.Contains(operatorLit.ToString()))
        {
            base.Write(processor, operatorLit, operands);
        }
    }
    List<string> TEXT_SHOWING_OPERATORS = new List<string> { "Tj", "'", "\"", "TJ" };
}

and use it like this

using (PdfReader pdfReader = new PdfReader(source))
using (PdfStamper pdfStamper = new PdfStamper(pdfReader, new FileStream(dest, FileMode.Create, FileAccess.Write), (char)0, true))
{
    pdfStamper.RotateContents = false;
    PdfContentStreamEditor editor = new TextRemover();

    for (int i = 1; i <= pdfReader.NumberOfPages; i++)
    {
        editor.EditPage(pdfStamper, i);
    }
}

will remove all text drawing instructions from the immediate page content streams, e.g. for the example PDF I used

it creates the following output (the before and after screenshots are not preserved in this copy).
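To make the filtering rule concrete without needing a PDF at hand, here is a minimal, library-free sketch. It models a content stream as a list of instructions whose last token is the operator, which is only a simplification of what PdfContentStreamEditor hands to Write. The class name TextOperatorFilterDemo and the sample instructions are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model only: real content streams are parsed by iTextSharp, not split into strings.
public static class TextOperatorFilterDemo
{
    // The four PDF operators that actually paint glyphs.
    private static readonly HashSet<string> TextShowingOperators =
        new HashSet<string> { "Tj", "'", "\"", "TJ" };

    // Each instruction is a token array: operands first, operator last.
    public static List<string[]> RemoveTextShowing(IEnumerable<string[]> instructions)
    {
        return instructions
            .Where(ins => !TextShowingOperators.Contains(ins[ins.Length - 1]))
            .ToList();
    }

    // Hypothetical page content: one text block followed by a stroked rectangle.
    public static List<string[]> SamplePage()
    {
        return new List<string[]>
        {
            new[] { "BT" },
            new[] { "/F1", "12", "Tf" },
            new[] { "72", "720", "Td" },
            new[] { "(Hello)", "Tj" },
            new[] { "ET" },
            new[] { "72", "700", "200", "50", "re" },
            new[] { "S" },
        };
    }

    public static void Main()
    {
        foreach (string[] ins in RemoveTextShowing(SamplePage()))
        {
            Console.WriteLine(string.Join(" ", ins));
        }
    }
}
```

Running Main prints every instruction except `(Hello) Tj`. The text state operators (BT, Tf, Td, ET) and the rectangle path are kept, which is why the vector graphics survive while the glyphs disappear.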

Beware, as said above, only the immediate page content streams are changed. For a full solution one has to apply the TextRemover also to the XObjects and Patterns of the pages, and recursively so.
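As a possible starting point for that recursive step, the following sketch walks the resources of a page and visits every form XObject, including nested ones. It relies only on the low-level dictionary API of iTextSharp 5 (GetPageN, GetAsDict, GetAsStream, GetAsName, GetAsIndirectObject). The class FormXObjectWalker and its visit callback are hypothetical names, and how TextRemover is run on each visited stream depends on the PdfContentStreamEditor implementation, which is not shown here. Patterns would need an analogous walk over the /Pattern resource entries.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;

// Hypothetical helper, not part of iTextSharp.
public static class FormXObjectWalker
{
    // Visits all form XObjects reachable from the resources of the given page.
    public static void VisitPage(PdfReader reader, int pageNumber, Action<PdfStream> visit)
    {
        PdfDictionary page = reader.GetPageN(pageNumber);
        Walk(page.GetAsDict(PdfName.RESOURCES), visit, new HashSet<int>());
    }

    // Recursively walks /XObject entries; 'seen' holds object numbers already visited
    // so that shared or self-referencing XObjects are processed only once.
    public static void Walk(PdfDictionary resources, Action<PdfStream> visit, HashSet<int> seen)
    {
        if (resources == null) return;
        PdfDictionary xObjects = resources.GetAsDict(PdfName.XOBJECT);
        if (xObjects == null) return;

        foreach (PdfName name in xObjects.Keys)
        {
            PdfStream stream = xObjects.GetAsStream(name);
            if (stream == null) continue;

            // Only form XObjects carry their own content stream; skip images.
            if (!PdfName.FORM.Equals(stream.GetAsName(PdfName.SUBTYPE))) continue;

            PdfIndirectReference reference = xObjects.GetAsIndirectObject(name);
            if (reference != null && !seen.Add(reference.Number)) continue;

            visit(stream);
            Walk(stream.GetAsDict(PdfName.RESOURCES), visit, seen);
        }
    }
}
```

A caller would invoke FormXObjectWalker.VisitPage(pdfReader, i, stream => { ... }) inside the page loop and edit each stream in the callback. When the stamper runs in append mode, as in the usage above, changed objects typically also have to be registered with pdfStamper.MarkUsed so that they are written to the output.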
