How to get a particular paragraph in a PDF file using iTextSharp in C#?

Question

I am using iTextSharp in my C# WinForms application. I want to get a particular paragraph from a PDF file. Is this possible in iTextSharp?

Recommended answer

Yes and no.

First the no. The PDF format has no concept of text structures such as paragraphs, sentences or even words; it just has runs of text. The fact that two runs of text sit near each other, so that we think of them as structured, is a human thing. When you see something that looks like a three-line paragraph in a PDF, in reality the program that generated the PDF did the job of chopping the text into three unrelated lines and then drew each line at specific x,y coordinates. Even worse, depending on what the designer wants, each line of text could be composed of smaller runs that could be words or even just single characters. So it might be draw "the cat in the hat" at 10,10, or it might be draw "t" at 10,10, then draw "h" at 14,10, then draw "e" at 18,10 and so on. This is actually pretty common with PDFs from heavily designed programs like Adobe InDesign.

Now the yes. Actually it's a maybe. If you are willing to put in a little work, you might be able to get iTextSharp to do what you are looking for. There is a class called PdfTextExtractor with a method called GetTextFromPage that gets all of the raw text from a page. The last parameter to this method is an object that implements the ITextExtractionStrategy interface. If you create your own class that implements this interface, you can process each run of text and perform your own logic.

In this interface there's a method called RenderText which gets called for every run of text. You'll be given an iTextSharp.text.pdf.parser.TextRenderInfo object from which you can get the run's raw text as well as other things such as the coordinates it starts at, the current font, etc. Since a visual line of text can be composed of multiple runs, you can use this method to compare the run's baseline (its starting y coordinate) to that of the previous run to determine whether it is part of the same visual line.

Below is an example of an implementation of that interface:

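    using System;
    using System.Collections.Generic;
    using System.Text;
    using iTextSharp.text.pdf.parser;

    public class TextAsParagraphsExtractionStrategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy {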
        //Text buffer
        private StringBuilder result = new StringBuilder();

        //Store last used properties
        private Vector lastBaseLine;

        //Buffer of lines of text and their Y coordinates. NOTE, these should be exposed as properties instead of fields but are left as is for simplicity's sake
        public List<string> strings = new List<String>();
        public List<float> baselines = new List<float>();

        //This is called whenever a run of text is encountered
        public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo) {
            //This code assumes that if the baseline's Y coordinate changes then we're on a new line
            Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();

            //See if the baseline has changed
            if ((this.lastBaseLine != null) && (curBaseline[Vector.I2] != lastBaseLine[Vector.I2])) {
                //See if we have text and not just whitespace
                if ((!String.IsNullOrWhiteSpace(this.result.ToString()))) {
                    //Mark the previous line as done by adding it to our buffers
                    this.baselines.Add(this.lastBaseLine[Vector.I2]);
                    this.strings.Add(this.result.ToString());
                }
                //Reset our "line" buffer
                this.result.Clear();
            }

            //Append the current text to our line buffer
            this.result.Append(renderInfo.GetText());

            //Reset the last used line
            this.lastBaseLine = curBaseline;
        }

        public string GetResultantText() {
            //One last time, see if there's anything left in the buffer
            if ((!String.IsNullOrWhiteSpace(this.result.ToString()))) {
                this.baselines.Add(this.lastBaseLine[Vector.I2]);
                this.strings.Add(this.result.ToString());
            }
            //We're not going to use this method to return a string; instead, callers should inspect this class's strings and baselines fields afterwards.
            return null;
        }

        //Not needed, part of interface contract
        public void BeginTextBlock() { }
        public void EndTextBlock() { }
        public void RenderImage(ImageRenderInfo renderInfo) { }
    }

To call it we'd do:

        PdfReader reader = new PdfReader(workingFile);
        TextAsParagraphsExtractionStrategy S = new TextAsParagraphsExtractionStrategy();
        iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1, S);
        for (int i = 0; i < S.strings.Count; i++) {
            Console.WriteLine("Line {0,-5}: {1}", S.baselines[i], S.strings[i]);
        }

We're actually throwing away the return value of GetTextFromPage and instead inspecting the worker's baselines and strings list fields. The next step would be to compare the baselines and try to determine how to group lines together into paragraphs.
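
One possible way to take that next step is sketched here. It assumes that a paragraph break shows up as a baseline gap noticeably larger than the typical line gap, which is only a heuristic. The `ParagraphGrouper` class, its `Group` method and the `factor` threshold are hypothetical names for illustration and are not part of iTextSharp. The inputs are parallel lists shaped like the `strings` and `baselines` fields of the strategy class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParagraphGrouper
{
    // Hypothetical helper, not part of iTextSharp.
    // lines and baselines are parallel lists in reading order
    // (PDF y coordinates normally decrease going down the page).
    // A gap larger than factor times the median gap starts a new paragraph.
    public static List<string> Group(IList<string> lines, IList<float> baselines, float factor = 1.5f)
    {
        if (lines.Count != baselines.Count)
            throw new ArgumentException("lines and baselines must have the same length");

        var paragraphs = new List<string>();
        if (lines.Count == 0) return paragraphs;

        // Distance between each pair of consecutive baselines
        var gaps = new List<float>();
        for (int i = 1; i < baselines.Count; i++)
            gaps.Add(Math.Abs(baselines[i - 1] - baselines[i]));

        // Use the median gap as the "typical" line spacing
        float typical = 0f;
        if (gaps.Count > 0)
        {
            var sorted = gaps.OrderBy(g => g).ToList();
            typical = sorted[sorted.Count / 2];
        }

        string current = lines[0].Trim();
        for (int i = 1; i < lines.Count; i++)
        {
            if (gaps[i - 1] > typical * factor)
            {
                // Unusually large gap: close the current paragraph
                paragraphs.Add(current);
                current = lines[i].Trim();
            }
            else
            {
                // Normal line spacing: same paragraph
                current += " " + lines[i].Trim();
            }
        }
        paragraphs.Add(current);
        return paragraphs;
    }
}
```

It could be called as `ParagraphGrouper.Group(S.strings, S.baselines)` after the extraction loop. The `factor` value of 1.5 is a guess and would likely need tuning per document.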

I should note that not all paragraphs have spacing that differs from the spacing between individual lines of text. For instance, if you run the PDF created below through the code above, you'll see that every line of text is 18 points away from the next, regardless of whether the line starts a new paragraph. If you open the PDF it creates in Acrobat and cover everything but the first letter of each line, you'll see that your eye can't even tell the difference between a line break and a paragraph break.

        using (FileStream fs = new FileStream(workingFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
            using (Document doc = new Document(PageSize.LETTER)) {
                using (PdfWriter writer = PdfWriter.GetInstance(doc, fs)) {
                    doc.Open();
                    doc.Add(new Paragraph("Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas porttitor congue massa. Fusce posuere, magna sed pulvinar ultricies, purus lectus malesuada libero, sit amet commodo magna eros quis urna."));
                    doc.Add(new Paragraph("This"));
                    doc.Add(new Paragraph("Is"));
                    doc.Add(new Paragraph("A"));
                    doc.Add(new Paragraph("Test"));
                    doc.Close();
                }
            }
        }
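
When every baseline is the same distance apart, as in the document generated above, spacing alone cannot separate paragraphs. A small check along these lines could detect that case before attempting to group lines by their gaps. The `SpacingCheck` class, the `HasUniformSpacing` method and the `tolerance` value are assumptions made for illustration, not iTextSharp APIs.

```csharp
using System;
using System.Collections.Generic;

public static class SpacingCheck
{
    // Hypothetical helper, not part of iTextSharp.
    // Returns true when all consecutive baseline gaps are equal
    // to the first gap within the given tolerance (in points).
    public static bool HasUniformSpacing(IList<float> baselines, float tolerance = 0.5f)
    {
        // Fewer than two gaps means there is nothing to compare
        if (baselines.Count < 3) return true;

        float first = Math.Abs(baselines[0] - baselines[1]);
        for (int i = 2; i < baselines.Count; i++)
        {
            float gap = Math.Abs(baselines[i - 1] - baselines[i]);
            if (Math.Abs(gap - first) > tolerance) return false;
        }
        return true;
    }
}
```

If this returns true for a page's `baselines` list, a gap-based heuristic has nothing to work with, and some other cue would be needed, for example the x coordinate at which each line starts.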
