How does TextRenderInfo work in iTextSharp?

Question

I found some code online that reports font sizes, but I don't understand how TextRenderInfo reads text. When I call renderInfo.GetText(), it returns a seemingly random number of characters: sometimes 3, sometimes 2, sometimes more or fewer. How does renderInfo read the data?

My goal is to separate every line and paragraph in the PDF and read each one's properties individually, such as font size and font style. Any suggestions are welcome.

using System;    
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using iTextSharp.text.pdf.parser;
using iTextSharp.text.pdf;

namespace FontSizeDig1
{
class Program
{
    static void Main(string[] args)
    {
        // reader ==>                 http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/PdfReader.html#pdfVersion
        PdfReader reader = new PdfReader(System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "document.pdf"));
        TextWithFontExtractionStategy S = new TextWithFontExtractionStategy();//strategy==> http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/parser/TextExtractionStrategy.html
    //    for (int i = 1; i <= reader.NumberOfPages; i++)
    //   {
            string F = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1/*i*/, S);
            //  PdfTextExtractor.GetTextFromPage(reader, 6, S) ==>>    http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/parser/PdfTextExtractor.html
            Console.WriteLine(F);


      //  }
        Console.ReadKey();
        //this.Close();
    }
}


public class TextWithFontExtractionStategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
{

    //HTML buffer
    private StringBuilder result = new StringBuilder();

    //Store last used properties
    private Vector lastBaseLine;
    private string lastFont;
    private float lastFontSize;

    //http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TextRenderInfo.html
    private enum TextRenderMode
    {
        FillText = 0,
        StrokeText = 1,
        FillThenStrokeText = 2,
        Invisible = 3,
        FillTextAndAddToPathForClipping = 4,
        StrokeTextAndAddToPathForClipping = 5,
        FillThenStrokeTextAndAddToPathForClipping = 6,
        AddTextToPathForClipping = 7
    }



    public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
    {
        string curFont = renderInfo.GetFont().PostscriptFontName;  // http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/parser/TextRenderInfo.html#getFont--
        //Check if faux bold is used (text is filled and then stroked)
        if (renderInfo.GetTextRenderMode() == (int)TextRenderMode.FillThenStrokeText)
        {
            curFont += "-Bold";
        }

        //This code assumes that if the baseline changes then we're on a new line.
        //The font size is approximated by the height between the baseline and the ascent line.
        Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
        Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
        iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
        Single curFontSize = rect.Height;



        //See if something has changed, either the baseline, the font or the font size
        if ((this.lastBaseLine == null) || (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]) || (curFontSize != lastFontSize) || (curFont != lastFont))
        {
            //if we've put down at least one span tag close it
            if ((this.lastBaseLine != null))
            {
                this.result.AppendLine("</span>");
            }
            //If the baseline has changed then insert a line break
            if ((this.lastBaseLine != null) && curBaseline[Vector.I2] != lastBaseLine[Vector.I2])
            {
                this.result.AppendLine("<br />");
            }
            //Create an HTML tag with appropriate styles
            this.result.AppendFormat("<span style=\"font-family:{0};font-size:{1}\">", curFont, curFontSize);
        }

        //Append the current text

        this.result.Append(renderInfo.GetText());
        Console.WriteLine("me=" + renderInfo.GetText());//by imtiaj 




        //Set currently used properties
        this.lastBaseLine = curBaseline;
        this.lastFontSize = curFontSize;
        this.lastFont = curFont;
    }

    public string GetResultantText()
    {
        //If we wrote anything then we'll always have a missing closing tag so close it here
        if (result.Length > 0)
        {
            result.Append("</span>");
        }
        return result.ToString();
    }

    //Not needed
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }


}

}

Recommended answer

Take a look at this PDF:

What do you see?

I see:

Hello World
Hello People

Now, let's parse this file. What do you expect?

You probably expect:

Hello World
Hello People

I don't.

That's where you and I differ, and that difference explains why you asked this question.

What do I expect?

Well, I'll start by looking inside the PDF, more specifically at the content stream of the first page:

I see 4 strings in the content stream: ld, Wor, llo, and He (in that order). I also see coordinates. Using those coordinates, I can compose what is shown:

Hello World
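
To make the "compose from coordinates" step concrete, here is a minimal sketch in plain C#. It does not use iTextSharp: the Chunk class, the LineComposer helper, the coordinates, and the 2-point space threshold are all invented for illustration, and the actual values in the PDF will differ. The idea is simply to sort the fragments from left to right and to insert a space where the horizontal gap between two neighbours is wide enough.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical chunk: a string plus the x-range it occupies on its baseline.
public class Chunk
{
    public string Text;
    public float StartX;
    public float EndX;

    public Chunk(string text, float startX, float endX)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
    }
}

public static class LineComposer
{
    // Sorts chunks left to right and inserts a space wherever the gap
    // between two neighbouring chunks is wider than spaceThreshold.
    public static string Compose(IEnumerable<Chunk> chunks, float spaceThreshold)
    {
        var sb = new StringBuilder();
        Chunk previous = null;
        foreach (var chunk in chunks.OrderBy(c => c.StartX))
        {
            if (previous != null && chunk.StartX - previous.EndX > spaceThreshold)
            {
                sb.Append(' ');
            }
            sb.Append(chunk.Text);
            previous = chunk;
        }
        return sb.ToString();
    }
}

public static class ComposeDemo
{
    public static void Main()
    {
        // Same order as in the content stream: ld, Wor, llo, He.
        // All coordinates are made up for this example.
        var chunks = new List<Chunk>
        {
            new Chunk("ld", 88.7f, 98.0f),
            new Chunk("Wor", 66.7f, 88.7f),
            new Chunk("llo", 51.3f, 63.3f),
            new Chunk("He", 36.0f, 51.3f),
        };
        Console.WriteLine(LineComposer.Compose(chunks, 2.0f)); // prints "Hello World"
    }
}
```

In a real extraction strategy, the start and end x-values would presumably come from the baseline of each TextRenderInfo, and a sensible threshold depends on the font size.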

I don't immediately see "Hello People" anywhere, but I do see a reference to a Form XObject named /Xf1, so let's examine that Form XObject:

Woohoo! I'm in luck: "Hello People" is stored in the document as a single string value. I don't need to look at the coordinates to compose the actual text that I can see with my human eyes.

Now for your question. You say "I need to know how the renderInfo is reading data", and now you know: by default, iText reads all the strings from a page in the order in which they occur: ld, Wor, llo, He, and Hello People. Each call to RenderText receives one of those strings, which is why GetText() returns 2 characters one time and 3 the next.

Depending on how the PDF was created, you can get output that is easy to read (Hello People) or output that is hard to read (ld, Wor, llo, He). iText comes with "strategies" that reorder all those snippets so that [ld, Wor, llo, He] is presented as [He, llo, Wor, ld], but detecting which of those parts belong to the same line, and which lines belong to the same paragraph, is something you will have to do yourself.
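
As a rough illustration of the line-detection work that is left to you, the sketch below groups fragments by the y-coordinate of their baseline and then orders each group from left to right. It is plain C# and independent of iTextSharp; PositionedChunk, LineGrouper, the coordinates, and the tolerance are assumptions made up for this example. In iTextSharp 5, the built-in LocationTextExtractionStrategy is the strategy that orders text by position, so this only shows the general idea, not how the library implements it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical chunk: text plus the start point of its baseline.
public class PositionedChunk
{
    public string Text;
    public float X;
    public float Y;

    public PositionedChunk(string text, float x, float y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class LineGrouper
{
    // Chunks whose baseline Y is within yTolerance of the first chunk of a
    // line are treated as part of that line. Lines are returned top to
    // bottom (PDF y grows upwards) and chunks within a line left to right.
    public static List<string> ToLines(IEnumerable<PositionedChunk> chunks, float yTolerance)
    {
        var lines = new List<List<PositionedChunk>>();
        foreach (var chunk in chunks.OrderByDescending(c => c.Y))
        {
            var last = lines.Count > 0 ? lines[lines.Count - 1] : null;
            if (last != null && Math.Abs(last[0].Y - chunk.Y) <= yTolerance)
            {
                last.Add(chunk);
            }
            else
            {
                lines.Add(new List<PositionedChunk> { chunk });
            }
        }
        return lines
            .Select(line => string.Concat(line.OrderBy(c => c.X).Select(c => c.Text)))
            .ToList();
    }
}

public static class GroupingDemo
{
    public static void Main()
    {
        // Fragments arrive in arbitrary order; all coordinates are invented.
        var chunks = new List<PositionedChunk>
        {
            new PositionedChunk("People", 72f, 760f),
            new PositionedChunk("World", 72f, 788f),
            new PositionedChunk("Hello ", 36f, 760f),
            new PositionedChunk("Hello ", 36f, 788f),
        };
        foreach (var line in LineGrouper.ToLines(chunks, 1.0f))
        {
            Console.WriteLine(line);
        }
        // prints:
        // Hello World
        // Hello People
    }
}
```

Grouping lines into paragraphs would need a further heuristic, for example comparing the vertical distance between consecutive lines, and that choice depends on the documents you are processing.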

NOTE: at iText Group, we already have plenty of closed-source code that could save you a lot of time. Since we are the copyright owner of the iText library, we can charge for that closed-source code. That's something you typically can't do if you're using iText for free (because of the AGPL). However, if you are a customer of iText, we can probably disclose more source code. Do not expect us to give that code away for free, as it has too much commercial value.
