iTextSharp extract each character and getRectangle

Question

I would like to parse an entire PDF character by character and get the ASCII value, the font, and the rectangle of each character on the PDF document, which I can later use to save the character as a bitmap. I tried PdfTextExtractor.GetTextFromPage, but that returns the entire text of the page as a single string.

Recommended answer

The text extraction strategies bundled with iTextSharp (in particular the LocationTextExtractionStrategy, which the PdfTextExtractor.GetTextFromPage overload without a strategy argument uses by default) only allow direct access to the collected plain text, not to positions.

In an older answer, @Chris Haas presents the following extension of the LocationTextExtractionStrategy:

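//Namespaces used by the code in this answer (iTextSharp 5.x)
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {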
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //Get the bounding box for the chunk of text
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
                                                bottomLeft[Vector.I1],
                                                bottomLeft[Vector.I2],
                                                topRight[Vector.I1],
                                                topRight[Vector.I2]
                                                );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
    }
}

It makes use of this helper class:

//Helper class that stores our rectangle and text
public class RectAndText {
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public RectAndText(iTextSharp.text.Rectangle rect, String text) {
        this.Rect = rect;
        this.Text = text;
    }
}

This strategy makes the text chunks and their enclosing rectangles available in the public member List<RectAndText> myPoints, which you can access like this:

//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();

//Parse page 1 of the document above
using (var r = new PdfReader(testFile)) {
    var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
}

//Loop through each chunk found
foreach (var p in t.myPoints) {
    Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
}
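The Rect values printed above are PDF coordinates: by default the origin is at the bottom-left of the page and one unit is 1/72 inch. Since the question mentions saving characters as bitmaps later, here is a minimal sketch of the arithmetic for mapping such a rectangle to pixel coordinates with a top-left origin. The names PdfToPixels and ToPixelRect are hypothetical, and the sketch assumes an unrotated page whose lower-left corner is at (0, 0). Rendering the page to a bitmap is a separate step that needs a PDF renderer; it is not covered here.

```csharp
using System;

// Hypothetical helper, not part of iTextSharp.
public static class PdfToPixels
{
    // Converts a rectangle in PDF user space (origin bottom-left,
    // 72 units per inch) to pixel coordinates (origin top-left).
    // Assumption: unrotated page with its lower-left corner at (0, 0).
    // Returns { x, y, width, height } in pixels.
    public static int[] ToPixelRect(float left, float bottom, float right, float top,
                                    float pageHeight, float dpi)
    {
        float scale = dpi / 72f;

        // Flip the y axis: the top edge in PDF space becomes the smaller y in pixel space.
        int x = (int)Math.Floor(left * scale);
        int y = (int)Math.Floor((pageHeight - top) * scale);
        int x2 = (int)Math.Ceiling(right * scale);
        int y2 = (int)Math.Ceiling((pageHeight - bottom) * scale);

        return new[] { x, y, x2 - x, y2 - y };
    }
}
```

With the loop above, this could be called as PdfToPixels.ToPixelRect(p.Rect.Left, p.Rect.Bottom, p.Rect.Right, p.Rect.Top, pageHeight, dpi), where pageHeight would be read inside the using block, for example from r.GetPageSize(1).Height, and dpi is whatever resolution the page is rendered at.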

For your task, to parse an entire PDF character by character and get the ASCII value, font and rectangle of each character, the MyLocationTextExtractionStrategy shown above falls short in only two details:

  • The text chunks returned this way may contain multiple characters.
  • No font information is provided.

Thus, we have to tweak that a bit:

In addition to what the MyLocationTextExtractionStrategy class does, the CharLocationTextExtractionStrategy splits the input by glyph and also provides the font name:

public class CharLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
    //Hold each coordinate
    public List<RectAndTextAndFont> myPoints = new List<RectAndTextAndFont>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo wholeRenderInfo)
    {
        base.RenderText(wholeRenderInfo);

        foreach (TextRenderInfo renderInfo in wholeRenderInfo.GetCharacterRenderInfos())
        {
            //Get the bounding box for this single glyph
            var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
            var topRight = renderInfo.GetAscentLine().GetEndPoint();

            //Create a rectangle from it
            var rect = new iTextSharp.text.Rectangle(
                                                    bottomLeft[Vector.I1],
                                                    bottomLeft[Vector.I2],
                                                    topRight[Vector.I1],
                                                    topRight[Vector.I2]
                                                    );

            //Add this to our main collection
            this.myPoints.Add(new RectAndTextAndFont(rect, renderInfo.GetText(), renderInfo.GetFont().PostscriptFontName));
        }
    }
}

//Helper class that stores our rectangle, text, and font
public class RectAndTextAndFont
{
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public String Font;
    public RectAndTextAndFont(iTextSharp.text.Rectangle rect, String text, String font)
    {
        this.Rect = rect;
        this.Text = text;
        this.Font = font;
    }
}

The strategy can be used like this:

CharLocationTextExtractionStrategy strategy = new CharLocationTextExtractionStrategy();

using (var pdfReader = new PdfReader(testFile))
{
    PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
}

foreach (var p in strategy.myPoints)
{
    Console.WriteLine(string.Format("<{0}> in {3} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom, p.Font));
}

This gives you the information character by character, including the font name.
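The question also asks for the ASCII value of each character, which the output above does not show. The Text member is a .NET string, so the numeric values can be derived from it. Below is a minimal sketch; the class CharCodes and its methods are hypothetical helpers, not part of iTextSharp. It assumes that the text reported for one glyph may be empty or hold more than one character (for example a ligature), so it returns a list rather than a single value.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of iTextSharp.
public static class CharCodes
{
    // Returns the Unicode code points contained in a string.
    // For plain ASCII text these are the ASCII values.
    public static List<int> CodePoints(string text)
    {
        var result = new List<int>();
        int i = 0;
        while (i < text.Length)
        {
            if (char.IsSurrogatePair(text, i))
            {
                // Two UTF-16 code units forming one code point
                result.Add(char.ConvertToUtf32(text, i));
                i += 2;
            }
            else
            {
                // Single UTF-16 code unit (an unpaired surrogate is kept as its raw value)
                result.Add(text[i]);
                i += 1;
            }
        }
        return result;
    }

    // True if the code point lies in the 7-bit ASCII range.
    public static bool IsAscii(int codePoint)
    {
        return codePoint >= 0 && codePoint <= 0x7F;
    }
}
```

Inside the loop over strategy.myPoints, CharCodes.CodePoints(p.Text) would then give the values for each entry. Values above 127 are not ASCII, and IsAscii can be used to detect them.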
