iTextSharp - How to get the position of a word on a page

Views: 702 | Posted: 2016/8/26 20:04:50 | Tags: c# pdf itextsharp

Question

I am using iTextSharp and the reader.GetPageContent method to pull the text out of a PDF. I need to find the rectangle/position for each word found in the document. Is there any way to get the rectangle/position of a word in a PDF using iTextSharp?

Recommended answer

Yes, there is. Check out the text.pdf.parser package, specifically LocationTextExtractionStrategy. Actually, that might not do the trick either. You'll probably want to write your own TextExtractionStrategy to feed into PdfTextExtractor:

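// Java (iText) syntax; iTextSharp uses the same names in PascalCase
// (ITextExtractionStrategy, RenderText, GetTextFromPage, ...).
MyTexExStrat strat = new MyTexExStrat();
PdfTextExtractor.getTextFromPage(reader, pageNum, strat);
// get the strings-n-rects from strat.

public class MyTexExStrat implements TextExtractionStrategy {
    @Override public void beginTextBlock() {}
    @Override public void endTextBlock() {}
    @Override public void renderImage(ImageRenderInfo info) {}
    @Override public void renderText(TextRenderInfo info) {
        // track text and location here.
    }
    // The interface also requires this method; return the collected text
    // here if you need it.
    @Override public String getResultantText() { return ""; }
}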

You'll probably want to look at the source for LocationTextExtractionStrategy to see how it combines text that shares a baseline. You might even just modify LTES to store parallel arrays of strings and rects.
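As a rough illustration of that idea, here is a minimal sketch of merging chunks that share a baseline into words. It is not the LocationTextExtractionStrategy source and does not call iText at all: `WordGrouper`, `Chunk`, `groupIntoWords`, and both tolerance parameters are hypothetical names invented for this example. It assumes you have already recorded each chunk's text and rectangle in renderText, that the text is horizontal and left-to-right, and that no chunk contains a space in the middle.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of iText: merges positioned text chunks into words.
class WordGrouper {

    /** A piece of text plus its rectangle, as you might record it in renderText. */
    static final class Chunk {
        final String text;
        final float left, bottom, right, top;

        Chunk(String text, float left, float bottom, float right, float top) {
            this.text = text;
            this.left = left;
            this.bottom = bottom;
            this.right = right;
            this.top = top;
        }
    }

    /**
     * Groups chunks into lines by their bottom coordinate, then merges neighbours
     * on a line whose horizontal gap is at most gapTol. Whitespace-only chunks
     * end the current word and are dropped.
     */
    static List<Chunk> groupIntoWords(List<Chunk> chunks, float baselineTol, float gapTol) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y grows upward, so a larger bottom means higher on the page.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.bottom));

        // Split into lines: a chunk starts a new line when its bottom differs
        // from the first chunk of the current line by more than baselineTol.
        List<List<Chunk>> lines = new ArrayList<>();
        for (Chunk c : sorted) {
            List<Chunk> line = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (line == null || Math.abs(line.get(0).bottom - c.bottom) > baselineTol) {
                line = new ArrayList<>();
                lines.add(line);
            }
            line.add(c);
        }

        List<Chunk> words = new ArrayList<>();
        for (List<Chunk> line : lines) {
            line.sort(Comparator.comparingDouble((Chunk c) -> c.left));
            Chunk current = null;
            for (Chunk c : line) {
                if (c.text.trim().isEmpty()) {
                    // A space chunk terminates the word being built.
                    if (current != null) {
                        words.add(current);
                        current = null;
                    }
                } else if (current != null && c.left - current.right <= gapTol) {
                    // Close enough to the previous chunk: extend the word and its rectangle.
                    current = new Chunk(current.text + c.text,
                            current.left,
                            Math.min(current.bottom, c.bottom),
                            Math.max(current.right, c.right),
                            Math.max(current.top, c.top));
                } else {
                    if (current != null) {
                        words.add(current);
                    }
                    current = c;
                }
            }
            if (current != null) {
                words.add(current);
            }
        }
        return words;
    }
}
```

Each returned Chunk is one word with a rectangle covering all the pieces that were merged into it. The tolerances are in PDF user-space units and will need tuning per document; something around a fraction of the font size is a reasonable place to start.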

PS: to build the rects, you can just get the AscentLine & DescentLine and use those coordinates as the top and bottom corners:

Vector bottomLeft = info.getDescentLine().getStartPoint();
Vector topRight = info.getAscentLine().getEndPoint();
Rectangle rect = new Rectangle(bottomLeft.get(Vector.I1),
                               bottomLeft.get(Vector.I2),
                               topRight.get(Vector.I1),
                               topRight.get(Vector.I2));

Warning: The above code assumes that the text is horizontal and proceeds from left to right. Rotated text will screw it up, as will vertical text or right-to-left (Arabic, Hebrew) text. For most applications, the above should be fine, but know its limits.
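One way to soften that limitation, sketched here as an assumption rather than anything the answer or iText prescribes, is to take the minimum and maximum over all four line endpoints instead of trusting that the descent start is bottom-left and the ascent end is top-right. `BoundingBox` and `enclose` are hypothetical names; the points would be the start and end points of the ascent and descent lines, passed in as plain {x, y} pairs so the sketch has no iText dependency.

```java
// Hypothetical helper, not part of iText: axis-aligned box around arbitrary points.
class BoundingBox {

    /**
     * Returns {left, bottom, right, top} enclosing every {x, y} point given.
     * Feed it the start and end points of both the ascent and descent lines.
     */
    static float[] enclose(float[][] points) {
        float left = Float.POSITIVE_INFINITY;
        float bottom = Float.POSITIVE_INFINITY;
        float right = Float.NEGATIVE_INFINITY;
        float top = Float.NEGATIVE_INFINITY;
        for (float[] p : points) {
            left = Math.min(left, p[0]);
            right = Math.max(right, p[0]);
            bottom = Math.min(bottom, p[1]);
            top = Math.max(top, p[1]);
        }
        return new float[] {left, bottom, right, top};
    }
}
```

The four values can go straight into the Rectangle constructor in the same order. For rotated text the result is the enclosing upright box, so the rotation angle itself is lost and the box is larger than the glyphs; this does nothing for reading order in right-to-left text.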

Good hunting.
