iTextSharp - How to get the position of a word on a page


Question

I am using iTextSharp and the reader.GetPageContent method to pull the text out of a PDF. I need to find the rectangle/position of each word in the document. Is there any way to get the rectangle/position of a word in a PDF using iTextSharp?

Recommended answer

Yes, there is. Check out the text.pdf.parser package, specifically LocationTextExtractionStrategy. Actually, that might not do the trick either. You'll probably want to write your own TextExtractionStrategy to feed into PdfTextExtractor:

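// iText 5.x (Java) package names, assumed here; iTextSharp exposes the same types in C#.
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

// Usage (inside a method that already has a PdfReader `reader` and a 1-based `pageNum`):
MyTexExStrat strat = new MyTexExStrat();
PdfTextExtractor.getTextFromPage(reader, pageNum, strat);
// get the strings-n-rects from strat.

// Interface methods must be public, and the interface also requires getResultantText().
public class MyTexExStrat implements TextExtractionStrategy {
    public void beginTextBlock() {}
    public void endTextBlock() {}
    public void renderImage(ImageRenderInfo info) {}
    public void renderText(TextRenderInfo info) {
      // track text and location here.
    }
    public String getResultantText() { return ""; } // the plain-text result is not needed here
}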

You'll probably want to look at the source for LocationTextExtractionStrategy to see how it combines text that shares a baseline. You might even just modify LTES to store parallel arrays of strings and rects.
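The "parallel arrays of strings and rects" idea can be sketched without iText at all. What follows is a minimal sketch, not how LTES itself works: it assumes each piece of text handed to renderText has already been reduced to a string plus its left/bottom/right/top coordinates, that pieces arrive in reading order, and that a piece is either whitespace or contains no internal spaces. The names `WordGrouper` and `Chunk` and both tolerance values are hypothetical and would need tuning for real documents.

```java
import java.util.ArrayList;
import java.util.List;

public class WordGrouper {
    /** One piece of text as reported by the parser, with its box in page units. */
    public static final class Chunk {
        final String text;
        final float left, bottom, right, top;
        public Chunk(String text, float left, float bottom, float right, float top) {
            this.text = text; this.left = left; this.bottom = bottom;
            this.right = right; this.top = top;
        }
    }

    // Parallel lists: words.get(i) sits at rects.get(i) = {left, bottom, right, top}.
    public final List<String> words = new ArrayList<>();
    public final List<float[]> rects = new ArrayList<>();

    private static final float LINE_TOLERANCE = 1.0f; // assumed: max vertical drift within a line
    private static final float GAP_TOLERANCE = 1.0f;  // assumed: max horizontal gap inside a word

    private StringBuilder current = new StringBuilder();
    private float[] box; // box of the word being built, or null if none

    /** Feed chunks in reading order (horizontal, left-to-right text only). */
    public void add(Chunk c) {
        if (c.text.trim().isEmpty()) { flush(); return; } // whitespace ends the word
        boolean sameLine = box != null && Math.abs(c.bottom - box[1]) <= LINE_TOLERANCE;
        boolean adjacent = box != null && c.left - box[2] <= GAP_TOLERANCE;
        if (!(sameLine && adjacent)) flush(); // new line or big gap: start a new word
        if (box == null) {
            box = new float[] {c.left, c.bottom, c.right, c.top};
        } else {
            // Grow the current word's box to cover the new chunk.
            box[1] = Math.min(box[1], c.bottom);
            box[2] = Math.max(box[2], c.right);
            box[3] = Math.max(box[3], c.top);
        }
        current.append(c.text);
    }

    /** Close the word in progress, if any. Call once after the last chunk. */
    public void flush() {
        if (box != null) {
            words.add(current.toString());
            rects.add(box);
        }
        current = new StringBuilder();
        box = null;
    }
}
```

In a real strategy, renderText would build one `Chunk` per call from the TextRenderInfo and pass it to `add`, and the caller would read `words` and `rects` after the page has been processed and `flush()` has been called.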

PS: to build the rects, you can just get the AscentLine & DescentLine and use those coordinates as the top and bottom corners:

Vector bottomLeft = info.getDescentLine().getStartPoint();
Vector topRight = info.getAscentLine().getEndPoint();
Rectangle rect = new Rectangle(bottomLeft.get(Vector.I1),
                               bottomLeft.get(Vector.I2),
                               topRight.get(Vector.I1),
                               topRight.get(Vector.I2));

Warning: The above code ass-u-mes that the text is horizontal and proceeds from left to right. Rotated text will screw it up, as will vertical text or right-to-left (Arabic, Hebrew) text. For most applications, the above should be fine, but know its limits.
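One way to loosen that assumption is to use all four corners (the start and end points of both the descent line and the ascent line) and take the minimum and maximum of their coordinates. The sketch below is plain Java with a hypothetical `TextBounds.boundingBox` helper; it is not part of iText. It returns an axis-aligned box that encloses the text whatever its direction, which is not the same as a tight rectangle rotated with the text.

```java
public class TextBounds {
    /**
     * Axis-aligned box around any number of corner points, each given as {x, y}.
     * Returns {left, bottom, right, top}.
     */
    public static float[] boundingBox(float[]... corners) {
        float left = Float.POSITIVE_INFINITY, bottom = Float.POSITIVE_INFINITY;
        float right = Float.NEGATIVE_INFINITY, top = Float.NEGATIVE_INFINITY;
        for (float[] p : corners) {
            left = Math.min(left, p[0]);
            right = Math.max(right, p[0]);
            bottom = Math.min(bottom, p[1]);
            top = Math.max(top, p[1]);
        }
        return new float[] {left, bottom, right, top};
    }
}
```

With iText, the four points would come from getStartPoint() and getEndPoint() on both info.getDescentLine() and info.getAscentLine(), and the four returned values can be passed to the Rectangle constructor in the same order as above.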

Good hunting.
