How to get the text position from a PDF page in iText 7

Question

I am trying to find the position of a piece of text on a PDF page.

What I have tried so far is to get the text of the PDF page with PdfTextExtractor, using the simple text extraction strategy. I then loop over each word to check whether my word exists. The words are split using:

var Words = pdftextextractor.Split(new char[] { ' ', '\n' });

What I have not been able to do is find the position of the text. All I need is the y-coordinate of the word in the PDF file.

Recommended answer

First, SimpleTextExtractionStrategy is not exactly the "smartest" strategy (as the name would suggest).

Second, if you want the position, you are going to have to do a lot more work. A text extraction strategy assumes you are only interested in the text.

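Possible implementation:

  • implement IEventListener
  • get notified of every event that renders text, and store the corresponding TextRenderInfo object
  • once the document has been processed, sort these objects by their position on the page
  • loop over this list of TextRenderInfo objects; each one offers both the rendered text and its coordinates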

How to:

  1. implement ITextExtractionStrategy (or extend an existing implementation)
  2. use PdfTextExtractor.getTextFromPage(doc.getPage(pageNr), strategy), where strategy denotes the strategy you created in step 1
  3. set up your strategy to keep track of the locations of the text it has processed

ITextExtractionStrategy has the following method in its interface:

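@Override
public void eventOccurred(IEventData data, EventType type) {

    // first check the type of the event; only text rendering is of interest
    if (!type.equals(EventType.RENDER_TEXT))
        return;

    // now it is safe to cast
    TextRenderInfo renderInfo = (TextRenderInfo) data;

    // the baseline of the chunk carries its coordinates (Vector.I2 is the y component)
    float y = renderInfo.getBaseline().getStartPoint().get(Vector.I2);
}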

Keep in mind that the rendering instructions in a PDF do not need to appear in reading order. The text "Lorem Ipsum Dolor Sit Amet" could be rendered with instructions similar to:

render "Ipsum Do"
render "Lorem "
render "lor Sit Amet"

You will have to do some clever merging (depending on how far apart two TextRenderInfo objects are) and sorting (to get all the TextRenderInfo objects into the proper reading order).
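
The sorting and merging step can be sketched without iText at all. The code below is only an illustration under stated assumptions: Chunk is a hypothetical stand-in for the values you would copy out of each TextRenderInfo (its text plus the position and width of its baseline), ChunkSorter, inReadingOrder, mergeLines and findY are made-up names, chunks on one line are assumed to report exactly the same baseline y, and maxGap is an arbitrary threshold you would have to tune for your documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class ChunkSorter {

    /** Hypothetical stand-in for the values copied out of one TextRenderInfo. */
    static final class Chunk {
        final String text;
        final float x;     // left end of the baseline
        final float y;     // height of the baseline
        final float width; // horizontal extent of the text

        Chunk(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /** Top-to-bottom, then left-to-right; PDF y coordinates grow upward. */
    static List<Chunk> inReadingOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));
        return sorted;
    }

    /** Joins neighbours on the same baseline whose gap is at most maxGap. */
    static List<Chunk> mergeLines(List<Chunk> sorted, float maxGap) {
        List<Chunk> merged = new ArrayList<>();
        for (Chunk next : sorted) {
            Chunk last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && last.y == next.y
                    && next.x - (last.x + last.width) <= maxGap) {
                // extend the previous chunk so it also covers the next one
                float width = next.x + next.width - last.x;
                merged.set(merged.size() - 1,
                        new Chunk(last.text + next.text, last.x, last.y, width));
            } else {
                merged.add(next);
            }
        }
        return merged;
    }

    /** Returns the baseline y of the first merged chunk containing the word. */
    static Optional<Float> findY(List<Chunk> merged, String word) {
        for (Chunk c : merged) {
            if (c.text.contains(word)) {
                return Optional.of(c.y);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // the three chunks arrive in rendering order, not reading order
        List<Chunk> raw = Arrays.asList(
                new Chunk("Ipsum Do", 30f, 700f, 40f),
                new Chunk("Lorem ", 0f, 700f, 30f),
                new Chunk("lor Sit Amet", 70f, 700f, 60f));
        List<Chunk> lines = mergeLines(inReadingOrder(raw), 1f);
        for (Chunk line : lines) {
            System.out.println(line.text + " @ y=" + line.y);
        }
        System.out.println(findY(lines, "Dolor"));
    }
}
```

Merging before searching matters here because "Dolor" is split across two rendering instructions; neither raw chunk contains the word on its own. A real strategy would probably compare baselines with a small tolerance rather than exact equality.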

Once that is done, it should be easy.
