Search for text and get its position in a PDF with Java


Question

How can I search for text in a PDF and get its position with Java? I tried Apache PDFBox and PDF Clown, but whenever the text wraps to the next line or a new paragraph starts, the search fails. I want to get the same result as in the picture below.

Thanks.

Expected result

Recommended answer

You referred to one of my earlier answers as a PDFBox example that did not work for you. Indeed, as already explained in that answer, it was a surprise to see that code match anything beyond single words, because the callers of the routine overridden there gave the impression of calling it word by word. So anything spanning more than a single line could hardly be expected to be found.

That example can be improved quite naturally to allow searches across line breaks, assuming lines are split at spaces. Replace the method findSubwords with this improved version:

List<TextPositionSequence> findSubwordsImproved(PDDocument document, int page, String searchTerm) throws IOException
{
    final List<TextPosition> allTextPositions = new ArrayList<>();
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            allTextPositions.addAll(textPositions);
            super.writeString(text, textPositions);
        }

        @Override
        protected void writeLineSeparator() throws IOException {
            // At each line break, append a virtual space so that search terms
            // containing a space can match across the line boundary.
            if (!allTextPositions.isEmpty()) {
                TextPosition last = allTextPositions.get(allTextPositions.size() - 1);
                // Skip the virtual space if the line already ends in a real one.
                if (!" ".equals(last.getUnicode())) {
                    // Place the zero-width space at the end of the last glyph of the line.
                    Matrix textMatrix = last.getTextMatrix().clone();
                    textMatrix.setValue(2, 0, last.getEndX());
                    textMatrix.setValue(2, 1, last.getEndY());
                    TextPosition separatorSpace = new TextPosition(last.getRotation(), last.getPageWidth(), last.getPageHeight(),
                            textMatrix, last.getEndX(), last.getEndY(), last.getHeight(), 0, last.getWidthOfSpace(), " ",
                            new int[] {' '}, last.getFont(), last.getFontSize(), (int) last.getFontSizeInPt());
                    allTextPositions.add(separatorSpace);
                }
            }
            super.writeLineSeparator();
        }
    };
    
    stripper.setSortByPosition(true);
    stripper.setStartPage(page);
    stripper.setEndPage(page);
    stripper.getText(document);

    final List<TextPositionSequence> hits = new ArrayList<>();
    TextPositionSequence word = new TextPositionSequence(allTextPositions);
    String string = word.toString();

    int fromIndex = 0;
    int index;
    while ((index = string.indexOf(searchTerm, fromIndex)) > -1)
    {
        hits.add(word.subSequence(index, index + searchTerm.length()));
        fromIndex = index + 1;
    }

    return hits;
}

(SearchSubword method)

Here we collect all TextPosition entries; we even add virtual entries representing a space whenever PDFBox adds a line break. Once the whole page has been processed, we search the collection of all these text positions.
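The idea can also be looked at in isolation from PDFBox. The following is only a minimal sketch under stated assumptions: `Glyph` is a hypothetical stand-in for a TextPosition, `line` fabricates evenly spaced glyphs, and every glyph is assumed to carry exactly one character so that string indices map directly to list indices. It is not the code from the answer, just an illustration of the "virtual space at each line break" technique.

```java
import java.util.ArrayList;
import java.util.List;

class LineBreakSearch {
    /** Hypothetical stand-in for a PDFBox TextPosition: one character plus its coordinates. */
    static final class Glyph {
        final char ch;
        final float x, y;
        Glyph(char ch, float x, float y) { this.ch = ch; this.x = x; this.y = y; }
    }

    /** Builds the glyphs of one line; every character is assumed to be 10 units wide. */
    static List<Glyph> line(String text, float y) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(text.charAt(i), i * 10f, y));
        }
        return glyphs;
    }

    /** Concatenates lines, appending a virtual space after each line not already ending in one. */
    static List<Glyph> flatten(List<List<Glyph>> lines) {
        List<Glyph> all = new ArrayList<>();
        for (List<Glyph> line : lines) {
            all.addAll(line);
            if (!all.isEmpty()) {
                Glyph last = all.get(all.size() - 1);
                if (last.ch != ' ') {
                    // Virtual space placed right after the last glyph of the line.
                    all.add(new Glyph(' ', last.x + 10f, last.y));
                }
            }
        }
        return all;
    }

    /** Returns the start index of every occurrence of searchTerm in the flattened glyph list. */
    static List<Integer> find(List<Glyph> all, String searchTerm) {
        List<Integer> hits = new ArrayList<>();
        if (searchTerm.isEmpty()) {
            return hits; // an empty term would otherwise match forever
        }
        StringBuilder text = new StringBuilder();
        for (Glyph g : all) {
            text.append(g.ch);
        }
        int fromIndex = 0;
        int index;
        while ((index = text.indexOf(searchTerm, fromIndex)) > -1) {
            hits.add(index);
            fromIndex = index + 1;
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Glyph> all = flatten(List.of(line("foo ${var", 100f), line("2} bar", 115f)));
        String term = "${var 2}";
        for (int start : find(all, term)) {
            Glyph first = all.get(start);
            Glyph last = all.get(start + term.length() - 1);
            System.out.println("Hit from " + first.x + ", " + first.y + " to " + last.x + ", " + last.y);
        }
    }
}
```

Because the search runs over one flat list, a hit is simply a range of indices, and the first and last glyph of that range may well sit on different lines.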

Applied to the example document in the original question, looking for "${var 2}" now returns all 8 occurrences, including those split across lines:

* Looking for '${var 2}' (improved)
  Page 1 at 164.39648, 257.65997 with width 37.078552 and last letter '}' at 195.62, 257.65997
  Page 1 at 188.75699, 273.74 with width 37.108047 and last letter '}' at 220.01, 273.74
  Page 1 at 167.49583, 289.72998 with width 40.55017 and last letter '}' at 198.74, 289.72998
  Page 1 at 176.67778, 305.81 with width 38.059418 and last letter '}' at 207.89, 305.81
  Page 1 at 164.39648, 357.28998 with width -46.081444 and last letter '}' at 112.46, 372.65
  Page 1 at 174.97762, 388.72998 with width -56.662575 and last letter '}' at 112.46, 404.09
  Page 1 at 153.74, 420.16998 with width -32.004005 and last letter '}' at 112.46, 435.65
  Page 1 at 162.99922, 451.61 with width -43.692017 and last letter '}' at 112.46, 467.21

The negative widths occur because the x coordinate of the end of the match is less than that of its start.
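If one rectangle per line is needed (for highlighting, say), a wrapped hit can be split wherever the baseline changes. The sketch below is an assumption-laden illustration, not part of the answer's code: `Glyph` and `Segment` are hypothetical types, the width of a run is assumed to be the end x of its last glyph minus the start x of its first, and two glyphs count as being on the same line when their y values differ by no more than a chosen tolerance.

```java
import java.util.ArrayList;
import java.util.List;

class HitLineSplitter {
    /** Hypothetical stand-in for a TextPosition: left edge, baseline and advance width. */
    static final class Glyph {
        final double x, y, width;
        Glyph(double x, double y, double width) { this.x = x; this.y = y; this.width = width; }
    }

    /** The part of a hit that lies on a single line. */
    static final class Segment {
        final double x, y, width;
        Segment(double x, double y, double width) { this.x = x; this.y = y; this.width = width; }
    }

    /** Width over the whole hit; negative when the hit ends further left than it starts. */
    static double naiveWidth(List<Glyph> hit) {
        Glyph first = hit.get(0);
        Glyph last = hit.get(hit.size() - 1);
        return last.x + last.width - first.x;
    }

    /** Splits a hit into one segment per line, starting a new segment when y changes by more than tolerance. */
    static List<Segment> splitByLine(List<Glyph> hit, double tolerance) {
        List<Segment> segments = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= hit.size(); i++) {
            boolean lineEnds = i == hit.size()
                    || Math.abs(hit.get(i).y - hit.get(start).y) > tolerance;
            if (lineEnds) {
                Glyph first = hit.get(start);
                Glyph last = hit.get(i - 1);
                segments.add(new Segment(first.x, first.y, last.x + last.width - first.x));
                start = i;
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        List<Glyph> hit = List.of(
                new Glyph(160, 357, 6), new Glyph(166, 357, 6), new Glyph(172, 357, 6),
                new Glyph(100, 372, 6), new Glyph(106, 372, 6));
        System.out.println("Naive width: " + naiveWidth(hit));
        for (Segment s : splitByLine(hit, 1.0)) {
            System.out.println("Segment at " + s.x + ", " + s.y + " with width " + s.width);
        }
    }
}
```

With real PDFBox data, the glyph values would presumably come from the text positions inside each hit; which accessors to use depends on whether direction-adjusted coordinates are wanted.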
