How to search for a specific string or word and find its coordinates in a PDF document in Java


Question

I am using PDFBox to search for a word (or string) in a PDF file, and I also want to know the coordinates of that word. For example, the PDF file contains a string like "${abc}", and I want to know the coordinates of this string. I tried a couple of examples but did not get the result I wanted: the output shows the coordinates of each character instead.

Here is the code:

@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
    for(TextPosition text : textPositions) {


        System.out.println( "String[" + text.getXDirAdj() + "," +
                text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale=" +
                text.getXScale() + " height=" + text.getHeightDir() + " space=" +
                text.getWidthOfSpace() + " width=" +
                text.getWidthDirAdj() + "]" + text.getUnicode());

    }
}

I am using PDFBox 2.0.

Recommended answer

The last method in which PDFBox's PDFTextStripper class still has text with positions (before it is reduced to plain text) is the method

/**
 * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
 * and just calls {@link #writeString(String)}.
 *
 * @param text The text to write to the stream.
 * @param textPositions The TextPositions belonging to the text.
 * @throws IOException If there is an error when writing the text.
 */
protected void writeString(String text, List<TextPosition> textPositions) throws IOException

One should intercept here because this method receives pre-processed TextPosition objects, in particular sorted ones (if sorting was requested to start with).

(Actually, I would have preferred to intercept in the calling method writeLine, which, judging by the names of its parameters and local variables, has all the TextPosition instances of a line and calls writeString once per word. Unfortunately, the PDFBox developers have declared this method private. Maybe this will change before the final 2.0.0 release... nudge, nudge. Update: unfortunately, it has not changed in the release... sigh.)

Furthermore, it helps to wrap sequences of TextPosition instances in a String-like helper class to make the code clearer.

With this in mind, one can search for the variables like this:

List<TextPositionSequence> findSubwords(PDDocument document, int page, String searchTerm) throws IOException
{
    final List<TextPositionSequence> hits = new ArrayList<TextPositionSequence>();
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            TextPositionSequence word = new TextPositionSequence(textPositions);
            String string = word.toString();

            int fromIndex = 0;
            int index;
            while ((index = string.indexOf(searchTerm, fromIndex)) > -1)
            {
                hits.add(word.subSequence(index, index + searchTerm.length()));
                fromIndex = index + 1;
            }
            super.writeString(text, textPositions);
        }
    };

    stripper.setSortByPosition(true);
    stripper.setStartPage(page);
    stripper.setEndPage(page);
    stripper.getText(document);
    return hits;
}
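The loop inside writeString advances fromIndex by one rather than by the length of the search term, so overlapping occurrences are reported as well. A minimal sketch of that loop on a plain String, without PDFBox, may make this easier to see. The names OverlapSearchSketch and findAll are hypothetical and not part of the answer's code, and the guard for an empty search term is an added assumption (without it, indexOf would keep returning a valid index and the loop would not terminate).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone illustration of the indexOf loop used in findSubwords.
final class OverlapSearchSketch
{
    // Returns the start index of every occurrence of searchTerm in text,
    // including occurrences that overlap an earlier one.
    static List<Integer> findAll(String text, String searchTerm)
    {
        List<Integer> starts = new ArrayList<>();
        if (searchTerm.isEmpty())
        {
            return starts; // assumption: an empty term yields no hits
        }
        int fromIndex = 0;
        int index;
        while ((index = text.indexOf(searchTerm, fromIndex)) > -1)
        {
            starts.add(index);
            fromIndex = index + 1; // step by one so overlapping hits are kept
        }
        return starts;
    }

    public static void main(String[] args)
    {
        System.out.println(findAll("a ${var1} and ${var1}", "${var1}"));
        System.out.println(findAll("aaa", "aa"));
    }
}
```

In findSubwords each start index is then turned into a sub-sequence of glyph positions, which is why the string built from the TextPosition list has to line up with that list index for index.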

This uses the following helper class:

public class TextPositionSequence implements CharSequence
{
    public TextPositionSequence(List<TextPosition> textPositions)
    {
        this(textPositions, 0, textPositions.size());
    }

    public TextPositionSequence(List<TextPosition> textPositions, int start, int end)
    {
        this.textPositions = textPositions;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length()
    {
        return end - start;
    }

    @Override
    public char charAt(int index)
    {
        TextPosition textPosition = textPositionAt(index);
        String text = textPosition.getUnicode();
        return text.charAt(0);
    }

    @Override
    public TextPositionSequence subSequence(int start, int end)
    {
        return new TextPositionSequence(textPositions, this.start + start, this.start + end);
    }

    @Override
    public String toString()
    {
        StringBuilder builder = new StringBuilder(length());
        for (int i = 0; i < length(); i++)
        {
            builder.append(charAt(i));
        }
        return builder.toString();
    }

    public TextPosition textPositionAt(int index)
    {
        return textPositions.get(start + index);
    }

    public float getX()
    {
        return textPositions.get(start).getXDirAdj();
    }

    public float getY()
    {
        return textPositions.get(start).getYDirAdj();
    }

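    public float getWidth()
    {
        TextPosition first = textPositions.get(start);
        // end is exclusive, so the last glyph of this sequence is at end - 1;
        // get(end) would read the glyph after the match or run past the list
        TextPosition last = textPositions.get(end - 1);
        return last.getWidthDirAdj() + last.getXDirAdj() - first.getXDirAdj();
    }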

    final List<TextPosition> textPositions;
    final int start, end;
}
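The width of a hit is the right edge of its last glyph minus the left edge of its first glyph. Since end is an exclusive index (length() is end - start), the last glyph sits at index end - 1. The sketch below replays that arithmetic on a hypothetical Glyph stand-in, which is not a PDFBox type, so that it runs without a PDF. The coordinates are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the two TextPosition values the width needs.
final class Glyph
{
    final float x;     // plays the role of getXDirAdj()
    final float width; // plays the role of getWidthDirAdj()

    Glyph(float x, float width)
    {
        this.x = x;
        this.width = width;
    }
}

final class WidthSketch
{
    // Width of glyphs[start, end): right edge of the last glyph
    // minus left edge of the first glyph.
    static float width(List<Glyph> glyphs, int start, int end)
    {
        Glyph first = glyphs.get(start);
        Glyph last = glyphs.get(end - 1); // end is exclusive
        return last.x + last.width - first.x;
    }

    public static void main(String[] args)
    {
        List<Glyph> line = Arrays.asList(
                new Glyph(10f, 5f), new Glyph(15f, 5f), new Glyph(20f, 4f));
        System.out.println(width(line, 0, 3)); // all three glyphs
        System.out.println(width(line, 1, 2)); // the middle glyph only
    }
}
```

This assumes left-to-right text on a single line. A hit that wraps onto a second line would need one box per line.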

To merely output their positions, widths, final letters, and final letter positions, you can then use this:

void printSubwords(PDDocument document, String searchTerm) throws IOException
{
    System.out.printf("* Looking for '%s'\n", searchTerm);
    for (int page = 1; page <= document.getNumberOfPages(); page++)
    {
        List<TextPositionSequence> hits = findSubwords(document, page, searchTerm);
        for (TextPositionSequence hit : hits)
        {
            TextPosition lastPosition = hit.textPositionAt(hit.length() - 1);
            System.out.printf("  Page %s at %s, %s with width %s and last letter '%s' at %s, %s\n",
                    page, hit.getX(), hit.getY(), hit.getWidth(),
                    lastPosition.getUnicode(), lastPosition.getXDirAdj(), lastPosition.getYDirAdj());
        }
    }
}

For tests I created a small test file using MS Word.

This test prints the output shown after it:

@Test
public void testVariables() throws IOException
{
    try (   InputStream resource = getClass().getResourceAsStream("Variables.pdf");
            PDDocument document = PDDocument.load(resource);    )
    {
        System.out.println("\nVariables.pdf\n-------------\n");
        printSubwords(document, "${var1}");
        printSubwords(document, "${var 2}");
    }
}

Variables.pdf
-------------

* Looking for '${var1}'
  Page 1 at 164.39648, 158.06 with width 34.67856 and last letter '}' at 193.22, 158.06
  Page 1 at 188.75699, 174.13995 with width 34.58806 and last letter '}' at 217.49, 174.13995
  Page 1 at 167.49583, 190.21997 with width 38.000168 and last letter '}' at 196.22, 190.21997
  Page 1 at 176.67009, 206.18 with width 35.667114 and last letter '}' at 205.49, 206.18

* Looking for '${var 2}'
  Page 1 at 164.39648, 257.65997 with width 37.078552 and last letter '}' at 195.62, 257.65997
  Page 1 at 188.75699, 273.74 with width 37.108047 and last letter '}' at 220.01, 273.74
  Page 1 at 167.49583, 289.72998 with width 40.55017 and last letter '}' at 198.74, 289.72998
  Page 1 at 176.67778, 305.81 with width 38.059418 and last letter '}' at 207.89, 305.81

I was a bit surprised that ${var 2} was found when it sits on a single line. After all, the PDFBox code made me assume that the writeString method I overrode only receives words; it looks as if it receives longer parts of the line than mere words...

If you need other data from the grouped TextPosition instances, simply enhance TextPositionSequence accordingly.
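One such enhancement might be a bounding box for a hit, for instance to draw a highlight over it. The sketch below is only an illustration under stated assumptions: GlyphBox is a hypothetical stand-in for the values returned by getXDirAdj, getYDirAdj, getWidthDirAdj and getHeightDir. It assumes that y is the glyph baseline measured downwards from the top of the page, and that all glyphs of the hit share one left-to-right line. The helper toBottomLeftY is likewise hypothetical and converts the result to a coordinate system whose origin is the bottom-left corner of the page.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the TextPosition values a bounding box needs.
final class GlyphBox
{
    final float x, y, width, height;

    GlyphBox(float x, float y, float width, float height)
    {
        this.x = x;
        this.y = y;           // assumed: baseline, measured from the page top
        this.width = width;
        this.height = height;
    }
}

final class BoundsSketch
{
    // Returns {x, topY, width, height} for glyphs[start, end) in the same
    // top-left-origin coordinates as the inputs.
    static float[] bounds(List<GlyphBox> glyphs, int start, int end)
    {
        GlyphBox first = glyphs.get(start);
        GlyphBox last = glyphs.get(end - 1); // end is exclusive
        float height = 0f;
        for (int i = start; i < end; i++)
        {
            height = Math.max(height, glyphs.get(i).height); // tallest glyph
        }
        float width = last.x + last.width - first.x;
        return new float[] { first.x, first.y - height, width, height };
    }

    // Bottom edge of a box in bottom-left-origin coordinates, given its top
    // edge and height in top-left-origin coordinates.
    static float toBottomLeftY(float pageHeight, float topY, float height)
    {
        return pageHeight - (topY + height);
    }

    public static void main(String[] args)
    {
        List<GlyphBox> hit = Arrays.asList(
                new GlyphBox(10f, 100f, 5f, 8f), new GlyphBox(15f, 100f, 9f, 10f));
        float[] box = bounds(hit, 0, 2);
        System.out.println(Arrays.toString(box));
        System.out.println(toBottomLeftY(792f, box[1], box[3]));
    }
}
```

In TextPositionSequence itself the same arithmetic would read from textPositions between start and end instead of from a separate list.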
