Using PDFBox to determine the coordinates of words in a document

Views: 502 | Published: 2020/5/25 3:55:14 | Tags: java, pdf, pdfbox

Question

I'm using PDFBox to extract the coordinates of words/strings in a PDF document, and have so far had success determining the position of individual characters. This is the code thus far, from the PDFBox documentation:

package printtextlocations;

import java.io.File;
import org.apache.pdfbox.exceptions.InvalidPasswordException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

import java.io.IOException;
import java.util.List;

public class PrintTextLocations extends PDFTextStripper {

    public PrintTextLocations() throws IOException {
        super.setSortByPosition(true);
    }

    public static void main(String[] args) throws Exception {

        PDDocument document = null;
        try {
            File input = new File("C:\\path\\to\\PDF.pdf");
            document = PDDocument.load(input);
            if (document.isEncrypted()) {
                try {
                    document.decrypt("");
                } catch (InvalidPasswordException e) {
                    System.err.println("Error: Document is encrypted with a password.");
                    System.exit(1);
                }
            }
            PrintTextLocations printer = new PrintTextLocations();
            List<?> allPages = document.getDocumentCatalog().getAllPages();
            for (int i = 0; i < allPages.size(); i++) {
                PDPage page = (PDPage) allPages.get(i);
                System.out.println("Processing page: " + i);
                PDStream contents = page.getContents();
                if (contents != null) {
                    // Each character on the page is reported to processTextPosition() below.
                    printer.processStream(page, page.findResources(), contents.getStream());
                }
            }
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }

    /**
     * @param text The text to be processed
     */
    @Override /* this is questionable, not sure if needed... */
    protected void processTextPosition(TextPosition text) {
        System.out.println("String[" + text.getXDirAdj() + ","
                + text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
                + text.getXScale() + " height=" + text.getHeightDir() + " space="
                + text.getWidthOfSpace() + " width="
                + text.getWidthDirAdj() + "]" + text.getCharacter());
    }
}

This produces a series of lines containing the position of each character, including spaces, that looks like this:

String[202.5604,41.880127 fs=1.0 xscale=13.98 height=9.68814 space=3.8864403 width=9.324661]P

Here 'P' is the character. I have not been able to find a function in PDFBox that finds words, and I am not familiar enough with Java to accurately concatenate these characters back into words to search through, even though the spaces are also included. Has anyone else been in a similar situation, and if so, how did you approach it? I really only need the coordinates of the first character in each word, so that part is simplified, but how to match a string against this kind of output is beyond me.
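One possible way to approach the grouping described above, as a minimal sketch rather than a PDFBox feature: collect each character with its position, split the sequence on whitespace (and on a jump in the y coordinate, since a line may end without a space), and keep the coordinates of the first character of each word. Everything here (WordGrouper, Glyph, Word, groupIntoWords, find, lineTolerance) is a hypothetical name introduced for illustration. It assumes the characters arrive in reading order, which is what setSortByPosition(true) is meant to give. In the code above, processTextPosition could add a new Glyph(text.getCharacter(), text.getXDirAdj(), text.getYDirAdj()) to a list instead of printing.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of PDFBox): groups positioned characters into words. */
final class WordGrouper {

    /** One character and its position, e.g. copied from a TextPosition. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** A word together with the coordinates of its first character. */
    static final class Word {
        final String text;
        final float x;
        final float y;

        Word(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Splits glyphs into words. A word ends at a whitespace glyph, or when a
     * glyph's y differs from the word's first glyph by more than lineTolerance.
     * Assumes the glyphs are already in reading order.
     */
    static List<Word> groupIntoWords(List<Glyph> glyphs, float lineTolerance) {
        List<Word> words = new ArrayList<Word>();
        StringBuilder current = new StringBuilder();
        Glyph first = null; // first glyph of the word being built, or null
        for (Glyph g : glyphs) {
            boolean whitespace = g.text.trim().isEmpty();
            boolean newLine = first != null && Math.abs(g.y - first.y) > lineTolerance;
            if ((whitespace || newLine) && first != null) {
                words.add(new Word(current.toString(), first.x, first.y));
                current.setLength(0);
                first = null;
            }
            if (!whitespace) {
                if (first == null) {
                    first = g;
                }
                current.append(g.text);
            }
        }
        if (first != null) { // flush the last word
            words.add(new Word(current.toString(), first.x, first.y));
        }
        return words;
    }

    /** Returns the first word whose text equals target, or null if there is none. */
    static Word find(List<Word> words, String target) {
        for (Word w : words) {
            if (w.text.equals(target)) {
                return w;
            }
        }
        return null;
    }
}
```

Searching then reduces to WordGrouper.find(words, "someWord") and reading x and y from the result. The sketch treats punctuation as part of a word and does not handle words split across differently styled runs or rotated text; a suitable lineTolerance depends on the document's font sizes.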
