Using PDFBox to determine the coordinates of words in a document

Question

I'm using PDFBox to extract the coordinates of words/strings in a PDF document, and so far I have managed to determine the position of individual characters. This is the code so far, taken from the PDFBox documentation:

package printtextlocations;

import java.io.*;
import org.apache.pdfbox.exceptions.InvalidPasswordException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

import java.io.IOException;
import java.util.List;

public class PrintTextLocations extends PDFTextStripper {

    public PrintTextLocations() throws IOException {
        super.setSortByPosition(true);
    }

    public static void main(String[] args) throws Exception {

        PDDocument document = null;
        try {
            File input = new File("C:\\path\\to\\PDF.pdf");
            document = PDDocument.load(input);
            if (document.isEncrypted()) {
                try {
                    document.decrypt("");
                } catch (InvalidPasswordException e) {
                    System.err.println("Error: Document is encrypted with a password.");
                    System.exit(1);
                }
            }
            PrintTextLocations printer = new PrintTextLocations();
            List allPages = document.getDocumentCatalog().getAllPages();
            for (int i = 0; i < allPages.size(); i++) {
                PDPage page = (PDPage) allPages.get(i);
                System.out.println("Processing page: " + i);
                PDStream contents = page.getContents();
                if (contents != null) {
                    printer.processStream(page, page.findResources(), page.getContents().getStream());
                }
            }
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }

    /**
     * @param text The text to be processed
     */
    @Override /* this is questionable, not sure if needed... */
    protected void processTextPosition(TextPosition text) {
        System.out.println("String[" + text.getXDirAdj() + ","
                + text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
                + text.getXScale() + " height=" + text.getHeightDir() + " space="
                + text.getWidthOfSpace() + " width="
                + text.getWidthDirAdj() + "]" + text.getCharacter());
    }
}

This produces a series of lines containing the position of each character, including spaces, that looks like this:

String[202.5604,41.880127 fs=1.0 xscale=13.98 height=9.68814 space=3.8864403 width=9.324661]P

Where 'P' is the character. I have not been able to find a function in PDFBox that finds words, and I am not familiar enough with Java to accurately concatenate these characters back into words to search through, even though the spaces are included. Has anyone else been in a similar situation, and if so, how did you approach it? I really only need the coordinates of the first character of each word, so that part is simplified, but how to match a string against this kind of output is beyond me.

Recommended answer

There is no function in PDFBox that allows you to extract words automatically. I'm currently working on extracting data to gather it into blocks and here is my process:

  1. I extract all the characters of the document (called glyphs) and store them in a list.

I analyze the coordinates of each glyph, looping over the list. If two consecutive glyphs overlap vertically (that is, if the top of the current glyph lies between the top and bottom of the preceding one, or the bottom of the current glyph lies between the top and bottom of the preceding one), I add the current glyph to the same line.

At this point, I have extracted the different lines of the document. Be careful: if your document is multi-column, a "line" here means all the glyphs that overlap vertically, i.e. the text of all the columns that share the same vertical coordinates.
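The vertical-overlap test described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the answer: `Glyph` is a hypothetical stand-in for the values you would read from a PDFBox TextPosition (for example via getYDirAdj() and getHeightDir()), `top` and `bottom` are assumed to be measured with y increasing down the page, and the glyphs are assumed to arrive already sorted by position.

```java
import java.util.ArrayList;
import java.util.List;

final class LineGrouper {

    /** Hypothetical stand-in for the data read from a TextPosition (not a PDFBox class). */
    static final class Glyph {
        final String text;
        final float top;     // assumed: upper edge, y grows downward
        final float height;

        Glyph(String text, float top, float height) {
            this.text = text;
            this.top = top;
            this.height = height;
        }

        float bottom() {
            return top + height;
        }
    }

    /** True if the top or the bottom of cur lies within the vertical extent of prev. */
    static boolean overlapsVertically(Glyph prev, Glyph cur) {
        boolean topInside = cur.top >= prev.top && cur.top <= prev.bottom();
        boolean bottomInside = cur.bottom() >= prev.top && cur.bottom() <= prev.bottom();
        return topInside || bottomInside;
    }

    /** Starts a new line whenever a glyph does not overlap the glyph before it. */
    static List<List<Glyph>> groupIntoLines(List<Glyph> glyphs) {
        List<List<Glyph>> lines = new ArrayList<>();
        List<Glyph> current = new ArrayList<>();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null && !overlapsVertically(prev, g)) {
                lines.add(current);
                current = new ArrayList<>();
            }
            current.add(g);
            prev = g;
        }
        if (!current.isEmpty()) {
            lines.add(current);
        }
        return lines;
    }
}
```

Note that the test as stated compares each glyph only with the one immediately before it, so a glyph that is taller than its predecessor on both sides would start a new line; whether that matters depends on the fonts in your document.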

Then you can compare the left coordinate of the current glyph with the right coordinate of the preceding one to determine whether they belong to the same word (the PDFTextStripper class provides a getSpacingTolerance() method that gives you, based on trial and error, the value of a "normal" space). If the difference between the left coordinate of the current glyph and the right coordinate of the preceding one is lower than this value, both glyphs belong to the same word.
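A minimal sketch of that gap comparison, applied to the glyphs of one line, might look like this. Everything here is illustrative: `Glyph`, `Word`, and `maxGap` are hypothetical names, not PDFBox API. In particular, `maxGap` is assumed to be an absolute distance in the same units as the coordinates; as far as I can tell, getSpacingTolerance() is used inside PDFTextStripper as a factor applied to the width of a space (see getWidthOfSpace() in the question's output), so you may need to scale it before using it as a threshold. The sketch also treats whitespace glyphs as word breaks, since the question's output includes them, and records the left coordinate of each word's first glyph.

```java
import java.util.ArrayList;
import java.util.List;

final class WordSplitter {

    /** Hypothetical stand-in for the data read from a TextPosition (not a PDFBox class). */
    static final class Glyph {
        final String text;
        final float left;
        final float width;

        Glyph(String text, float left, float width) {
            this.text = text;
            this.left = left;
            this.width = width;
        }

        float right() {
            return left + width;
        }
    }

    /** A reassembled word and the left coordinate of its first glyph. */
    static final class Word {
        final String text;
        final float x;

        Word(String text, float x) {
            this.text = text;
            this.x = x;
        }
    }

    /** Splits one line of glyphs (sorted left to right) at spaces and at gaps of maxGap or more. */
    static List<Word> splitIntoWords(List<Glyph> line, float maxGap) {
        List<Word> words = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        float startX = 0f;
        Glyph prev = null;
        for (Glyph g : line) {
            boolean isSpace = g.text.trim().isEmpty();
            boolean gapTooWide = prev != null && (g.left - prev.right()) >= maxGap;
            // Close the word in progress when a space glyph or a wide gap is seen.
            if ((isSpace || gapTooWide) && sb.length() > 0) {
                words.add(new Word(sb.toString(), startX));
                sb.setLength(0);
            }
            if (!isSpace) {
                if (sb.length() == 0) {
                    startX = g.left; // first glyph of a new word
                }
                sb.append(g.text);
            }
            prev = g;
        }
        if (sb.length() > 0) {
            words.add(new Word(sb.toString(), startX));
        }
        return words;
    }
}
```

With the words and their starting coordinates in a list, matching a search string becomes a plain comparison against each `Word.text`.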

I applied this method in my work and it works well.
