PDFBox - getting word locations (and not only characters')

Problem description

Is it possible to get the locations of words using PDFBox, similar to processTextPosition? It seems that processTextPosition is called on single characters only, and the code that merges them into words is part of PDFTextStripper (in the "normalize" method), which does return the location of the text. Is there a method or utility that extracts the location as well? (For those wondering about the motivation: the information is actually a table, and we would like to detect empty cells.) Thanks

Recommended answer

To get the words in the text extracted from a PDF file together with their x and y positions, you have to extend the PDFTextStripper class and use that custom class to extract the text, e.g.:

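import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

// Written against PDFBox 1.8.x; in 2.x these classes live in org.apache.pdfbox.text.
public class CustomPDFTextStripper extends PDFTextStripper {

    public CustomPDFTextStripper() throws IOException {
        super();
    }

    /**
     * Override the default functionality of PDFTextStripper: write each word
     * together with the position of its first character.
     */
    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        if (textPositions.isEmpty()) {
            writeString(text); // no position available, fall back to the plain text
            return;
        }
        TextPosition firstPosition = textPositions.get(0);
        // getTextPos() is the PDFBox 1.8.x accessor; newer versions may need
        // getXDirAdj() / getYDirAdj() instead.
        writeString(String.format("[%s , %s , %s]",
                firstPosition.getTextPos().getXPosition(),
                firstPosition.getTextPos().getYPosition(),
                text));
    }
}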

Create an object of this custom class and use it to extract the text:

PDFTextStripper pdfStripper = new CustomPDFTextStripper();
// "document" is the PDF file wrapped as a PDDocument object
String text = pdfStripper.getText(document);

The resulting text string contains one entry of the form [xposition , yposition , word] per word, with entries separated by the default word separator.
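Since the stated goal is to detect empty table cells, the bracketed output usually has to be turned back into structured data. The following is only a minimal sketch of one way to do that with plain Java and no PDFBox dependency. The class name WordPositionParser, the Word type and the columnEdges boundaries are hypothetical names introduced for illustration; it assumes the exact "[x , y , word]" format produced above and that each entry is followed by whitespace or the end of the string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: parses "[x , y , word]" entries.
public class WordPositionParser {

    /** One parsed entry: a word plus the x/y of its first character. */
    public static final class Word {
        public final float x;
        public final float y;
        public final String text;

        public Word(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // The word part ends at the first ']' that is followed by whitespace or end of input,
    // so a word that itself contains ']' is still captured whole.
    private static final Pattern ENTRY =
            Pattern.compile("\\[(\\S+) , (\\S+) , (.*?)\\](?=\\s|$)");

    /** Parses every "[x , y , word]" entry found in the extracted text. */
    public static List<Word> parse(String extracted) {
        List<Word> words = new ArrayList<>();
        Matcher m = ENTRY.matcher(extracted);
        while (m.find()) {
            // Float.parseFloat throws NumberFormatException if x or y is not numeric.
            words.add(new Word(Float.parseFloat(m.group(1)),
                               Float.parseFloat(m.group(2)),
                               m.group(3)));
        }
        return words;
    }

    /**
     * Returns the indexes of columns in which no word starts.
     * columnEdges are assumed, caller-supplied x boundaries: column i spans
     * [columnEdges[i], columnEdges[i + 1]).
     */
    public static List<Integer> emptyColumns(List<Word> row, float[] columnEdges) {
        List<Integer> empty = new ArrayList<>();
        for (int i = 0; i + 1 < columnEdges.length; i++) {
            boolean occupied = false;
            for (Word w : row) {
                if (w.x >= columnEdges[i] && w.x < columnEdges[i + 1]) {
                    occupied = true;
                    break;
                }
            }
            if (!occupied) {
                empty.add(i);
            }
        }
        return empty;
    }

    public static void main(String[] args) {
        // Sample input in the format written by CustomPDFTextStripper (values made up).
        List<Word> row = parse("[72.0 , 700.0 , Hello] [310.5 , 700.0 , world]");
        for (Word w : row) {
            System.out.println(w.text + " at (" + w.x + ", " + w.y + ")");
        }
        System.out.println("Empty columns: "
                + emptyColumns(row, new float[] {50f, 150f, 300f, 450f}));
    }
}
```

Only the start of each word is checked against the column boundaries, because the custom stripper records just the first character's position. Grouping words into rows by their y value, and choosing the column boundaries themselves, is left to the caller.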
