Using PDFBox to get location of line of text

Question

I'm using PDFBox to extract information from a pdf, and the information I'm currently trying to find is related to the x-position of the first character in the line. I can't find anything related to how to get that information though. I know pdfbox has a class called TextPosition, but I can't find out how to get a TextPosition object from the PDDocument either. How do I get the location information of a line of text from a pdf?

Recommended answer

In general

To extract text (with or without extra information like positions, colors, etc.) using PDFBox, you instantiate a PDFTextStripper or a class derived from it and use it like this:

PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);

(PDFTextStripper has a number of properties, such as those set via setStartPage and setEndPage, that let you restrict the pages from which text is extracted.)

During the execution of getText, the content streams of the pages in question (and those of form XObjects referenced from those pages) are parsed, and the text drawing commands in them are processed.

If you want to change the text extraction behavior, you have to change how these text drawing commands are processed. Most often you should do that by overriding this method:

/**
 * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
 * and just calls {@link #writeString(String)}.
 *
 * @param text The text to write to the stream.
 * @param textPositions The TextPositions belonging to the text.
 * @throws IOException If there is an error when writing the text.
 */
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{
    writeString(text);
}

If you additionally need to know when a new line starts, you may also want to override

/**
 * Write the line separator value to the output stream.
 * @throws IOException
 *             If there is a problem writing out the lineseparator to the document.
 */
protected void writeLineSeparator( ) throws IOException
{
    output.write(getLineSeparator());
}

writeString can be overridden to channel the text information into separate members (e.g. if you want the result in a more structured form than a mere String), or simply to add some extra information to the result String.

writeLineSeparator can be overridden to trigger some specific output between lines.

There are more methods which can be overridden but you are less likely to need them in general.

"I'm using PDFBox to extract information from a pdf, and the information I'm currently trying to find is related to the x-position of the first character in the line."

This can be implemented as follows (simply adding the information at the start of each line):

PDFTextStripper stripper = new PDFTextStripper()
{
    // True until the first chunk of text on the current line has been written.
    boolean startOfLine = true;

    @Override
    protected void startPage(PDPage page) throws IOException
    {
        // Every page begins with a new line.
        startOfLine = true;
        super.startPage(page);
    }

    @Override
    protected void writeLineSeparator() throws IOException
    {
        // The next writeString call begins a new line.
        startOfLine = true;
        super.writeLineSeparator();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        if (startOfLine && !textPositions.isEmpty())
        {
            // Prefix the line with the x-position of its first character.
            TextPosition firstPosition = textPositions.get(0);
            writeString(String.format("[%s]", firstPosition.getXDirAdj()));
            startOfLine = false;
        }
        super.writeString(text, textPositions);
    }
};

text = stripper.getText(document);

(See ExtractText.java: method extractLineStart, tested by testExtractLineStartFromSampleFile.)
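
The stripper above writes the x-position in square brackets at the start of each line of the result String. If you would rather work with numbers than with an annotated String, one option is to parse that output afterwards. The following is only a minimal sketch using the Java standard library: the class names LineStartParser and LineStart are made up for this illustration and are not part of PDFBox, and the sample text in main is invented. It assumes the "[%s]" prefix format used above and skips lines that carry no such prefix (for example empty lines).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: parses the "[x]text" lines
// produced by the customized stripper into (x, text) pairs.
final class LineStartParser {

    /** One extracted line: the x-position of its first character and its text. */
    static final class LineStart {
        final float x;
        final String text;

        LineStart(float x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    // Matches the "[x]" prefix written by String.format("[%s]", float),
    // followed by the remaining text of the line.
    private static final Pattern PREFIXED_LINE =
            Pattern.compile("^\\[(-?\\d+(?:\\.\\d+)?(?:E-?\\d+)?)\\](.*)$");

    static List<LineStart> parse(String annotatedText) {
        List<LineStart> result = new ArrayList<>();
        // \R matches any line break, whichever line separator the stripper used.
        for (String line : annotatedText.split("\\R")) {
            Matcher m = PREFIXED_LINE.matcher(line);
            if (m.matches()) {
                result.add(new LineStart(Float.parseFloat(m.group(1)), m.group(2)));
            }
            // Lines without a numeric prefix are skipped.
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented sample in the annotated format; real input would be
        // the String returned by stripper.getText(document).
        String sample = "[72.0]First line\n[90.5]  Indented line\n\n[72.0]Last line";
        for (LineStart ls : parse(sample)) {
            System.out.println(ls.x + " -> " + ls.text);
        }
    }
}
```

Parsing the annotated String keeps the stripper itself unchanged. If you need the positions without the round trip through text, collecting them into a member of the stripper inside writeString, as described earlier, avoids the parsing step.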
