PDFBox: Maintaining PDF structure when extracting text


Problem description


I'm trying to extract text from a PDF that is full of tables. In some cases, a column is empty. When I extract the text from the PDF, the empty columns are skipped and replaced by a single whitespace, so my regular expressions can't figure out that there was a column with no information at that spot.

An image (not reproduced here) illustrated the problem: the columns aren't respected in the extracted text.

Sample of my code that extracts the text from the PDF:

PDFTextStripper reader = new PDFTextStripper();
reader.setSortByPosition(true);
reader.setStartPage(page);
reader.setEndPage(page);
String st = reader.getText(document);
List<String> lines = Arrays.asList(st.split(System.getProperty("line.separator")));

How can I maintain the full structure of the original PDF when extracting text from it?

Thanks a lot.

Solution

(This was originally an answer (dated Feb 6 '15) to another question, which the OP deleted along with all its answers. Because of its age, the code is still based on PDFBox 1.8.x, so some changes might be necessary to make it run with PDFBox 2.0.x.)

In comments, the OP showed interest in a solution that extends the PDFBox PDFTextStripper to return text lines that attempt to reflect the PDF file layout, which might help with the question at hand.

A proof-of-concept for that would be this class:

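// Imports for PDFBox 1.8.x (package names differ in 2.0.x)
import java.io.IOException;
import java.lang.reflect.Field;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class LayoutTextStripper extends PDFTextStripper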
{
    public LayoutTextStripper() throws IOException
    {
        super();
    }

    @Override
    protected void startPage(PDPage page) throws IOException
    {
        super.startPage(page);
        cropBox = page.findCropBox();
        pageLeft = cropBox.getLowerLeftX();
        beginLine();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        float recentEnd = 0;
        for (TextPosition textPosition: textPositions)
        {
            String textHere = textPosition.getCharacter();
            if (textHere.trim().length() == 0)
                continue;

            float start = textPosition.getTextPos().getXPosition();
            boolean spacePresent = endsWithWS | textHere.startsWith(" ");

            if (needsWS | spacePresent | Math.abs(start - recentEnd) > 1)
            {
                int spacesToInsert = insertSpaces(chars, start, needsWS & !spacePresent);

                for (; spacesToInsert > 0; spacesToInsert--)
                {
                    writeString(" ");
                    chars++;
                }
            }

            writeString(textHere);
            chars += textHere.length();

            needsWS = false;
            endsWithWS = textHere.endsWith(" ");
            try
            {
                recentEnd = getEndX(textPosition);
            }
            catch (IllegalArgumentException | IllegalAccessException | NoSuchFieldException | SecurityException e)
            {
                throw new IOException("Failure retrieving endX of TextPosition", e);
            }
        }
    }

    @Override
    protected void writeLineSeparator() throws IOException
    {
        super.writeLineSeparator();
        beginLine();
    }

    @Override
    protected void writeWordSeparator() throws IOException
    {
        needsWS = true;
    }

    void beginLine()
    {
        endsWithWS = true;
        needsWS = false;
        chars = 0;
    }

    int insertSpaces(int charsInLineAlready, float chunkStart, boolean spaceRequired)
    {
        int indexNow = charsInLineAlready;
        int indexToBe = (int)((chunkStart - pageLeft) / fixedCharWidth);
        int spacesToInsert = indexToBe - indexNow;
        if (spacesToInsert < 1 && spaceRequired)
            spacesToInsert = 1;

        return spacesToInsert;
    }

    float getEndX(TextPosition textPosition) throws IllegalArgumentException, IllegalAccessException, NoSuchFieldException, SecurityException
    {
        Field field = textPosition.getClass().getDeclaredField("endX");
        field.setAccessible(true);
        return field.getFloat(textPosition);
    }

    public float fixedCharWidth = 3;

    boolean endsWithWS = true;
    boolean needsWS = false;
    int chars = 0;

    PDRectangle cropBox = null;
    float pageLeft = 0;
}

It is used like this:

PDDocument document = PDDocument.load(PDF);

LayoutTextStripper stripper = new LayoutTextStripper();
stripper.setSortByPosition(true);
stripper.fixedCharWidth = charWidth; // e.g. 5

String text = stripper.getText(document);

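fixedCharWidth is the assumed character width: each character written to the output is treated as covering that many units of horizontal page space. Depending on the writing in the PDF in question, a different value might be more appropriate. In my sample documents, values from 3 to 6 were of interest.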
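To make the role of fixedCharWidth more concrete, here is a minimal, PDFBox-free sketch of the same arithmetic that insertSpaces uses. The Chunk class, the layoutLine helper, and the coordinates are hypothetical stand-ins for illustration only; they are not part of PDFBox or of the class above.

```java
import java.util.ArrayList;
import java.util.List;

public class FixedWidthLayoutDemo {
    /** Hypothetical stand-in for a TextPosition: a piece of text and the x coordinate where it starts. */
    static final class Chunk {
        final float x;
        final String text;
        Chunk(float x, String text) { this.x = x; this.text = text; }
    }

    /** Same arithmetic as insertSpaces: target character index minus characters already written. */
    static int spacesToInsert(int charsInLineAlready, float chunkStart, float pageLeft,
                              float fixedCharWidth, boolean spaceRequired) {
        int indexToBe = (int) ((chunkStart - pageLeft) / fixedCharWidth);
        int spaces = indexToBe - charsInLineAlready;
        if (spaces < 1 && spaceRequired) {
            spaces = 1; // never glue two separate chunks together
        }
        return spaces;
    }

    /** Builds one output line by padding each chunk out to the index derived from its x coordinate. */
    static String layoutLine(List<Chunk> chunks, float pageLeft, float fixedCharWidth) {
        StringBuilder line = new StringBuilder();
        for (Chunk chunk : chunks) {
            boolean spaceRequired = line.length() > 0;
            int spaces = spacesToInsert(line.length(), chunk.x, pageLeft, fixedCharWidth, spaceRequired);
            for (; spaces > 0; spaces--) {
                line.append(' ');
            }
            line.append(chunk.text);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Row with all three cells filled (x coordinates are made up for the example)
        List<Chunk> fullRow = new ArrayList<>();
        fullRow.add(new Chunk(0f, "A1"));
        fullRow.add(new Chunk(50f, "B1"));
        fullRow.add(new Chunk(100f, "C1"));

        // Row whose middle cell is empty
        List<Chunk> gapRow = new ArrayList<>();
        gapRow.add(new Chunk(0f, "A2"));
        gapRow.add(new Chunk(100f, "C2"));

        System.out.println("[" + layoutLine(fullRow, 0f, 5f) + "]");
        System.out.println("[" + layoutLine(gapRow, 0f, 5f) + "]");
    }
}
```

With an assumed width of 5, the text starting at x = 100 lands at character index 20 in both rows, whether or not the middle cell is present. A smaller width spreads the columns further apart; a width that is too large makes neighbouring cells collide, at which point only the single required space separates them.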

It essentially emulates the analogous solution for iText in this answer. Results differ a bit, though, because iText's text extraction forwards text chunks while PDFBox's text extraction forwards individual characters.

Please be aware that this is merely a proof-of-concept. In particular, it does not take any rotation into account.
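Once the extracted lines keep their horizontal layout, an empty column shows up as a run of blanks at a predictable offset, so the cells could be cut out by position instead of relying on a regular expression alone. The following is only a sketch under that assumption: ColumnSlicer and the column start offsets are hypothetical and would have to be determined from the actual output for the document at hand.

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnSlicer {
    /**
     * Cuts a layout-preserved line at the given ascending start indices.
     * A range that contains only blanks yields an empty string.
     */
    static List<String> slice(String line, int[] columnStarts) {
        List<String> cells = new ArrayList<>();
        for (int i = 0; i < columnStarts.length; i++) {
            int from = Math.min(columnStarts[i], line.length());
            int to = (i + 1 < columnStarts.length)
                    ? Math.min(columnStarts[i + 1], line.length())
                    : line.length();
            cells.add(line.substring(from, to).trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        int[] starts = {0, 10, 20}; // hypothetical column offsets, in characters

        // Two sample lines shaped like layout-preserved output; the second has an empty middle cell
        String fullLine = String.format("%-10s%-10s%s", "A1", "B1", "C1");
        String gapLine = String.format("%-10s%-10s%s", "A2", "", "C2");

        System.out.println(slice(fullLine, starts));
        System.out.println(slice(gapLine, starts));
    }
}
```

For the second line the middle cell comes back as an empty string rather than disappearing. Because real glyph widths vary, text can spill over an assumed boundary, so the offsets and the fixedCharWidth value usually need tuning together.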
