Apache PDFBox Remove Spaces between characters

Problem description

We are using PDFBox to extract text from PDFs.

The text of some PDFs can't be extracted correctly. The following image shows part of the PDF:

After text extraction we get the following text:
3, 8 5 EU R 1 Netto 38,50 EUR 4,00
(Spaces are added between ',' and '8')

Here is our code:

            PDDocument pdf = PDDocument.load(reuseableInputStream);
            PDFTextStripper pdfStripper = new PDFTextStripper();
            pdfStripper.setSortByPosition(true);
            String text = pdfStripper.getText(pdf);

We tried to play with the PDFTextStripper attributes 'AverageCharTolerance' and 'SpacingTolerance' with no positive effect.

The alternative library 'iText' extracts the text correctly, without spaces between the characters. But we can't use it because of license problems.

Any ideas? Thank you.

EDIT: We are using version 1.8.9. We also tried the snapshot version 2.0.0, with no effect.

Solution

The cause

Inspecting the file provided by the OP, it turns out that the issue is caused by extra spaces actually being there! There are multiple strings drawn from the same starting position; at every position at most one of those strings has a non-space character. Thus, the PDF viewer output looks good, but PDFBox as a text extractor tries to make use of all characters found, including those extra space characters.

The behavior can be reproduced using a PDF with this content stream with F0 being Courier:

BT
/F0 9 Tf
100 500 Td
(             2                                                                  Netto        5,00 EUR 3,00) Tj
0 0 Td
(                2882892  ENERGIZE LR6 Industrial                     2,50 EUR 1) Tj
ET

In a PDF viewer this looks like this:

Copy & paste from Adobe Reader results in

2 2 8 8 2 8 9 2 E N E R G I Z E L R 6 I n d u s t r i a l 2 , 5 0 E U R 1 Netto 5,00 EUR 3,00

Regular extraction using PDFBox results in

             2    2 8 8 2 89 2    E N E RG  IZ  E  L R 6  I n du s t  ri  a l                      2 ,5  0  EU  R  1 Netto        5,00 EUR 3,00

Thus, PDFBox is not the only tool that has problems here. The two outputs look different, but the extra spaces are a problem either way.
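To make the effect concrete, here is a minimal sketch in plain Java that does not use PDFBox at all. It assumes a monospaced font, so every character of a string lands at start + index * width, and it models a position-sorting extractor as a simple stable sort on the x coordinate. The class `OverlaySketch`, its `Glyph` type and the `overlay` helper are hypothetical names made up for this illustration; the real spacing logic in PDFBox is more involved.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class OverlaySketch {
    /** Hypothetical stand-in for a positioned character (not PDFBox's TextPosition). */
    static final class Glyph {
        final double x;
        final char c;
        Glyph(double x, char c) { this.x = x; this.c = c; }
    }

    /**
     * Draws all strings from the same start position in a monospaced font
     * and reads every glyph, spaces included, from left to right.
     */
    static String overlay(double start, double charWidth, String... strings) {
        List<Glyph> glyphs = new ArrayList<>();
        for (String s : strings) {
            for (int i = 0; i < s.length(); i++) {
                glyphs.add(new Glyph(start + i * charWidth, s.charAt(i)));
            }
        }
        // Stable sort: glyphs sharing a position keep their drawing order.
        glyphs.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: 5.4 is the advance of a 9 pt Courier glyph (0.6 * 9).
        String merged = overlay(100, 5.4, "  2      ", "    2882 ");
        System.out.println("[" + merged + "]");
    }
}
```

Each column contributes one glyph per string, so the digits of the second string come out as "2 8 8 2", separated by the space glyphs the first string placed at the same columns.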

I would propose telling the producer of those PDFs that they are difficult to post-process, even for widely-used software like Adobe Reader.

A work-around

To extract something sensible from this, we have to somehow ignore the (actually existing!) extra spaces. Since there is no way to know offhand which spaces are needed and which are not, we simply remove them all and hope PDFBox adds spaces where necessary:

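import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

// PDFBox 1.8.x API; in 2.x these classes live in org.apache.pdfbox.text
// and TextPosition.getCharacter() is replaced by getUnicode().
String extractNoSpaces(PDDocument document) throws IOException
{
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void processTextPosition(TextPosition text)
        {
            // Drop glyphs that consist only of whitespace; pass on everything else.
            String character = text.getCharacter();
            if (character != null && character.trim().length() != 0)
                super.processTextPosition(text);
        }
    };
    // Sort by position so the overlapping strings are merged in reading order.
    stripper.setSortByPosition(true);
    return stripper.getText(document);
}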

(ExtractWithoutExtraSpaces.java)

Using this method with the test document we get:

2 2882892 ENERGIZE LR6 Industrial 2,50 EUR 1 Netto 5,00 EUR 3,00
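The idea behind the work-around can be sketched without PDFBox as well. The following plain Java model drops every whitespace glyph and then re-inserts a single space wherever the horizontal gap between two remaining glyphs exceeds half a character width. The class `NoSpaceSketch`, the `extract` helper and the half-width threshold are assumptions chosen for this illustration; PDFBox's own gap detection uses different, configurable tolerances.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class NoSpaceSketch {
    /** Hypothetical stand-in for a positioned character (not PDFBox's TextPosition). */
    static final class Glyph {
        final double x;
        final char c;
        Glyph(double x, char c) { this.x = x; this.c = c; }
    }

    /**
     * Overlays the strings at the same start position, ignores whitespace
     * glyphs, and rebuilds word gaps purely from the glyph positions.
     */
    static String extract(double start, double charWidth, String... strings) {
        List<Glyph> glyphs = new ArrayList<>();
        for (String s : strings) {
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                // Same idea as the trim() check in the work-around above.
                if (!Character.isWhitespace(c)) {
                    glyphs.add(new Glyph(start + i * charWidth, c));
                }
            }
        }
        glyphs.sort(Comparator.comparingDouble((Glyph g) -> g.x));

        StringBuilder sb = new StringBuilder();
        double previousEnd = Double.NaN;
        for (Glyph g : glyphs) {
            // Assumed heuristic: a gap wider than half a character is a word break.
            if (!Double.isNaN(previousEnd) && g.x - previousEnd > charWidth * 0.5) {
                sb.append(' ');
            }
            sb.append(g.c);
            previousEnd = g.x + charWidth;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract(100, 5.4, "  2            Netto", "    2882 LR6"));
    }
}
```

Because the spaces are now derived from positions alone, the space characters embedded in the overlapping strings no longer split the digits or words of the other string.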

Different text extractors

"The alternative library 'iText' extracts the text correctly, without spaces between the characters"

This is due to iText extracting text string by string, not character by character. This procedure has its own perils, but in this case it results in something more usable out of the box.
