PDFBox adding white spaces within words

Problem description

When I try to extract text from my PDF files, PDFBox seems to insert white spaces inside several words at random.

I am using pdfbox-app-1.6.0.jar (the latest version) on the sample file in the Downloads section of this page: http://www.sheffield.gov.uk/roads/children/parents/6-11/pedestrian-training

I've tried several other PDF files, and the same thing happens on several pages.

I run the following:

java -jar pdfbox-app-1.6.0.jar ExtractText -force -console ~/Desktop/ped\ training\ pdf.pdf

When this is run on the downloaded file, the console output contains wrongly inserted spaces, for example: "• If ch ildren are able to walk to schoo l safely this could reduce the congestion. "

"• Develops good hab its for later life."

"www.sheff ield.gov.uk"

"Think Ahead!, wh ich is based on the"

and so on.

As you can see, several of the words above contain spaces for no reason I can fathom.

I am on Ubuntu, running Sun's JDK 1.6.

I've tried this on several different PDF files and searched the forums for a solution. There were similar bugs reported, but all of them seem to have been resolved.

If you can help, or if you have the same problem, please comment. This is causing a big problem in indexing the content properly for search.

Recommended answer

Unfortunately there is currently no easy solution for this.

Internally PDF documents simply contain instructions like "place characters 'abc' in position X" and "place characters 'def' in position Y", and PDFBox tries to reason whether the resulting extracted text should be "abc def" or "abcdef" based on things like the distance between X and Y. These heuristics are generally pretty accurate, but as you can see they don't always produce the correct result.
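
To make the distance-based reasoning concrete, here is a minimal sketch in Java. It is not PDFBox's actual implementation: the class GapHeuristicSketch, the Chunk type, and the spaceThreshold parameter are all hypothetical names invented for this illustration, and the coordinates are made up. It only shows how a fixed gap threshold can turn a slightly wide gap inside a word into a spurious space.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is NOT how PDFBox is implemented.
class GapHeuristicSketch {

    /** A run of characters placed at a horizontal position, as a PDF would describe it. */
    static final class Chunk {
        final String text;
        final double x;      // left edge of the run
        final double width;  // rendered width of the run

        Chunk(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins chunks from left to right, inserting a space whenever the gap between
     * the previous chunk's right edge and the next chunk's left edge exceeds
     * spaceThreshold (an assumed tuning parameter, in the same units as x).
     */
    static String join(List<Chunk> chunks, double spaceThreshold) {
        StringBuilder out = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : chunks) {
            if (prev != null) {
                double gap = c.x - (prev.x + prev.width);
                if (gap > spaceThreshold) {
                    out.append(' ');
                }
            }
            out.append(c.text);
            prev = c;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "ch" and "ildren" are one word, but the second run sits 1.5 units
        // to the right of where the first one ends (made-up numbers).
        List<Chunk> line = Arrays.asList(
                new Chunk("If", 0.0, 8.0),
                new Chunk("ch", 11.0, 10.0),
                new Chunk("ildren", 22.5, 28.0));

        System.out.println(join(line, 1.0)); // prints "If ch ildren"
        System.out.println(join(line, 2.0)); // prints "If children"
    }
}
```

No single threshold is right for every document: a value large enough to keep "children" together in one font may glue separate words together in another, which is why the result is a heuristic rather than a guarantee.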

One way to improve the quality of the extracted text is to try a dictionary lookup on each extracted word or token. If the lookup fails, try combining the token with the next one. If a dictionary lookup on the combined token succeeds, then it's fairly likely that the text extractor has mistakenly added an extra space inside the word. Unfortunately such a feature does not yet exist in PDFBox. See https://issues.apache.org/jira/browse/PDFBOX-1153 for the feature request filed for this. Patches welcome!
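
Until such a feature exists, the dictionary idea can be applied as a post-processing step on the extracted text. The following is a rough sketch under stated assumptions: the class DictionaryMergeSketch and the method mergeSplitWords are hypothetical names, the dictionary is a tiny hand-written set standing in for a real word list, and tokens are assumed to be already split on whitespace. It uses no PDFBox API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical post-processing sketch; not part of PDFBox.
class DictionaryMergeSketch {

    /**
     * Walks the tokens left to right. If a token is not in the dictionary but
     * the token joined with its successor is, the pair is emitted as one word.
     * The dictionary is assumed to contain lower-case entries.
     */
    static List<String> mergeSplitWords(List<String> tokens, Set<String> dictionary) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String token = tokens.get(i);
            boolean known = dictionary.contains(token.toLowerCase(Locale.ROOT));
            if (!known && i + 1 < tokens.size()) {
                String combined = token + tokens.get(i + 1);
                if (dictionary.contains(combined.toLowerCase(Locale.ROOT))) {
                    out.add(combined);
                    i += 2; // both halves consumed
                    continue;
                }
            }
            out.add(token);
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in dictionary; a real one would be loaded from a word list.
        Set<String> dictionary = new HashSet<>(Arrays.asList(
                "if", "children", "are", "able", "to", "walk", "school", "safely"));
        List<String> tokens = Arrays.asList(
                "If", "ch", "ildren", "are", "able", "to", "walk", "to", "schoo", "l", "safely");

        // prints "If children are able to walk to school safely"
        System.out.println(String.join(" ", mergeSplitWords(tokens, dictionary)));
    }
}
```

This sketch only repairs a split when the first fragment fails the lookup. A split such as "the re" would pass through unchanged because "the" is itself a word, and punctuation, URLs, and proper nouns would need separate handling.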
