PDFBox adding white spaces within words


Question

When I try to extract text from my PDF files, PDFBox seems to insert white spaces at random inside several words.

I am using pdfbox-app-1.6.0.jar (the latest version) on the sample file in the Downloads section of this page: http://www.sheffield.gov.uk/roads/children/parents/6-11/pedestrian-training

I have tried several other PDF files, and the same thing happens on several pages.

I run the following command:

java -jar pdfbox-app-1.6.0.jar ExtractText -force -console ~/Desktop/ped\ training\ pdf.pdf

on the downloaded file, and you will see spaces wrongly inserted in the console output, for example: "• If ch ildren are able to walk to schoo l safely this could reduce the congestion. "

"• Develops good hab its for later life."

"www.sheff ield.gov.uk"

"Think Ahead!, wh ich is based on the"

and so on.

As you can see, several of the words above have spaces inside them for no reason I can fathom.

I am on Ubuntu, running Sun's JDK 1.6.

I have tried this on several different PDF files and searched the forums for a solution. There were similar bugs reported, but all of them seemed to have been resolved.

If you can help, or if you have the same problem, please comment. This is causing a big problem for indexing the content properly for search.

Recommended answer

Unfortunately, there is currently no easy solution for this.

Internally, PDF documents simply contain instructions like "place characters 'abc' at position X" and "place characters 'def' at position Y". PDFBox has to infer whether the extracted text should be "abc def" or "abcdef" from things like the distance between X and Y. These heuristics are generally pretty accurate, but as you can see, they don't always produce the correct result.
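To make the idea concrete, here is a minimal Java sketch of a gap-based heuristic. It is not PDFBox's actual implementation: the Chunk type, the join method and the tolerance parameter are hypothetical names invented for illustration, and the positions in main are made-up numbers.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative gap-based spacing heuristic; NOT PDFBox's real algorithm. */
final class GapSpacingSketch {

    /** Hypothetical type: a run of characters placed at x with a rendered width. */
    static final class Chunk {
        final String text;
        final double x;
        final double width;

        Chunk(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins chunks that sit on one line, ordered left to right. A space is
     * emitted when the gap to the previous chunk is larger than
     * tolerance * (average character width of the previous chunk).
     * Assumes every chunk has non-empty text.
     */
    static String join(List<Chunk> chunks, double tolerance) {
        StringBuilder out = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : chunks) {
            if (prev != null) {
                double gap = c.x - (prev.x + prev.width);
                double avgCharWidth = prev.width / prev.text.length();
                if (gap > tolerance * avgCharWidth) {
                    out.append(' ');
                }
            }
            out.append(c.text);
            prev = c;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up layout: "ch" and "ildren" are 1.5 units apart (e.g. kerning),
        // "ildren" and "are" are 5 units apart (a real word gap).
        List<Chunk> line = Arrays.asList(
                new Chunk("ch", 0.0, 10.0),
                new Chunk("ildren", 11.5, 30.0),
                new Chunk("are", 46.5, 15.0));
        System.out.println(join(line, 0.5)); // children are
        System.out.println(join(line, 0.2)); // ch ildren are
    }
}
```

The same layout gives a correct or a broken result depending only on the threshold, which is why a single tolerance value cannot be right for every font and document.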

One way to improve the quality of the extracted text is to do a dictionary lookup on each extracted word or token. If the lookup fails, try combining the token with the next one. If a dictionary lookup on the combined token succeeds, it is fairly likely that the text extractor mistakenly added an extra space inside the word. Unfortunately, such a feature does not yet exist in PDFBox. See https://issues.apache.org/jira/browse/PDFBOX-1153 for the feature request filed for this. Patches welcome!
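The dictionary approach can be sketched as a post-processing step on the extracted text. This is a rough illustration under stated assumptions, not a PDFBox feature: the class and method names are hypothetical, the dictionary is a tiny hand-built set of lowercase words, and tokens are split on whitespace only, so punctuation attached to a word is not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative dictionary-based repair of wrongly split words (hypothetical helper). */
final class DictionaryMergeSketch {

    /** Case-insensitive lookup; the dictionary is assumed to hold lowercase words. */
    static boolean known(String token, Set<String> dictionary) {
        return dictionary.contains(token.toLowerCase(Locale.ROOT));
    }

    /**
     * Walks the tokens left to right. If a token is not in the dictionary but
     * the token joined with its successor is, the two are merged into one word.
     */
    static String repair(String text, Set<String> dictionary) {
        String[] tokens = text.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            String current = tokens[i];
            if (!known(current, dictionary) && i + 1 < tokens.length) {
                String merged = current + tokens[i + 1];
                if (known(merged, dictionary)) {
                    out.add(merged);
                    i += 2; // consume both fragments
                    continue;
                }
            }
            out.add(current);
            i++;
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        // Toy dictionary; a real one would be loaded from a word list.
        Set<String> dictionary = new HashSet<>(Arrays.asList(
                "if", "children", "are", "able", "to", "walk", "school", "safely"));
        System.out.println(repair("If ch ildren are able to walk to schoo l safely", dictionary));
        // If children are able to walk to school safely
    }
}
```

This only catches splits where the first fragment is not itself a dictionary word. A split such as "the re" would be left alone because "the" passes the lookup, so the approach reduces the problem without eliminating it.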
