PdfBox text extraction not working properly

Question

PDFTextStripper stripper = new PDFTextStripper();
PDDocument document = PDDocument.load(inputStream);
String text = stripper.getText(document);

Extracted text: http://pastebin.com/BXFfMy0z

Problem PDF: http://www.iwb.ch/media/Unternehmen/Dokumente/inserat_leiter_pm.pdf

What can I do to extract the correct text from this PDF file?

Recommended answer

In addition to @karthik27's answer:

Adobe Reader is fairly good at text extraction and can therefore generally be used as an indicator of whether text extraction from a given document is possible at all.

Thus, whenever you have a document your own text extraction cannot handle, open it in the Reader and try copying and pasting from it. If that results in garbage, the document most likely is not authored properly for text extraction, either by mistake or by design.
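The manual copy-and-paste check can be roughly approximated in code once text has been extracted. The following is only a sketch of one possible heuristic, not something PDFBox provides: the class name `ExtractionQuality`, its methods, and the 0.3 threshold in the usage note are all hypothetical. It measures the share of control, invisible formatting, private-use, and replacement characters in the extracted string.

```java
// Hypothetical helper (not part of PDFBox): estimates whether extracted
// text consists largely of invisible or special characters.
public class ExtractionQuality {

    /** Returns the fraction of code points unlikely to occur in readable text. */
    static double suspiciousRatio(String text) {
        if (text.isEmpty()) {
            return 0.0; // nothing to judge; callers may treat empty output separately
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    /** Flags control, format, private-use, surrogate, unassigned and U+FFFD code points. */
    static boolean isSuspicious(int cp) {
        if (cp == '\n' || cp == '\r' || cp == '\t') {
            return false; // ordinary layout whitespace
        }
        if (cp == 0xFFFD) {
            return true; // Unicode replacement character
        }
        int type = Character.getType(cp);
        return type == Character.CONTROL
                || type == Character.FORMAT
                || type == Character.PRIVATE_USE
                || type == Character.SURROGATE
                || type == Character.UNASSIGNED;
    }

    /** The threshold is an arbitrary assumption and should be tuned per corpus. */
    static boolean looksLikeGarbage(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }
}
```

One might call `ExtractionQuality.looksLikeGarbage(text, 0.3)` on the string returned by `stripper.getText(document)`. The heuristic only catches garbage made of invisible or special characters; a font whose glyphs map to ordinary but wrong letters would pass unnoticed, so a visual comparison against the rendered page remains the more reliable test.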

In the case of your document, copying and pasting from Adobe Reader gives me a semi-random collection of invisible and special characters, just like the output you got from PDFBox, i.e. garbage. Most likely, therefore, nothing short of OCR will allow text extraction from it.
