PdfBox text extraction not working properly
Question
PDFTextStripper stripper = new PDFTextStripper();
PDDocument document = PDDocument.load(inputStream);
String text = stripper.getText(document);
Extracted text: http://pastebin.com/BXFfMy0z
Problem PDF: http://www.iwb.ch/media/Unternehmen/Dokumente/inserat_leiter_pm.pdf
What can I do to extract the correct text from this PDF file?
Recommended answer
In addition to @karthik27's answer:
Adobe Reader is fairly good at text extraction and can therefore generally be used as an indicator of whether text extraction from a given document is possible at all.
Thus, whenever you have a document that your own text extraction cannot handle, open it in Adobe Reader and try copying and pasting from it. If that results in garbage, the document most likely was not authored properly for text extraction, either by mistake or by design.
In the case of your document, copying and pasting from Adobe Reader gives me a semi-random collection of invisible and special characters, just like the output you got from PDFBox, i.e. garbage. Most likely, therefore, nothing short of OCR will allow text extraction from it.
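The manual copy-and-paste check can be roughly approximated in code by measuring how much of an extracted string is ordinary readable text. The following is only a sketch under stated assumptions: the class ExtractionSanityCheck, its methods, and the 0.7 threshold are hypothetical names and values chosen for illustration, not part of PDFBox, and the heuristic (share of letters, digits, and whitespace) is a crude stand-in for human judgment. The input is assumed to be the String returned by stripper.getText(document).

```java
/**
 * Illustrative heuristic (not part of PDFBox): estimates whether text
 * extracted from a PDF is mostly readable or mostly garbage.
 */
final class ExtractionSanityCheck {

    private ExtractionSanityCheck() {}

    /**
     * Returns the fraction of code points that are letters, digits,
     * or whitespace. Null or empty input yields 0.0.
     */
    static double readableRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int readable = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp); // step over surrogate pairs correctly
            total++;
            if (Character.isLetterOrDigit(cp) || Character.isWhitespace(cp)) {
                readable++;
            }
        }
        return (double) readable / total;
    }

    /**
     * Flags text whose readable share falls below the given threshold.
     * The threshold is an assumption and needs tuning per document set.
     */
    static boolean looksLikeGarbage(String text, double minReadableRatio) {
        return readableRatio(text) < minReadableRatio;
    }
}
```

A caller might pass the extracted text to ExtractionSanityCheck.looksLikeGarbage(text, 0.7) and route flagged documents to an OCR step. Punctuation counts as non-readable here, so normal prose scores slightly below 1.0, which is why the threshold should stay well under that.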