PDF Text Extraction Approach Using OCR


Question

Has anybody attempted to extract text from a PDF using an OCR library and Java? What did you find to be the most reliable library for text extraction? Most of the approaches I've seen (tesseract, GOCR) are C libraries that would require some JNI code to be written.

I'm familiar with PDFBox, which is now an Apache incubator project at version 0.8.x, but its text extraction isn't always accurate. I'm looking for an alternative approach that is somewhat more reliable.

I haven't tried Asprise JavaPDF yet (I'm in the process of doing so), but I wanted to know more about the OCR approach, if it's possible.

Any help would be appreciated.

Recommended Answer

If you have a text-based PDF, I'd strongly recommend PDFTextStream. It's not free, but licensing is reasonable, and it is much, much better than PDFBox. PDFBox chokes on many PDF files generated by newer tools, and is not very consistent about which PDFs it can handle. PDFTextStream handles any PDF I throw at it, including PDFs with embedded PNG images, which PDFBox cannot do.

If you heckle the PDFTextStream folks to add OCR, they may listen up.
