Unable to extract scanned PDF using TesseractOCRConfig in Apache Tika

Views: 187 | Published: 2018/12/19 22:18:55 | Tags: java, parsing, pdf, ocr, apache-tika

Question

My PDF contains scanned images, and I want to extract the text from them.

What I tried: parsing the file with AutoDetectParser, which produced no output.

I followed the solution given in "Apache Tika extract scanned PDF files" and the Apache Tika Jira issue at https://issues.apache.org/jira/browse/TIKA-1729, but I still get an empty string and no error.

My configuration: Windows 7 64-bit, JDK 1.8.0_45.

Any help is welcome.

Recommended answer

Steps to resolve this issue:

Install Tesseract on your system using the Windows installer 'tesseract-ocr-setup-3.05.00dev.exe' from https://sourceforge.net/projects/tesseract-ocr-alt/files/ and set its install location in your Tika config.
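Since the symptom is an empty string with no error, one plausible cause (an assumption, not something the question confirms) is that Tika cannot find the Tesseract binary at the configured location. The sketch below uses only the JDK to check that a folder actually contains the binary before it is passed to setTesseractPath. The class name TesseractLocator, its methods, and the default folder in main are hypothetical and only for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper: verifies that a Tesseract install folder contains the binary.
class TesseractLocator {

    /** Returns the expected binary name for a given os.name value. */
    static String executableName(String osName) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        return windows ? "tesseract.exe" : "tesseract";
    }

    /** Returns the binary inside installDir if it exists as a regular file. */
    static Optional<Path> locate(Path installDir, String osName) {
        Path candidate = installDir.resolve(executableName(osName));
        return Files.isRegularFile(candidate) ? Optional.of(candidate) : Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed default folder; pass the real install folder as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "C:/Program Files (x86)/Tesseract-OCR");
        Optional<Path> binary = locate(dir, System.getProperty("os.name"));
        if (binary.isPresent()) {
            System.out.println("Found Tesseract binary: " + binary.get());
        } else {
            System.out.println("No Tesseract binary in " + dir + "; check the install location.");
        }
    }
}
```

If the check fails, correcting the folder before running the Tika code is likely cheaper than debugging an empty result afterwards.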

Java code:

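import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
TesseractOCRConfig config = new TesseractOCRConfig();
config.setTesseractPath(tPath); // tPath: the Tesseract install location from the step above
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true); // hand embedded page images to the OCR parser
pdfConfig.setExtractUniqueInlineImagesOnly(false); // set to false if the PDF contains multiple images
ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(PDFParserConfig.class, pdfConfig);
// Register the parser itself so that embedded images are parsed recursively.
parseContext.set(Parser.class, parser);
// Assumed completion (not in the original answer): 'stream' is an InputStream opened on the scanned PDF.
parser.parse(stream, handler, new Metadata(), parseContext);
String text = handler.toString();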

Maven dependencies:

<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.13</version>
    </dependency>
    <dependency>
        <groupId>com.levigo.jbig2</groupId>
        <artifactId>levigo-jbig2-imageio</artifactId>
        <version>1.6.5</version>
    </dependency>
    <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-core</artifactId>
        <version>1.3.1</version>
    </dependency>
</dependencies>

I hope this helps. Thanks.