Configure Tesseract with Solr 6.4.1


Question

How do I configure Tika OCR with Solr 6.4.1? I indexed documents including PDFs, images, and MS Office documents, but Tika did not extract text from the images, nor from images embedded in the PDF and MS Office documents. From my research, Tika OCR is what handles this. I am installing tika-app-1.7.jar and Tesseract for that purpose, but I don't know how to configure them with my Solr core.

Recommended answer

You don't need to do anything special. Get the Tesseract OCR package for your distro and install it on the system. Make sure your PATH variable has an entry for the Tesseract home directory, and that the TESSDATA_PREFIX variable is set and also points to the Tesseract home directory. Then restart Solr so that it picks up the new environment, and you're good to go. When you push documents to the index through the /update/extract handler, you should be able to see the OCR component at work: text from images shows up in the extracted content.
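As a rough way to verify the setup described above, the following Python sketch checks the two environment variables and builds a request for the /update/extract handler. It is a sketch under assumptions, not part of the original answer: the function names, the core name "mycore", and the base URL are hypothetical. The extractOnly and extractFormat parameters ask Solr to return the extracted text without indexing the document. The checks only mean something when run with the same environment as the Solr process.

```python
import os
import shutil
import urllib.parse
import urllib.request


def check_tesseract_env(env=None, which=shutil.which):
    """Return a list of problems; an empty list means both checks passed."""
    env = os.environ if env is None else env
    problems = []
    # Tika launches the `tesseract` executable, so it must be resolvable via PATH.
    if which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")
    # The answer also asks for TESSDATA_PREFIX to be set.
    if not env.get("TESSDATA_PREFIX"):
        problems.append("TESSDATA_PREFIX is not set")
    return problems


def build_extract_url(solr_base, core):
    """Build an extract-only URL (nothing is indexed) for a hypothetical core."""
    params = {"extractOnly": "true", "extractFormat": "text", "wt": "json"}
    base = solr_base.rstrip("/")
    return f"{base}/{urllib.parse.quote(core)}/update/extract?{urllib.parse.urlencode(params)}"


def extract_text(file_path, url, content_type):
    """POST a file to Solr and return the raw response body as text."""
    with open(file_path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

A possible use: call check_tesseract_env() first, then extract_text("scan.png", build_extract_url("http://localhost:8983/solr", "mycore"), "image/png") against a running Solr instance. If OCR is working, the response should contain the text from the image; the file name here is only an example.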

By default, Tesseract only ships with the English model. Models for other languages are available as separate downloads.
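To see which language models are actually installed, one option is to list the .traineddata files in the tessdata directory, since Tesseract names each model file after its language code (for example eng.traineddata). The sketch below assumes you know where that directory is; the function name and the example path in the usage note are hypothetical.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the sorted language codes found as <code>.traineddata files."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))
```

For instance, installed_languages("/usr/share/tesseract-ocr/tessdata") would list the codes present in that directory, if it exists on your system. A language you need for OCR that is missing from the list has to be downloaded and placed there.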
