Solr's TikaEntityProcessor not working


Question

I'm trying to get Solr to index a database in which one column is a filename of a PDF document I'd like to index. My configuration looks like this:

<dataConfig>
 <dataSource name="ds-db" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/document_db" user="user" password="password" readOnly="true"/>
 <dataSource name="ds-file" type="BinFileDataSource"/>
 <document name="documents">
   <entity name="document" dataSource="ds-db" query="select * from documents">
     <entity processor="TikaEntityProcessor" url="/some/path/${document.filename}" dataSource="ds-file" format="text">
       <field column="text" />
     </entity>
   </entity>
 </document>
</dataConfig>

I'm using Solr from trunk (as of last week). The import process completes without errors, and it picks up the columns from the database, but not the content from the PDF file. It is definitely trying to access the PDF file, for if I give it an incorrect path name, it complains. It doesn't seem to be attempting to index the PDF, though, as it completes in about 40ms, whereas if I import the PDF via the ExtractingRequestHandler, it takes about 11 seconds to index it.

I've also tried the tika example in example-DIH and that doesn't seem to index anything, either. Am I doing something wrong, or is this just not working yet?

I'm running Java 1.6.0_20 on OSX 10.6.3.

(I should note that I already posted this on the solr-user mailing list and didn't get an answer.)

Recommended answer

Someone on the solr-user mailing list had the answer: http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-tp856965p867572.html

Basically, there's a bug in Apache Tika that was introduced after version 0.6, and it is apparently still present in the 0.8 snapshot that is currently in Solr's trunk. Downloading Tika 0.6 (from http://archive.apache.org/dist/lucene/tika/) and copying tika-core-0.6.jar and tika-parsers-0.6.jar into the path fixed the issue.
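
The answer does not say which "path" the jars were copied into. As a rough sketch only, one way to put the two Tika 0.6 jars on a Solr core's classpath is a <lib> directive in solrconfig.xml. The directory /path/to/tika-0.6 is a hypothetical location, not something from the original post, and this assumes the newer Tika snapshot jars have been removed from wherever Solr currently loads them so that the two versions do not clash.

```xml
<config>
  <!-- Assumption: /path/to/tika-0.6 is a placeholder directory holding the
       downloaded tika-core-0.6.jar and tika-parsers-0.6.jar. -->
  <lib dir="/path/to/tika-0.6" regex="tika-(core|parsers)-0\.6\.jar" />
  <!-- ... rest of the existing solrconfig.xml ... -->
</config>
```

After restarting Solr and rerunning the import, an import that takes seconds per PDF rather than about 40ms would suggest that Tika is now parsing the files.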
