Solr's TikaEntityProcessor not working


Problem description

I'm trying to get Solr to index a database in which one column is a filename of a PDF document I'd like to index. My configuration looks like this:

<dataConfig>
 <dataSource name="ds-db" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/document_db" user="user" password="password" readOnly="true"/>
 <dataSource name="ds-file" type="BinFileDataSource"/>
 <document name="documents">
   <entity name="document" dataSource="ds-db" query="select * from documents">
     <entity processor="TikaEntityProcessor" url="/some/path/${document.filename}" dataSource="ds-file" format="text">
       <field column="text" />
     </entity>
   </entity>
 </document>
</dataConfig>

I'm using Solr from trunk (as of last week). The import process completes without errors, and it picks up the columns from the database, but not the content from the PDF file. It is definitely trying to access the PDF file, for if I give it an incorrect path name, it complains. It doesn't seem to be attempting to index the PDF, though, as it completes in about 40ms, whereas if I import the PDF via the ExtractingRequestHandler, it takes about 11 seconds to index it.

I've also tried the tika example in example-DIH and that doesn't seem to index anything, either. Am I doing something wrong, or is this just not working yet?

I'm running Java 1.6.0_20 on Mac OS X 10.6.3.

(I should note that I already posted this on the solr-user mailing list and didn't get an answer.)

Solution

Someone on the solr-user mailing list had the answer: http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-tp856965p867572.html

Basically, there's a bug in Apache Tika that was introduced after version 0.6, and it is apparently still present in the 0.8 snapshot that is currently in Solr's trunk. Downloading Tika 0.6 (from http://archive.apache.org/dist/lucene/tika/) and copying tika-core-0.6.jar and tika-parsers-0.6.jar into the path fixed the issue.
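
As an alternative to copying the jars by hand, a solrconfig.xml fragment along the following lines might load them from a dedicated directory. This is only a sketch: it assumes your Solr build supports the `<lib>` directive in solrconfig.xml, the directory `/hypothetical/path/tika-0.6` is a made-up location where the two Tika 0.6 jars were unpacked, and it has not been tested against the trunk build described in the question. The newer Tika snapshot jars would presumably also need to be kept off the classpath so the two versions do not conflict.

```xml
<!-- Sketch only: /hypothetical/path/tika-0.6 is an assumed directory,
     not a path taken from the original question or answer. -->
<config>
  <!-- Load tika-core-0.6.jar and tika-parsers-0.6.jar from that directory.
       The regex limits loading to jars whose names match the 0.6 release. -->
  <lib dir="/hypothetical/path/tika-0.6" regex="tika-.*-0\.6\.jar" />

  <!-- ... rest of the existing solrconfig.xml stays unchanged ... -->
</config>
```

After restarting Solr and re-running the import, a noticeably longer import time than the roughly 40 ms mentioned in the question would suggest the PDF content is actually being parsed.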
