Configuring Tika With Solr

Question

I am looking to index rich document types (PDF, DOC, RTF, TXT) into Solr. I found Tika as a solution. I searched the web but didn't find any docs or links on making it work with ExtractingRequestHandler.

Can anyone please provide a step-by-step way to configure Tika with ExtractingRequestHandler?

Thanks in advance :)

Recommended answer

See ExtractingRequestHandler for the integration of Solr with Tika.
Solr ships with a built-in Tika configuration, so you only need to define tika.config if you want to override it.
You can start with the default handler configuration as defined in solrconfig.xml:

<!-- Solr Cell Update Request Handler

   http://wiki.apache.org/solr/ExtractingRequestHandler 

-->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

You can use the following command to index a file into Solr with additional metadata:

curl "http://localhost:8983/solr/update/extract?literal.id=2&literal.title=Test&commit=true&fmap.content=text" -F "myfile=@1.pdf"
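To index several files the same way, the command above can be wrapped in a small script. This is only a minimal sketch: it assumes Solr is reachable at http://localhost:8983/solr as in the command above, and `extract_url` and `index_file` are hypothetical helper names, not part of Solr. The file name is used as the document id and is not URL-encoded, so names with spaces or special characters would need extra handling.

```shell
#!/bin/sh
# Assumption: Solr base URL as used in the curl command above; override via SOLR_URL.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Hypothetical helper: build the /update/extract URL for one document.
# $1 = document id (must already be URL-safe; no encoding is done here)
extract_url() {
  printf '%s/update/extract?literal.id=%s&fmap.content=text&commit=true' "$SOLR_URL" "$1"
}

# Hypothetical helper: post one rich document (pdf, doc, rtf, txt) to Solr.
# $1 = path to the file; its base name becomes the document id
index_file() {
  id=$(basename "$1")
  curl "$(extract_url "$id")" -F "myfile=@$1"
}

# Usage (requires a running Solr instance):
# for f in docs/*.pdf; do index_file "$f"; done
```

Committing after every file keeps the sketch simple; for a large batch it is probably cheaper to drop `commit=true` from the URL and commit once at the end.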

By default, the extracted file content is written to the content field and copied over to the text field; you can override these settings.
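One way to override the mapping is per request, since request parameters take precedence over the handler's `defaults`. The sketch below assumes the same Solr URL as the command above and a hypothetical schema field named `body`; `mapped_url` and `preview_url` are illustrative helper names. The `extractOnly=true` parameter asks the handler to return what Tika extracted without indexing it, which can help when deciding how to map fields.

```shell
#!/bin/sh
# Assumption: Solr base URL as used in the curl command above; override via SOLR_URL.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Hypothetical helper: send Tika's "content" output to a field of your choice.
# $1 = document id (URL-safe), $2 = target field name that exists in your schema
mapped_url() {
  printf '%s/update/extract?literal.id=%s&fmap.content=%s&commit=true' "$SOLR_URL" "$1" "$2"
}

# Hypothetical helper: URL that previews the extraction without indexing anything.
preview_url() {
  printf '%s/update/extract?extractOnly=true' "$SOLR_URL"
}

# Usage (requires a running Solr instance; "body" is an assumed field name):
# curl "$(mapped_url 3 body)" -F "myfile=@1.pdf"
# curl "$(preview_url)" -F "myfile=@1.pdf"
```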
