Using Solr CELL's ExtractingRequestHandler to index/extract files from package formats


Question

Can you use ExtractingRequestHandler and Tika with any of the compressed file formats (zip, tar, gz, etc.) to extract the content for indexing?

I am sending Solr the archived.tar file using curl:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" -H 'Content-type:application/octet-stream' --data-binary "@/home/archived.tar"

When I query the document, the file names inside the archive are indexed as "body_texts", but the content of those files is not extracted or included. This is not the behavior I expected (ref: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#article.tika.example). When I send one of the actual documents inside the archive using the same curl command, the extracted content is stored in the "body_texts" field. Am I missing a step for compressed files?
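To make the request easier to repeat with different files, it can be wrapped in a small script. This is only a minimal sketch and not part of the original question: the function names `extract_url` and `post_archive` are hypothetical, while the host, the `body_texts` field, and the archive path are taken from the command in the question.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of Solr.

# Build the /update/extract URL for a given document id and target field.
extract_url() {
    doc_id=$1
    field=$2
    printf 'http://localhost:8983/solr/update/extract?literal.id=%s&fmap.content=%s&commit=true' \
        "$doc_id" "$field"
}

# Post a file to Solr Cell as a raw binary stream (needs a running Solr).
post_archive() {
    doc_id=$1
    file=$2
    curl "$(extract_url "$doc_id" body_texts)" \
        -H 'Content-type:application/octet-stream' \
        --data-binary "@$file"
}

# Example call (not executed automatically):
#   post_archive doc1 /home/archived.tar
```

Posting the archive and then one of its member files with the same helper makes it easy to compare what ends up in the "body_texts" field in each case.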

I have added all the extraction dependencies as indicated by mat in http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell and am able to successfully extract data from MS Word, PDF, and HTML documents.

I'm using the following library versions: Solr 1.4.0, Solr Cell 1.4.1, and Tika Core 0.4.

Given everything I have read, this version of Tika should support extracting data from all files within a compressed file. Any help or suggestions would be appreciated.

Recommended answer

The short answer: use Solr Cell 1.4.1 with Tika Core 0.6.

The long answer: after a lot of headaches I was able to get this working. I'll answer it both for people using Solr directly and for people using Solr with the Ruby library sunspot (which was my situation).

Here is what I did: I used the https://github.com/tomasc/sunspot_cell plugin to extend sunspot and give it the attachment feature. (Ignore this step if you're not using Ruby/sunspot.)

v1.4.1 works for individual files but not for compressed files, so I had to explore a bit. I downloaded the v1.4.1 codebase from http://lucene.apache.org/solr/ and grabbed dist/apache-solr-cell-1.4.1.jar. Then I had to pull down the Tika libraries from the 1.5 branch: http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.5-dev/contrib/extraction/lib/.

You can download each library individually, or you can use svn to check out the whole branch:

svn co http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.5-dev

Or check out just the library folder:

svn co http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.5-dev/contrib/extraction/lib/
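Once the libraries are checked out, they still have to take the place of the Tika 0.4 jars that Solr loads. The answer does not say where its Solr lib directory lives, so the following is only a sketch under stated assumptions: `replace_tika_jars` is a hypothetical helper, both directory arguments are placeholders for the checked-out lib folder and whatever lib directory your Solr instance uses, and the old jars are assumed to follow the `tika-*.jar` naming pattern.

```shell
#!/bin/sh
# Hypothetical helper: swap the Tika jars in a Solr lib directory.
#   $1 = directory holding the newer jars (e.g. the checked-out lib folder)
#   $2 = lib directory your Solr instance loads jars from (an assumption)
replace_tika_jars() {
    src=$1
    dest=$2
    if [ ! -d "$src" ] || [ ! -d "$dest" ]; then
        echo "replace_tika_jars: missing directory" >&2
        return 1
    fi

    # Remove the old Tika jars so two versions never share the classpath.
    rm -f "$dest"/tika-*.jar

    # Copy every jar from the newer folder (Tika plus its parser dependencies).
    for jar in "$src"/*.jar; do
        [ -e "$jar" ] || continue   # glob matched nothing
        cp "$jar" "$dest"/
    done
}

# Example call (paths are placeholders, not executed automatically):
#   replace_tika_jars ./lib /path/to/solr/lib
```

Solr would presumably need a restart afterwards so that the new jars are picked up.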
