Using Solr CELL's ExtractingRequestHandler to index/extract files from package formats

Problem description

Can you use ExtractingRequestHandler and Tika with any of the compressed file formats (zip, tar, gz, etc) to extract the content out for indexing?

I am sending Solr the archived.tar file using curl:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" -H 'Content-type:application/octet-stream' --data-binary "@/home/archived.tar"

When I query the document, the file names inside the archive are indexed as "body_texts", but the content of those files is not extracted or included. This is not the behavior I expected (ref: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#article.tika.example). When I send one of the actual documents inside the archive using the same curl command, the extracted content is stored in the "body_texts" field. Am I missing a step for compressed files?
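For readers less familiar with the request parameters, the curl call in the question can be broken down as in the sketch below. The names SOLR_URL, extract_url and post_file are hypothetical helpers introduced only for this illustration, and the sketch assumes the document id needs no URL-encoding.

```shell
#!/usr/bin/env bash
# Sketch of the request from the question. SOLR_URL, extract_url and
# post_file are hypothetical names introduced for this example.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Build the ExtractingRequestHandler URL for one document.
#   literal.id=<id>          stores <id> as-is in the "id" field
#   fmap.content=body_texts  maps the extracted text (Tika's "content") to "body_texts"
#   commit=true              commits right after the document is added
# Assumes the id needs no URL-encoding.
extract_url() {
  printf '%s/update/extract?literal.id=%s&fmap.content=body_texts&commit=true' \
    "$SOLR_URL" "$1"
}

# Send a file's raw bytes to Solr: post_file <path> <id>
post_file() {
  curl "$(extract_url "$2")" \
    -H 'Content-type:application/octet-stream' \
    --data-binary "@$1"
}
```

With these helpers, post_file /home/archived.tar doc1 would reproduce the request above, and pointing post_file at a single document taken from the archive corresponds to the case the question reports as working.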

I have added all the extraction dependencies as indicated by mat in http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell and am able to successfully extract data from MS Word, PDF, HTML documents.

I'm using the following library versions: Solr 1.4.0, Solr Cell 1.4.1, and Tika Core 0.4.

Given everything I have read, this version of Tika should support extracting data from all files within a compressed file. Any help or suggestions would be appreciated.

Solution

The short answer: Solr Cell 1.4.1 and Tika Core 0.6.

The long answer: After a lot of headaches I was able to get this working. I'll answer it both for people using Solr directly and for people using Solr with the Ruby library Sunspot (which was my case).

Here is what I did: I used the https://github.com/tomasc/sunspot_cell plugin to extend Sunspot and give it the attachment feature. (Ignore this step if you're not using Ruby/Sunspot.)

Out of the box, v1.4.1 works for individual files but not for compressed files, so I had to explore a bit. I downloaded the v1.4.1 codebase from http://lucene.apache.org/solr/ and grabbed dist/apache-solr-cell-1.4.1.jar. Then I had to pull down the Tika libraries from the 1.5 branch: http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.5-dev/contrib/extraction/lib/

You can download each file individually, or you can use svn to check out the whole branch:

svn co http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.5-dev

Or check out just the library folder:

svn co http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.5-dev/contrib/extraction/lib/
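After checking out the library folder, it can help to confirm which Tika version the jars actually carry, since the short answer depends on Tika Core 0.6 rather than 0.4. The sketch below is one way to do that. It assumes the jars follow the common name-version.jar naming (for example tika-core-0.6.jar), which is an assumption about the checkout rather than something stated in the answer; jar_version and list_tika_jars are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: report the version embedded in Tika jar file names.
# Assumes names of the form <artifact>-<version>.jar, e.g. tika-core-0.6.jar.

# Print the trailing version of one jar file name: jar_version <path>
# If the name has no "-<digits>" suffix, the bare name is printed unchanged.
jar_version() {
  basename "$1" .jar | sed 's/^.*-\([0-9][0-9.]*\)$/\1/'
}

# Print "<jar name> <version>" for every tika-*.jar in a directory:
#   list_tika_jars <lib directory>
list_tika_jars() {
  local jar
  for jar in "$1"/tika-*.jar; do
    [ -e "$jar" ] || continue   # skip the unexpanded pattern when nothing matches
    printf '%s %s\n' "$(basename "$jar")" "$(jar_version "$jar")"
  done
}
```

Running list_tika_jars on the checked-out lib directory should list one line per Tika jar; if the versions shown are still 0.4, the older jars are the ones in place.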
