Using Solr CELL's ExtractingRequestHandler to index/extract files from package formats
Problem description
Can you use ExtractingRequestHandler and Tika with any of the compressed file formats (zip, tar, gz, etc.) to extract their content for indexing?

I am sending Solr the archived.tar file using curl:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" -H 'Content-type:application/octet-stream' --data-binary "@/home/archived.tar"
When I query the document, the file names inside the archive are indexed as "body_texts", but the content of those files is not extracted or included. This is not the behavior I expected. Ref: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#article.tika.example

When I send one of the actual documents inside the archive using the same curl command, the extracted content is stored in the "body_texts" field. Am I missing a step for compressed files?

I have added all the extraction dependencies as indicated by mat in http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell and am able to successfully extract data from MS Word, PDF, and HTML documents.

I'm using the following library versions: Solr 1.4.0, Solr Cell 1.4.1, and Tika Core 0.4.

Given everything I have read, this version of Tika should support extracting data from all files within a compressed file. Any help or suggestions would be appreciated.

The short answer: Solr Cell 1.4.1 and Tika Core 0.6.

The long answer: After a lot of headaches I was able to get this working. I'll answer it both for people using Solr directly and for people using Solr with the Ruby library sunspot (which was my case).

Here is what I did:

I used the https://github.com/tomasc/sunspot_cell plugin to extend sunspot and give it the attachment feature. (Ignore this step if you're not using Ruby/sunspot.)

Solr Cell v1.4.1 works for individual files but not for compressed files, so I had to explore a bit. I downloaded the v1.4.1 codebase from http://lucene.apache.org/solr/ and grabbed dist/apache-solr-cell-1.4.1.jar. Then I had to pull down the Tika libraries from the 1.5 branch: http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.5-dev/contrib/extraction/lib/

You can download each file individually, or you can use svn to check out the whole branch (the first command below) or just the library folder (the second command below):
svn co http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.5-dev
svn co http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.5-dev/contrib/extraction/lib/
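Since the fix hinges on which Tika version Solr actually loads (0.4 versus 0.6), it can help to confirm what is sitting in the library folder after the checkout. The following is only a minimal sketch: the function name tika_jar_versions and the ./lib path are hypothetical, and it assumes the jars follow the plain name-version.jar pattern (for example tika-core-0.6.jar), which a suffixed version such as a snapshot build would not match cleanly.

```shell
#!/bin/sh
# Print "name version" for every Tika jar in the given directory.
# Assumption: jar files are named <name>-<version>.jar, e.g. tika-core-0.6.jar.
tika_jar_versions() {
    lib_dir="$1"
    for jar in "$lib_dir"/tika-*.jar; do
        # Skip the unexpanded glob when the directory holds no Tika jars.
        [ -e "$jar" ] || continue
        base=$(basename "$jar" .jar)
        # Split at the last hyphen: "tika-core-0.6" -> "tika-core" and "0.6".
        name=${base%-*}
        version=${base##*-}
        printf '%s %s\n' "$name" "$version"
    done
}
```

Calling tika_jar_versions ./lib (with ./lib replaced by wherever the jars were checked out or copied) prints one line per Tika jar, so a leftover 0.4 jar next to the 0.6 ones is easy to spot.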