Using ElasticSearch and/or Solr as a datastore for MS Office and PDF documents

Question

I'm currently designing a full-text search system where users perform text queries against MS Office and PDF documents, and the result is a list of the documents that best match the query. The user will then be able to select any returned document and view it in MS Word, Excel, or a PDF viewer.

Can I use ElasticSearch or Solr to import the raw binary documents (i.e. .docx, .xlsx, .pdf files) into its "data store", and then export a document to the user's device on demand for viewing?

Previously, I used MongoDB 2.6.6 to import the raw files into GridFS and the extracted text into a separate collection (the collection had a text index), and that worked fine. However, MongoDB's full-text search is quite basic, so I'm now looking at either Solr or ElasticSearch to perform more complex text searches.

Nick

Recommended answer

Both Solr and Elasticsearch will index the content of the document. Solr has that built in; Elasticsearch needs a plugin. It is easy either way, and both use Tika under the covers.
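As a rough illustration of what "index the content" looks like on each side, here is a minimal Python sketch that only builds the requests and sends nothing. It assumes Solr's extracting request handler is enabled at its conventional /update/extract path, and an Elasticsearch version where the ingest attachment processor is available. The host, the core name, the "data" and "path" field names, and all function names are made up for the example.

```python
import base64
from urllib.parse import urlencode


def solr_extract_url(base_url: str, core: str, doc_id: str) -> str:
    """Build the URL a file would be POSTed to for Tika extraction in Solr."""
    # literal.id supplies the id field; commit=true asks Solr to commit right away.
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    return f"{base_url}/solr/{core}/update/extract?{params}"


def es_attachment_pipeline() -> dict:
    """Body of an ingest pipeline: extract text, then drop the raw bytes."""
    return {
        "description": "Extract text from Office and PDF files",
        "processors": [
            # Runs Tika over the base64 content held in the (assumed) "data" field.
            {"attachment": {"field": "data"}},
            # Removes the base64 payload so the index keeps only the text.
            {"remove": {"field": "data"}},
        ],
    }


def es_index_body(raw: bytes, storage_path: str) -> dict:
    """Document sent through the pipeline; "path" points at the real file."""
    return {
        "data": base64.b64encode(raw).decode("ascii"),
        "path": storage_path,
    }


if __name__ == "__main__":
    print(solr_extract_url("http://localhost:8983", "docs", "report-1"))
    print(es_attachment_pipeline())
```

The pipeline drops the encoded file after extraction on purpose: the index then holds searchable text and a pointer, not the binary.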

Neither of them will store the document itself. You can try making them do it, but they are not designed for it and you will suffer.

Additionally, neither Solr nor Elasticsearch is currently recommended as primary storage. They can do it, but it is not as mission-critical for them as it is for, say, a filesystem implementation.

So, I would recommend having the files somewhere else and using Solr/Elasticsearch for searching only. That's where they shine.
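One way to picture that split, as a sketch only: the originals live in a plain directory keyed by content hash, and the search engine receives just the extracted text plus that key. Everything here is hypothetical (the FileStore class, the field names, the hash-based key), and the text extraction step is assumed to happen elsewhere, for example via Tika.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


class FileStore:
    """Keeps the original binaries on disk, addressed by content hash."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, source: Path) -> str:
        """Copy a file into the store and return the key used to find it again."""
        digest = hashlib.sha256(source.read_bytes()).hexdigest()
        key = digest + source.suffix  # keep the extension so viewers can open it
        shutil.copyfile(source, self.root / key)
        return key

    def path_for(self, key: str) -> Path:
        """Resolve a key from a search hit back to the stored file."""
        return self.root / key


def search_document(key: str, filename: str, text: str) -> dict:
    """What goes to Solr/Elasticsearch: text and a pointer, never the binary."""
    return {"id": key, "filename": filename, "content": text}


if __name__ == "__main__":
    workdir = Path(tempfile.mkdtemp())
    original = workdir / "report.pdf"
    original.write_bytes(b"%PDF-1.4 placeholder bytes")

    store = FileStore(workdir / "store")
    key = store.put(original)
    doc = search_document(key, original.name, "text extracted elsewhere")
    print(doc["id"], "->", store.path_for(doc["id"]))
```

When a user picks a search hit, the application looks up the hit's id in the file store and streams that file back, which is the same role GridFS played in the MongoDB setup described in the question.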
