Searching over documents stored in Hadoop - which tool to use?


Problem description



I'm lost in: Hadoop, Hbase, Lucene, Carrot2, Cloudera, Tika, ZooKeeper, Solr, Katta, Cascading, POI...

When you read about one of them, you can often be sure that each of the other tools is going to be mentioned.

I don't expect you to explain every tool to me - certainly not. If you could help me narrow this set down for my particular scenario, that would be great. So far I'm not sure which of the above will fit, and it looks like (as always) there is more than one way of doing what needs to be done.

The scenario is: 500 GB to ~20 TB of documents stored in Hadoop. Text documents in multiple formats: email, doc, pdf, odt. Metadata about those documents is stored in a SQL db (sender, recipients, date, department, etc.). The main source of documents will be ExchangeServer (emails and attachments), but not exclusively. Now to the search: the user needs to be able to do complex full-text searches over those documents. Basically, he'll be presented with a search-config panel (Java desktop application, not a webapp) - he'll set a date range, document types, senders/recipients, keywords, etc. - fire the search and get the resulting list of documents (and, for each document, info on why it's included in the search results, i.e. which keywords were found in the document).

Which tools should I take into consideration and which not? The point is to develop such a solution with only the minimal required "glue" code. I'm proficient with SQL dbs but quite uncomfortable with Apache and related technologies.

The basic workflow looks like this: ExchangeServer/other source -> conversion from doc/pdf/... -> deduplication -> Hadoop + SQL (metadata) -> build/update an index <- search through the docs (and do it fast) -> present search results

Thank you!

Solution

Going with Solr is a good option. I have used it for a scenario similar to the one you described above. You can use Solr for really huge data, as it is a distributed index server.

But to get the metadata out of all of these document formats, you should be using some other tool. Basically your workflow will be this:

1) Use the Hadoop cluster to store the data.

2) Extract the data in the Hadoop cluster using map/reduce (a map-only job skeleton follows this list).

3) Do document identification (identify the document type).

4) Extract the metadata from these documents (the Tika sketch below covers steps 3 and 4).

5) Index the metadata in the Solr server and store the other ingestion information in the database (see the indexing sketch below).

6) Solr is a distributed index server, so for each ingestion you could create a new shard or index.

7) When a search is required, search across all the indexes (see the query sketch below).

8) Solr supports all the complex searches, so you don't have to build your own search engine.

9) It also does paging for you.
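To make step 2 concrete, here is a minimal map-only Hadoop job skeleton. It assumes the documents have been packed into SequenceFiles of (filename, raw bytes) pairs - a common workaround for HDFS's small-files problem - and that extraction happens inside the mapper. The class and field names are illustrative, not from the original answer.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ExtractJob {

        // Map-only job: (fileName, rawBytes) in, (fileName, extractedText) out.
        public static class ExtractMapper extends Mapper<Text, BytesWritable, Text, Text> {
            @Override
            protected void map(Text fileName, BytesWritable rawBytes, Context context)
                    throws IOException, InterruptedException {
                // Hand rawBytes to Tika here (see the next sketch) and emit the
                // extracted text, or push text + metadata straight to Solr/SQL.
                String extractedText = "..."; // placeholder for the Tika call
                context.write(fileName, new Text(extractedText));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "document-extraction");
            job.setJarByClass(ExtractJob.class);
            job.setMapperClass(ExtractMapper.class);
            job.setNumReduceTasks(0); // map-only: no aggregation step needed
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }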
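Tika (already on the question's tool list) is the usual choice for the document identification and metadata extraction of steps 3 and 4. A minimal sketch, assuming the Tika jars are on the classpath and reading a local file for simplicity - in the real pipeline the bytes would come from the mapper above:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaExtract {

        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();        // detects doc/pdf/odt/eml/... automatically
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no limit on extracted text size
            Metadata metadata = new Metadata();                      // filled by the parser as a side effect

            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                parser.parse(in, handler, metadata); // steps 3 and 4 in one call
            }

            System.out.println("Detected type: " + metadata.get(Metadata.CONTENT_TYPE));
            System.out.println("Author: " + metadata.get("Author")); // may be null, depending on format
            String fullText = handler.toString(); // this is what goes into the Solr full-text field
            System.out.println("Extracted " + fullText.length() + " characters");
        }
    }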
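For step 5, a sketch of indexing one document's metadata plus extracted text with SolrJ. The Solr URL, the collection name "documents", and all field names (id, sender, body, ...) are assumptions - they have to match whatever schema you define:

    import java.util.Date;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexDocument {

        public static void main(String[] args) throws Exception {
            // URL and collection name are placeholders for your setup.
            try (SolrClient solr =
                    new HttpSolrClient.Builder("http://localhost:8983/solr/documents").build()) {

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "msg-00042"); // unique key, e.g. from the SQL metadata db
                doc.addField("sender", "alice@example.com");
                doc.addField("recipients", "bob@example.com");
                doc.addField("sent_date", new Date());
                doc.addField("doc_type", "email");
                doc.addField("body", "full text extracted by Tika ...");

                solr.add(doc);
                solr.commit(); // for bulk ingestion, prefer commitWithin or autoCommit
            }
        }
    }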
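For steps 7-9, the search side with SolrJ. The filter queries map directly onto the asker's search-config panel (date range, document type, sender), and highlighting answers the "why is this document in the results?" requirement. Field names are the same assumed schema as above:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class SearchDocuments {

        public static void main(String[] args) throws Exception {
            try (SolrClient solr =
                    new HttpSolrClient.Builder("http://localhost:8983/solr/documents").build()) {

                SolrQuery query = new SolrQuery("body:(invoice AND contract)"); // the keywords
                // Filter queries: one per control in the search-config panel.
                query.addFilterQuery("sent_date:[2011-01-01T00:00:00Z TO 2011-12-31T23:59:59Z]");
                query.addFilterQuery("doc_type:email");
                query.addFilterQuery("sender:alice@example.com");
                query.setStart(0).setRows(20); // paging, handled by Solr (point 9)
                query.setHighlight(true);      // snippets show why a document matched
                query.addHighlightField("body");

                QueryResponse response = solr.query(query);
                for (SolrDocument doc : response.getResults()) {
                    String id = (String) doc.getFieldValue("id");
                    System.out.println(id + " -> " + response.getHighlighting().get(id));
                }
            }
        }
    }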
