Searching over documents stored in Hadoop - which tool to use?

Question

I'm lost in: Hadoop, HBase, Lucene, Carrot2, Cloudera, Tika, ZooKeeper, Solr, Katta, Cascading, POI...

When you read about one of them, you can usually be sure that each of the other tools will be mentioned.

I don't expect you to explain every tool to me - certainly not. It would be great if you could help me narrow this set down for my particular scenario. So far I'm not sure which of the above will fit, and it looks like (as always) there is more than one way of doing what's to be done.

The scenario is: 500 GB to ~20 TB of documents stored in Hadoop. Text documents in multiple formats: email, doc, pdf, odt. Metadata about those documents (sender, recipients, date, department, etc.) is stored in a SQL DB. The main source of documents will be Exchange Server (emails and attachments), but not only that. Now to the search: the user needs to be able to run complex full-text searches over those documents. Basically, he'll be presented with a search-config panel (a Java desktop application, not a webapp) - he'll set a date range, document types, senders/recipients, keywords, etc. - fire the search and get a resulting list of documents (and, for each document, info on why it is included in the search results, i.e. which keywords were found in it).

Which tools should I take into consideration, and which not? The point is to develop such a solution with only the minimal required "glue" code. I'm proficient with SQL DBs but quite uncomfortable with Apache and related technologies.

The basic workflow looks like this: Exchange Server / other source -> conversion from doc/pdf/... -> deduplication -> Hadoop + SQL (metadata) -> build/update an index <- search through the docs (and do it fast) -> present search results.
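
For the deduplication step, one simple approach would be a content hash over the raw bytes as the dedup key, so that byte-identical copies (e.g. the same attachment arriving in several emails) collapse to one entry. A minimal sketch in Java; the file name is only a placeholder:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class DedupKey {
    // SHA-256 over the raw bytes: byte-identical documents get the same key.
    static String key(byte[] bytes) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(bytes);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // "sample.pdf" is only a placeholder for an incoming document.
        System.out.println(key(Files.readAllBytes(Paths.get("sample.pdf"))));
    }
}
```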

Thanks!

Answer

Going with Solr is a good option. I have used it for a scenario similar to the one you describe above. You can use Solr for really huge data, as it is a distributed index server.
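
To make the search side concrete, here is a minimal SolrJ sketch of the kind of query the search panel described in the question would build. The base URL, the core name (docs) and the field names (id, content, sender, date, doc_type) are assumptions, not a given schema, and the client class is the SolrJ 7/8-era HttpSolrClient (newer SolrJ versions prefer Http2SolrClient). Highlighting is what yields the "why is this document in the results" snippets:

```java
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SearchDocs {
    public static void main(String[] args) throws Exception {
        // Assumed core name and schema; adjust to your setup.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {
            SolrQuery q = new SolrQuery("content:(budget AND report)");              // keywords
            q.addFilterQuery("date:[2011-01-01T00:00:00Z TO 2011-12-31T23:59:59Z]"); // date range
            q.addFilterQuery("doc_type:(pdf OR odt)");                               // document types
            q.addFilterQuery("sender:alice@example.com");                            // sender
            q.setHighlight(true);           // return the fragments that matched
            q.addHighlightField("content");
            q.setRows(20);

            QueryResponse rsp = solr.query(q);
            // Highlighting results are keyed by the unique id of each hit.
            Map<String, Map<String, List<String>>> hl = rsp.getHighlighting();
            for (SolrDocument d : rsp.getResults()) {
                String id = (String) d.getFieldValue("id");
                System.out.println(id + " matched because: " + hl.get(id).get("content"));
            }
        }
    }
}
```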

But to extract the metadata from all of these document formats, you should use some other tool. Basically your workflow will be this:

1) Use a Hadoop cluster to store the data.

2) Extract the data from the Hadoop cluster using MapReduce.

3) Do document identification (identify the document type; see the Tika sketch after this list).

4) Extract metadata from these documents.

5) Index the metadata in the Solr server and store the other ingestion information in the database (see the indexing sketch after this list).

6) The Solr server is a distributed index server, so for each ingestion you could create a new shard or index.

7) When a search is required, search across all the indexes.

8) Solr supports all the complex searches, so you don't have to build your own search engine.

9) It also does paging for you.
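
For steps 3 and 4, Apache Tika (already on the question's list) is a natural fit: its AutoDetectParser identifies the document type and extracts text plus metadata in one pass. A minimal sketch; the input file name is a placeholder and the available metadata fields vary by format:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ExtractWithTika {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();        // detects the document type itself
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no limit on extracted text
        Metadata metadata = new Metadata();

        // "sample.pdf" is a placeholder for a document pulled out of the cluster.
        try (InputStream in = Files.newInputStream(Paths.get("sample.pdf"))) {
            parser.parse(in, handler, metadata);                 // one pass: type, metadata, text
        }

        System.out.println("Detected type: " + metadata.get("Content-Type"));
        for (String name : metadata.names()) {                   // field names differ per format
            System.out.println(name + " = " + metadata.get(name));
        }
        String text = handler.toString();                        // full text, ready for indexing
        System.out.println(text.substring(0, Math.min(200, text.length())));
    }
}
```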
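For step 5, here is the matching SolrJ indexing side - again a minimal sketch using the same assumed core and field names as the query example earlier in this answer; the field values would come from the Tika pass above and from the SQL metadata:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexDoc {
    public static void main(String[] args) throws Exception {
        // Same assumed core and schema as the query sketch above.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "msg-0001");                    // e.g. the deduplication hash
            doc.addField("doc_type", "pdf");                   // from Tika's type detection
            doc.addField("sender", "alice@example.com");       // from the SQL metadata
            doc.addField("date", "2011-06-01T12:00:00Z");
            doc.addField("content", "text extracted by Tika"); // the full extracted text
            solr.add(doc);
            solr.commit();                                     // make it searchable
        }
    }
}
```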
