Searching over documents stored in Hadoop - which tool to use?

Question

I'm lost in: Hadoop, HBase, Lucene, Carrot2, Cloudera, Tika, ZooKeeper, Solr, Katta, Cascading, POI...

When you read about one of them, you can usually be sure that each of the other tools will be mentioned.

I don't expect you to explain every tool to me - certainly not. It would be great if you could help me narrow this set down for my particular scenario. So far I'm not sure which of the above will fit, and it looks like (as always) there is more than one way of doing what's to be done.

The scenario is: 500 GB to ~20 TB of documents stored in Hadoop. Text documents in multiple formats: email, doc, pdf, odt. Metadata about those documents (sender, recipients, date, department, etc.) is stored in a SQL DB. The main source of documents will be Exchange Server (emails and attachments), but not only that. Now to the search: the user needs to be able to run complex full-text searches over those documents. Basically, he'll be presented with a search-config panel (a Java desktop application, not a webapp) - he'll set a date range, document types, senders/recipients, keywords, etc. - fire the search and get a resulting list of documents (and, for each document, info on why it is included in the search results, i.e. which keywords were found in it).

Which tools should I take into consideration, and which not? The point is to develop such a solution with only the minimal required "glue" code. I'm proficient with SQL DBs but quite uncomfortable with Apache and related technologies.

The basic workflow looks like this: Exchange Server / other source -> conversion from doc/pdf/... -> deduplication -> Hadoop + SQL (metadata) -> build/update an index <- search through the docs (and do it fast) -> present search results.
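
For the deduplication step, one simple approach would be a content hash over the raw bytes as the dedup key, so that byte-identical copies (e.g. the same attachment arriving in several emails) collapse to one entry. A minimal sketch in Java; the file name is only a placeholder:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class DedupKey {
    // SHA-256 over the raw bytes: byte-identical documents get the same key.
    static String key(byte[] bytes) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(bytes);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // "sample.pdf" is only a placeholder for an incoming document.
        System.out.println(key(Files.readAllBytes(Paths.get("sample.pdf"))));
    }
}
```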

Thanks!

Answer

Going with Solr is a good option. I have used it for a scenario similar to the one you describe above. You can use Solr for really huge data, as it is a distributed index server.
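
To make the search side concrete, here is a minimal SolrJ sketch of the kind of query the search panel described in the question would build. The base URL, the core name (docs) and the field names (id, content, sender, date, doc_type) are assumptions, not a given schema, and the client class is the SolrJ 7/8-era HttpSolrClient (newer SolrJ versions prefer Http2SolrClient). Highlighting is what yields the "why is this document in the results" snippets:

```java
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SearchDocs {
    public static void main(String[] args) throws Exception {
        // Assumed core name and schema; adjust to your setup.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {
            SolrQuery q = new SolrQuery("content:(budget AND report)");              // keywords
            q.addFilterQuery("date:[2011-01-01T00:00:00Z TO 2011-12-31T23:59:59Z]"); // date range
            q.addFilterQuery("doc_type:(pdf OR odt)");                               // document types
            q.addFilterQuery("sender:alice@example.com");                            // sender
            q.setHighlight(true);           // return the fragments that matched
            q.addHighlightField("content");
            q.setRows(20);

            QueryResponse rsp = solr.query(q);
            // Highlighting results are keyed by the unique id of each hit.
            Map<String, Map<String, List<String>>> hl = rsp.getHighlighting();
            for (SolrDocument d : rsp.getResults()) {
                String id = (String) d.getFieldValue("id");
                System.out.println(id + " matched because: " + hl.get(id).get("content"));
            }
        }
    }
}
```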

But to extract the metadata from all of these document formats, you should use some other tool. Basically your workflow will be this:

1) Use a Hadoop cluster to store the data.

2) Extract the data from the Hadoop cluster using MapReduce.

3) Do document identification (identify the document type; see the Tika sketch after this list).

4) Extract metadata from these documents.

5) Index the metadata in the Solr server and store the other ingestion information in the database (see the indexing sketch after this list).

6) The Solr server is a distributed index server, so for each ingestion you could create a new shard or index.

7) When a search is required, search across all the indexes.

8) Solr supports all the complex searches, so you don't have to build your own search engine.

9) It also does paging for you.
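
For steps 3 and 4, Apache Tika (already on the question's list) is a natural fit: its AutoDetectParser identifies the document type and extracts text plus metadata in one pass. A minimal sketch; the input file name is a placeholder and the available metadata fields vary by format:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ExtractWithTika {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();        // detects the document type itself
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no limit on extracted text
        Metadata metadata = new Metadata();

        // "sample.pdf" is a placeholder for a document pulled out of the cluster.
        try (InputStream in = Files.newInputStream(Paths.get("sample.pdf"))) {
            parser.parse(in, handler, metadata);                 // one pass: type, metadata, text
        }

        System.out.println("Detected type: " + metadata.get("Content-Type"));
        for (String name : metadata.names()) {                   // field names differ per format
            System.out.println(name + " = " + metadata.get(name));
        }
        String text = handler.toString();                        // full text, ready for indexing
        System.out.println(text.substring(0, Math.min(200, text.length())));
    }
}
```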
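For step 5, here is the matching SolrJ indexing side - again a minimal sketch using the same assumed core and field names as the query example earlier in this answer; the field values would come from the Tika pass above and from the SQL metadata:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexDoc {
    public static void main(String[] args) throws Exception {
        // Same assumed core and schema as the query sketch above.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "msg-0001");                    // e.g. the deduplication hash
            doc.addField("doc_type", "pdf");                   // from Tika's type detection
            doc.addField("sender", "alice@example.com");       // from the SQL metadata
            doc.addField("date", "2011-06-01T12:00:00Z");
            doc.addField("content", "text extracted by Tika"); // the full extracted text
            solr.add(doc);
            solr.commit();                                     // make it searchable
        }
    }
}
```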
