Recommendations for a spidering tool to use with Lucene or Solr?

Views: 26 | Posted: 2021/12/30 8:59:02 | Tags: lucene, solr, web-crawler

Question

What is a good crawler (spider) to use against HTML and XML documents (local or web-based) and that works well in the Lucene / Solr solution space? Could be Java-based but does not have to be.

Recommended answer

In my opinion, this is a pretty significant hole which is keeping down the widespread adoption of Solr. The new DataImportHandler is a good first step to import structured data, but there is not a good document ingestion pipeline for Solr. Nutch does work, but the integration between Nutch crawler and Solr is somewhat clumsy.
I've tried every open-source crawler that I can find, and none of them integrates out-of-the-box with Solr.
Keep an eye on OpenPipeline and Apache Tika.
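
To make the missing "ingestion pipeline" concrete, here is a minimal sketch of the glue that typically has to be written by hand: fetch a page, extract its text, and post it to Solr's JSON update handler. It is an illustration only, not a crawler and not something from the answer above. The field names `title` and `content`, the core name `mycore`, and all function names are assumptions; a real schema will differ, and a tool such as Apache Tika would replace the naive HTML text extraction.

```python
import json
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TextExtractor(HTMLParser):
    """Collects the <title> and visible text, skipping script/style bodies."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._skip = 0          # >0 while inside <script> or <style>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip:
            return
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        else:
            self.chunks.append(text)


def html_to_solr_doc(url, html):
    """Turn one HTML page into a dict shaped like a Solr document.

    The field names are hypothetical; they must match your own schema.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"id": url, "title": parser.title, "content": " ".join(parser.chunks)}


def build_update_request(solr_update_url, docs):
    """Build (but do not send) a POST carrying a JSON array of documents."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        solr_update_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_page(url, solr_update_url):
    """Fetch one page and send it to Solr; returns the HTTP status code."""
    with urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    req = build_update_request(solr_update_url, [html_to_solr_doc(url, html)])
    with urlopen(req) as resp:
        return resp.status
```

With a local Solr instance, usage might look like `index_page("http://example.com/", "http://localhost:8983/solr/mycore/update?commit=true")`. Link discovery, politeness rules, deduplication, and non-HTML formats are all left out, which is exactly the gap a dedicated crawler is expected to fill.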
