Recommendations for a spidering tool to use with Lucene or Solr?

Views: 20 | Published: 2022/1/15 13:14:23 | Tags: lucene, solr, web-crawler

Question

What is a good crawler (spider) for HTML and XML documents (local or web-based) that works well in the Lucene/Solr solution space? It could be Java-based, but it does not have to be.

Recommended answer

In my opinion, this is a pretty significant hole, and it is holding back the widespread adoption of Solr. The new DataImportHandler is a good first step for importing structured data, but Solr still lacks a good document ingestion pipeline. Nutch does work, but the integration between the Nutch crawler and Solr is somewhat clumsy.
I've tried every open-source crawler I could find, and none of them integrates with Solr out of the box.
Keep an eye on OpenPipeline and Apache Tika.
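
To make the gap concrete, here is a minimal sketch of the glue a crawler-to-Solr pipeline needs: turn a fetched HTML page into a document, then serialize a batch as the JSON array that Solr's update handler accepts. It is not taken from the answer above and is not how Nutch or Tika do it. The helper names (`TextExtractor`, `html_to_solr_doc`, `build_update_payload`) and the field names `id`, `title`, and `content` are assumptions for illustration and would have to match your own Solr schema. It uses only the Python standard library and does not contact a Solr server.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the <title> text and the visible text of an HTML page."""

    SKIPPED_TAGS = {"script", "style"}  # contents are not visible text

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        chunk = data.strip()
        if not chunk:
            return
        if self._in_title:
            self.title_parts.append(chunk)
        else:
            self.text_parts.append(chunk)


def html_to_solr_doc(url, html):
    """Build one document dict; the field names are hypothetical."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "id": url,  # assumes the URL serves as the unique key
        "title": " ".join(parser.title_parts),
        "content": " ".join(parser.text_parts),
    }


def build_update_payload(docs):
    """Serialize documents as a JSON array for a Solr update request."""
    return json.dumps(docs)
```

A crawler would call `html_to_solr_doc` for each fetched page and POST the result of `build_update_payload` to the core's update handler with a JSON content type. Formats other than HTML (PDF, Office documents) are where a parser library such as Apache Tika would take over from the hand-written extractor.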
