Nutch API advice

Question

I'm working on a project where I need a mature crawler to do some work, and I'm evaluating Nutch for this purpose.

My current needs are relatively straightforward: I need a crawler that can save the data to disk, and I need it to recrawl only the updated resources of a site and skip the parts that have already been crawled.

Does anyone have experience working with the Nutch code directly in Java, rather than via the command line? I would like to start simple: create a crawler (or similar), minimally configure it, and start it, nothing fancy.

Is there an example of this, or some resource I should be looking at? I'm going over the Nutch documentation, but most of it is about the command line, search, and other things.

How usable is the Nutch crawling module without the need to index and search?

Any help is appreciated.

Thanks.

Recommended answer

Nutch is most probably very different from anything you have worked with before. Because it is something like a framework, it has more than a front end for query and search, although Solr seems more powerful than the native Nutch search front end. It also has the crawling part and the indexing part (into a Lucene index).

If you want to use the crawled data for purposes other than search, you will need to develop your own programs and be familiar with Hadoop and MapReduce programming.

I'm not sure what you want to do with your crawl, but it doesn't look like Nutch is the solution.
