How to increase the number of documents fetched by the Apache Nutch crawler


Question

I am using Apache Nutch 2.3 for crawling. There were about 200 URLs in the seed at the start. Now, as time has elapsed, the number of documents being crawled is decreasing, or at most staying the same as at the start.

How can I configure Nutch so that the number of documents crawled increases? Is there any parameter that can be used to control the number of documents? Second, how can I count the number of documents crawled by Nutch per day?

Answer

One crawl cycle consists of four steps: generate, fetch, parse and update DB. For detailed information, read my answer here.
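A minimal sketch of one such cycle using the Nutch 2.x command-line tools (the seed directory urls/ and the -topN value are placeholder assumptions, and exact flags can vary between 2.x releases):

# Inject the seed URLs into the crawldb (run once, before the first cycle)
bin/nutch inject urls/

# One crawl cycle
bin/nutch generate -topN 1000   # build a fetch list of up to 1000 top-scoring URLs
bin/nutch fetch -all            # fetch every generated batch
bin/nutch parse -all            # parse fetched pages and extract their outlinks
bin/nutch updatedb -all         # merge newly discovered URLs back into the crawldb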

Limited URL fetching can be caused by the following factors:

Number of crawl cycles:

If you execute only one crawl cycle, you will get few results, because initially only the URLs injected (seeded) into the crawldb are fetched. On successive crawl cycles, your crawldb is updated with the new URLs extracted from previously fetched pages, so more documents become available to fetch (see the loop sketch below).
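A hedged sketch of running several cycles in a row, simply repeating the four steps of the cycle shown above (ten rounds is an arbitrary illustrative count; Nutch also ships a bin/crawl wrapper script that automates this):

# Run 10 consecutive crawl cycles so that newly discovered URLs get fetched too
for i in $(seq 1 10); do
    bin/nutch generate -topN 1000
    bin/nutch fetch -all
    bin/nutch parse -all
    bin/nutch updatedb -all
done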

topN value:

As mentioned here and here, the topN value causes Nutch to fetch only a limited number of URLs in each cycle. If your topN value is small, you will get fewer pages (see the example below).
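To raise the per-cycle ceiling, pass a larger -topN when generating the fetch list; 50000 below is just an illustrative value, to be tuned to your crawl:

bin/nutch generate -topN 50000   # allow up to 50000 URLs in this cycle's fetch list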

generate.max.count:

The generate.max.count property in your Nutch configuration file, i.e. nutch-default.xml or nutch-site.xml, limits the number of URLs to be fetched from a single domain, as stated here.
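A sketch of the relevant nutch-site.xml entries (the values are illustrative; the companion property generate.count.mode chooses whether the cap applies per host, domain or IP):

<!-- nutch-site.xml: cap how many URLs per domain enter each fetch list -->
<property>
    <name>generate.max.count</name>
    <value>100</value>   <!-- -1 (the default) means no limit -->
</property>
<property>
    <name>generate.count.mode</name>
    <value>domain</value>
</property>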

As for your second question, how to count the number of pages crawled per day: you can read the log files and accumulate, per day, the number of pages fetched.

In Nutch 1.x, the log file is generated in the log folder at NUTCH_HOME/logs/hadoop.log.

You can count the log lines that match a given date and the status "fetching" like this:

cat logs/hadoop.log | grep -i "2016-05-26.*fetching" | wc -l
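To tally fetches for every day at once instead of one date at a time, you can aggregate on the leading date field of each log line (this assumes the default log4j layout, where each line starts with the date):

grep -i fetching logs/hadoop.log | awk '{print $1}' | sort | uniq -c   # fetch count per date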
