Nutch: can anyone explain what the status names in readdb stats indicate?


Problem description


Can anyone explain what the status names in Nutch's readdb stats output indicate?

1. db_redir_perm
2. db_unfetched
3. db_fetched
4. db_gone
5. db_redir_temp
6. db_duplicate
7. db_notmodified

Solution

Nutch stores all the metadata for each URL in a CrawlDatum object, which is persisted under /crawldb/*/part-*/data.

According to the source code of CrawlDatum, each readdb label corresponds to one status constant:

    /** Page was not fetched yet. */
    db_unfetched   --> public static final byte STATUS_DB_UNFETCHED = 0x01;
    /** Page was successfully fetched. */
    db_fetched     --> public static final byte STATUS_DB_FETCHED = 0x02;
    /** Page no longer exists. */
    db_gone        --> public static final byte STATUS_DB_GONE = 0x03;
    /** Page temporarily redirects to other page. */
    db_redir_temp  --> public static final byte STATUS_DB_REDIR_TEMP = 0x04;
    /** Page permanently redirects to other page. */
    db_redir_perm  --> public static final byte STATUS_DB_REDIR_PERM = 0x05;
    /** Page was successfully fetched and found not modified. */
    db_notmodified --> public static final byte STATUS_DB_NOTMODIFIED = 0x06;
    /** Page was marked as being a duplicate of another page */
    db_duplicate   --> public static final byte STATUS_DB_DUPLICATE = 0x07;
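
To make the mapping concrete, here is a minimal, self-contained sketch of a lookup from status byte to readdb label. It only copies the seven constants quoted above; the class name DbStatusNames and the method nameOf are made up for illustration and are not part of Nutch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative stand-in for the constants quoted above; NOT the real CrawlDatum class. */
class DbStatusNames {
    static final byte STATUS_DB_UNFETCHED = 0x01;
    static final byte STATUS_DB_FETCHED = 0x02;
    static final byte STATUS_DB_GONE = 0x03;
    static final byte STATUS_DB_REDIR_TEMP = 0x04;
    static final byte STATUS_DB_REDIR_PERM = 0x05;
    static final byte STATUS_DB_NOTMODIFIED = 0x06;
    static final byte STATUS_DB_DUPLICATE = 0x07;

    // Insertion-ordered so that printing follows the numeric order of the constants.
    private static final Map<Byte, String> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put(STATUS_DB_UNFETCHED, "db_unfetched");
        NAMES.put(STATUS_DB_FETCHED, "db_fetched");
        NAMES.put(STATUS_DB_GONE, "db_gone");
        NAMES.put(STATUS_DB_REDIR_TEMP, "db_redir_temp");
        NAMES.put(STATUS_DB_REDIR_PERM, "db_redir_perm");
        NAMES.put(STATUS_DB_NOTMODIFIED, "db_notmodified");
        NAMES.put(STATUS_DB_DUPLICATE, "db_duplicate");
    }

    /** Returns the readdb label for a status byte, or "unknown" for any other value. */
    static String nameOf(byte status) {
        return NAMES.getOrDefault(status, "unknown");
    }

    public static void main(String[] args) {
        // Print every known status byte next to its label.
        for (Map.Entry<Byte, String> e : NAMES.entrySet()) {
            System.out.printf("0x%02x -> %s%n", e.getKey(), e.getValue());
        }
    }
}
```

Each count reported by readdb stats is then simply the number of CrawlDatum records whose status byte carries the given label.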

The private byte status field of CrawlDatum takes one of the values above, depending on the state of the URL. (There are many other flags that I'm not discussing here.)

When does the status value of a CrawlDatum object change?

Many flows can move a URL into one of the states listed above. I will explain the few that I know well.

  1. When we inject URLs into Nutch, the crawlDb folder is created with one CrawlDatum object per URL, with its state set to db_unfetched. See the code below from the InjectReducer.reduce method of the Injector class:

    for (CrawlDatum val : values) {
      if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
        injected.set(val);
        injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
        injectedSet = true;
      } else {
        old.set(val);
        oldSet = true;
      }
    }

Setting this status lets the Generator phase recognize the URL as not yet fetched and pick it for fetching.
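
The loop above can be tried in isolation with the following sketch, which works on bare status bytes instead of CrawlDatum objects. The class InjectSketch, the placeholder value used for STATUS_INJECTED, and the final rule that an existing record wins over a newly injected one are all assumptions made for illustration; the excerpt above does not show how Nutch chooses between the two.

```java
/** Illustrative stand-in for the inject reduce step; NOT the real InjectReducer. */
class InjectSketch {
    static final byte STATUS_DB_UNFETCHED = 0x01; // quoted from CrawlDatum above
    static final byte STATUS_DB_FETCHED = 0x02;   // quoted from CrawlDatum above
    static final byte STATUS_INJECTED = 0x42;     // placeholder value, assumed for this sketch

    /** Returns the status kept for one URL, given the statuses of all records seen for it. */
    static byte reduce(byte... statuses) {
        byte injected = 0;
        byte old = 0;
        boolean oldSet = false;
        for (byte status : statuses) {
            if (status == STATUS_INJECTED) {
                // A freshly injected record enters the CrawlDb as "not fetched yet".
                injected = STATUS_DB_UNFETCHED;
            } else {
                // Any other record is an entry that was already in the CrawlDb.
                old = status;
                oldSet = true;
            }
        }
        // Assumption: an existing entry is kept in preference to the injected one.
        return oldSet ? old : injected;
    }

    public static void main(String[] args) {
        System.out.println(reduce(STATUS_INJECTED));                    // new URL
        System.out.println(reduce(STATUS_DB_FETCHED, STATUS_INJECTED)); // URL already known
    }
}
```

Under these assumptions a brand-new URL ends up with status 0x01 (db_unfetched), while re-injecting a URL that was already fetched leaves its existing status untouched.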

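  2. In the Fetcher phase, the FetcherThread source code shows that the CrawlDatum status is set according to the protocol status of the response, which for HTTP derives from the HTTP status code (the HTTP status code definitions are a useful reference here). Note that the fetcher writes STATUS_FETCH_* values; these are translated into the corresponding db_* statuses when the CrawlDb is updated.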

    case ProtocolStatus.MOVED: // redirect
    case ProtocolStatus.TEMP_MOVED:
      int code;
      boolean temp;
      if (status.getCode() == ProtocolStatus.MOVED) {
        code = CrawlDatum.STATUS_FETCH_REDIR_PERM;
        temp = false;
      } else {
        code = CrawlDatum.STATUS_FETCH_REDIR_TEMP;
        temp = true;
      }
      output(fit.url, fit.datum, content, status, code);
      String newUrl = status.getMessage();
      Text redirUrl = handleRedirect(fit, newUrl, temp,
          Fetcher.PROTOCOL_REDIR);
      if (redirUrl != null) {
        fit = queueRedirect(redirUrl, fit);
      } else {
        // stop redirecting
        redirecting = false;
      }
      break;
    case ProtocolStatus.EXCEPTION:
      logError(fit.url, status.getMessage());
      int killedURLs = ((FetchItemQueues) fetchQueues).checkExceptionThreshold(fit
          .getQueueID());
      if (killedURLs != 0)
        context.getCounter("FetcherStatus",
            "AboveExceptionThresholdInQueue").increment(killedURLs);
      /* FALLTHROUGH */
    case ProtocolStatus.RETRY: // retry
    case ProtocolStatus.BLOCKED:
      output(fit.url, fit.datum, null, status,
          CrawlDatum.STATUS_FETCH_RETRY);
      break;
    case ProtocolStatus.GONE: // gone
    case ProtocolStatus.NOTFOUND:
    case ProtocolStatus.ACCESS_DENIED:
    case ProtocolStatus.ROBOTS_DENIED:
      output(fit.url, fit.datum, null, status,
          CrawlDatum.STATUS_FETCH_GONE);
      break;
    case ProtocolStatus.NOTMODIFIED:
      output(fit.url, fit.datum, null, status,
          CrawlDatum.STATUS_FETCH_NOTMODIFIED);
      break;
    default:
      if (LOG.isWarnEnabled()) {
        LOG.warn("{} {} Unknown ProtocolStatus: {}", getName(),
            Thread.currentThread().getId(), status.getCode());
      }
      output(fit.url, fit.datum, null, status,
          CrawlDatum.STATUS_FETCH_RETRY);

    if (redirecting && redirectCount > maxRedirect) {
      ((FetchItemQueues) fetchQueues).finishFetchItem(fit);
      if (LOG.isInfoEnabled()) {
        LOG.info("{} {} - redirect count exceeded {}", getName(),
            Thread.currentThread().getId(), fit.url);
      }
      output(fit.url, fit.datum, null,
          ProtocolStatus.STATUS_REDIR_EXCEEDED,
          CrawlDatum.STATUS_FETCH_GONE);
    }
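
The grouping performed by the switch above can be summarized in a small standalone sketch. The enum Protocol and the method fetchStatusFor are hypothetical names introduced here; the returned strings are the CrawlDatum constant names that appear in the excerpt, and OTHER stands in for any code that reaches the default branch.

```java
/** Illustrative summary of the switch above; NOT the real FetcherThread code. */
class FetchStatusSketch {
    /** Protocol outcomes named in the excerpt; OTHER covers the default branch. */
    enum Protocol {
        MOVED, TEMP_MOVED, EXCEPTION, RETRY, BLOCKED,
        GONE, NOTFOUND, ACCESS_DENIED, ROBOTS_DENIED, NOTMODIFIED, OTHER
    }

    /** Returns the name of the CrawlDatum fetch status the excerpt assigns to an outcome. */
    static String fetchStatusFor(Protocol p) {
        switch (p) {
            case MOVED:
                return "STATUS_FETCH_REDIR_PERM";  // permanent redirect
            case TEMP_MOVED:
                return "STATUS_FETCH_REDIR_TEMP";  // temporary redirect
            case EXCEPTION:                        // falls through to retry in the excerpt
            case RETRY:
            case BLOCKED:
                return "STATUS_FETCH_RETRY";
            case GONE:
            case NOTFOUND:
            case ACCESS_DENIED:
            case ROBOTS_DENIED:
                return "STATUS_FETCH_GONE";
            case NOTMODIFIED:
                return "STATUS_FETCH_NOTMODIFIED";
            default:
                return "STATUS_FETCH_RETRY";       // unknown codes are retried
        }
    }

    public static void main(String[] args) {
        for (Protocol p : Protocol.values()) {
            System.out.println(p + " -> " + fetchStatusFor(p));
        }
    }
}
```

The sketch leaves out the second excerpt, where a URL that exceeds the maximum number of redirects is also written with STATUS_FETCH_GONE.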

  3. In the deduplication phase, if a URL is found to be a duplicate based on its MD5 hash, its status is set to STATUS_DB_DUPLICATE, and the Generator will not pick it in the next iteration.
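
The idea of hash-based duplicate marking can be sketched as follows. This is a simplification under stated assumptions: the class DedupSketch and its methods are hypothetical, the hash is taken over the raw page text, and the first URL seen for a given hash is the one kept, which may differ from the rule Nutch actually uses to choose the surviving entry.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch of MD5-based duplicate marking; NOT the real Nutch dedup job. */
class DedupSketch {
    /** Returns the MD5 digest of the content as a lowercase hex string. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /**
     * Maps each URL to "db_fetched" or "db_duplicate".
     * Assumption: the first URL seen with a given hash is kept, later ones are marked.
     */
    static Map<String, String> markDuplicates(Map<String, String> contentByUrl) {
        Set<String> seenHashes = new HashSet<>();
        Map<String, String> statusByUrl = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : contentByUrl.entrySet()) {
            // Set.add returns false when the hash was already present.
            boolean firstWithThisHash = seenHashes.add(md5Hex(e.getValue()));
            statusByUrl.put(e.getKey(), firstWithThisHash ? "db_fetched" : "db_duplicate");
        }
        return statusByUrl;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://example.com/a", "same body");
        pages.put("http://example.com/b", "same body");
        pages.put("http://example.com/c", "different body");
        System.out.println(markDuplicates(pages));
    }
}
```

In this sketch the second URL with identical content is the one labelled db_duplicate, which is the status that readdb stats would then count under db_duplicate.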
