Why doesn't Nutch seem to know about "Last-Modified"?


Problem description

I set up Nutch with a db.fetch.interval.default of 60000 so that I can crawl every day. If I don't, it won't even look at my site when I crawl the next day. But when I do crawl the next day, every page that it fetched yesterday is fetched again with a 200 response code, which indicates that it isn't sending the previous day's date in an "If-Modified-Since" header. Shouldn't it skip fetching pages that haven't changed? Is there a way to make it do that? I noticed a ProtocolStatus.NOT_MODIFIED in Fetcher.java, so I think it should be able to do this, shouldn't it?

By the way, this is cut and pasted from conf/nutch-default.xml in the current trunk:

<!-- web db properties -->

<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>(DEPRECATED) The default number of days between re-fetches of a page.
  </description>
</property>

<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>The default number of seconds between re-fetches of a page (30 days).
  </description>
</property>
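
For reference, db.fetch.interval.default is expressed in seconds, as the description above says. A minimal Java sketch of the arithmetic (the class name IntervalCheck and the describe helper are hypothetical, not part of Nutch) suggests why 60000 allows a daily re-crawl while the shipped default does not: 60000 seconds is less than one day, whereas 2592000 seconds is 30 days.

```java
import java.time.Duration;

// Hypothetical helper class, only for illustrating the interval arithmetic.
public class IntervalCheck {

    // Render a number of seconds as days, hours and minutes.
    static String describe(long seconds) {
        Duration d = Duration.ofSeconds(seconds);
        return d.toDays() + "d " + (d.toHours() % 24) + "h " + (d.toMinutes() % 60) + "m";
    }

    public static void main(String[] args) {
        // Value used in the question: shorter than one day.
        System.out.println(describe(60000L));
        // Default from nutch-default.xml: 30 days.
        System.out.println(describe(2592000L));
    }
}
```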

Recommended answer

I found the problem. It's a bug in Nutch. I've emailed the Nutch developer list about it, but here's my fix:

Index: src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
===================================================================
--- src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java  (revision 802632)
+++ src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java  (working copy)
@@ -124,11 +124,15 @@
         reqStr.append("\r\n");
       }

-      reqStr.append("\r\n");
       if (datum.getModifiedTime() > 0) {
         reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(datum.getModifiedTime()));
         reqStr.append("\r\n");
       }
+      else if (datum.getFetchTime() > 0) {
+          reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(datum.getFetchTime()));
+          reqStr.append("\r\n");
+      }
+      reqStr.append("\r\n");     

       byte[] reqBytes= reqStr.toString().getBytes();

Now I see 304s in my Apache logs where I expect them.
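
Judging from the diff, the patch does two things: it moves the blank line that terminates the request headers so that it comes after the If-Modified-Since header instead of before it, and it falls back to the last fetch time when no modified time is recorded. The following is a standalone sketch of that logic, not the actual Nutch code. The class ConditionalGetSketch, the methods httpDate and buildRequest, and the host and path values are all assumptions made for illustration; Nutch itself uses HttpDateFormat and a CrawlDatum, as the diff shows.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical standalone sketch; not part of Nutch.
public class ConditionalGetSketch {

    // HTTP date format, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
                    .withZone(ZoneOffset.UTC);

    // Stand-in for Nutch's HttpDateFormat.toString(long).
    static String httpDate(long epochMillis) {
        return HTTP_DATE.format(Instant.ofEpochMilli(epochMillis));
    }

    // Build a raw request, preferring the modified time and falling back
    // to the fetch time, as the patch does.
    static String buildRequest(String host, String path, long modifiedTime, long fetchTime) {
        StringBuilder req = new StringBuilder();
        req.append("GET ").append(path).append(" HTTP/1.0\r\n");
        req.append("Host: ").append(host).append("\r\n");

        long since = modifiedTime > 0 ? modifiedTime : fetchTime;
        if (since > 0) {
            req.append("If-Modified-Since: ").append(httpDate(since)).append("\r\n");
        }

        // The blank line ends the header section, so it must come last.
        // A header written after it is not part of the request headers.
        req.append("\r\n");
        return req.toString();
    }

    public static void main(String[] args) {
        // No modified time recorded, fetched one day after the epoch.
        System.out.print(buildRequest("example.com", "/", 0L, 86_400_000L));
    }
}
```

With a request built this way, a server that supports conditional GET can answer 304 Not Modified for unchanged pages, which matches the 304s reported in the Apache logs.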
