How to set Robots.txt or Apache to allow crawlers only at certain hours?

Problem description

As traffic is distributed unevenly over 24 hours, I would like to disallow crawlers during peak hours and allow them at non-busy hours.

Is there a method to achieve this?

Edit: thanks for all the good advice.

This is another solution we found.

2bits.com has an article on setting up an IPTables firewall to limit the number of connections from certain IP addresses.

The iptables setting from the article:


  • Use connlimit

In newer Linux kernels, there is a connlimit module for iptables. It can be used like this:

iptables -I INPUT -p tcp -m connlimit --connlimit-above 5 -j REJECT

This limits connections from each IP address to no more than 5 simultaneous connections. This effectively rations connections and prevents crawlers from hitting the site with many simultaneous requests.
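
Since the goal is to throttle crawlers only at certain hours, the same rule can be restricted to a time window with iptables' time match. This is only a sketch, assuming the connlimit and time match modules are available for your kernel and iptables version; the port and hours below are illustrative:

# Reject more than 5 simultaneous connections per source IP to port 80,
# but only between 08:00 and 20:00. The time match uses UTC by default;
# some iptables versions accept --kerneltz to use local time instead.
iptables -I INPUT -p tcp --dport 80 \
    -m time --timestart 08:00 --timestop 20:00 \
    -m connlimit --connlimit-above 5 -j REJECT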

Accepted answer

You can't control that in the robots.txt file. It's possible that some crawlers might support something like that, but none of the big ones do (as far as I know).
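
The closest robots.txt comes to this is the non-standard Crawl-delay directive, which spaces out requests rather than restricting them to certain hours. A rough illustration (Googlebot ignores it, but some crawlers such as Bing and Yandex honor it):

User-agent: *
# Ask compliant crawlers to wait at least 10 seconds between requests.
# Not part of the robots.txt standard; ignored by Googlebot.
Crawl-delay: 10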

Dynamically changing the robots.txt file is also a bad idea in a case like this. Most crawlers cache the robots.txt file for a certain time, and continue using it until they refresh the cache. If they cache it at the "right" time, they might crawl normally all day. If they cache it at the "wrong" time, they would stop crawling altogether (and perhaps even remove indexed URLs from their index). For instance, Google generally caches the robots.txt file for a day, meaning that changes during the course of a day would not be visible to Googlebot.

If crawling is causing too much load on your server, you can sometimes adjust the crawl rate for individual crawlers. For instance, for Googlebot you can do this in Google Webmaster Tools.

Additionally, when crawlers attempt to crawl during times of high load, you can always just serve them a 503 HTTP result code. This tells crawlers to check back at some later time (you can also specify a Retry-After HTTP header if you know when they should come back). While I'd try to avoid doing this strictly on a time-of-day basis (it can block many other features, such as Sitemaps, contextual ads, or website verification, and can slow down crawling in general), in exceptional cases it might make sense. In the long run, I'd strongly recommend only doing this when your server load is really much too high to successfully return content to crawlers.
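
As a rough sketch of that approach in Apache (assuming mod_rewrite is enabled; the flag-file path and bot pattern are illustrative assumptions, with an external monitoring job expected to create and remove the flag file based on server load):

RewriteEngine On
# Only target well-known crawler user agents (adjust the pattern as needed).
RewriteCond %{HTTP_USER_AGENT} (Googlebot|bingbot|Baiduspider|YandexBot) [NC]
# Only act while an "overloaded" flag file exists on disk.
RewriteCond /var/www/flags/overloaded -f
# Answer with 503 Service Unavailable instead of serving the page.
RewriteRule .* - [R=503,L]

Well-behaved crawlers treat the 503 as temporary and retry later; if you know when the busy period ends, a Retry-After header can additionally be set with mod_headers.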
