excessive traffic from facebookexternalhit bot


Question



Does anyone know how tell the 'facebookexternalhit' bot to spread its traffic?

Our website gets hammered every 45 - 60 minutes with spikes of approx. 400 requests per second, from 20 to 30 different IP addresses in the facebook netblocks. Between the spikes the traffic does not disappear, but the load is acceptable. Of course we do not want to block the bot, but these spikes are risky. We'd prefer to see the bot spread its load equally over time and behave like Googlebot & friends.

I've seen related bug reports (First Bug, Second Bug and Third Bug (#385275384858817)), but could not find any suggestions on how to manage the load.

Solution

Per other answers, the semi-official word from Facebook is "suck it". It boggles me that they cannot follow Crawl-delay (yes, I know it's not a "crawler", however GET'ing 100 pages in a few seconds is a crawl, whatever you want to call it).
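For reference, Crawl-delay is a non-standard robots.txt extension honored by some crawlers (Bingbot, for example, documents support for it); facebookexternalhit does not read robots.txt at all, which is the crux of the complaint. If it were honored, the fragment would look like:

```
User-agent: facebookexternalhit
Crawl-delay: 2
```

Since the bot ignores this, a server-side throttle is the only place left to enforce a minimum interval between hits.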

Since one cannot appeal to their hubris, and DROP'ing their IP block is pretty draconian, here is my technical solution.

In PHP, execute the following code as early as possible on every request.

define( 'FACEBOOK_REQUEST_THROTTLE', 2.0 ); // Number of seconds permitted between each hit from facebookexternalhit

if( !empty( $_SERVER['HTTP_USER_AGENT'] ) && preg_match( '/^facebookexternalhit/', $_SERVER['HTTP_USER_AGENT'] ) ) {
    $fbTmpFile = sys_get_temp_dir().'/facebookexternalhit.txt';
    if( $fh = fopen( $fbTmpFile, 'c+' ) ) {
        $lastTime = (float) fread( $fh, 100 ); // cast: an empty file reads as "", which is not numeric
        $microTime = microtime( TRUE );
        // check current microtime with microtime of last access
        if( $microTime - $lastTime < FACEBOOK_REQUEST_THROTTLE ) {
            // bail if requests are coming too quickly with http 503 Service Unavailable
            header( $_SERVER["SERVER_PROTOCOL"].' 503 Service Unavailable' );
            die;
        } else {
            // record the time of this access (seconds, with microsecond precision)
            rewind( $fh );
            fwrite( $fh, $microTime );
        }
        fclose( $fh );
    } else {
        header( $_SERVER["SERVER_PROTOCOL"].' 503 Service Unavailable' );
        die;
    }
}

You can test this from a command line with something like:

$ rm index.html*; wget -U "facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)" http://www.foobar.com/; less index.html

Improvement suggestions are welcome... I would guess there might be some concurrency issues with a huge blast.
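One way to address that concurrency concern is to serialize access to the timestamp file with `flock()`. The sketch below is my own variant, not part of the original answer; the function name `fbThrottleAllowed` is hypothetical, and it assumes the same temp-file scheme as the code above:

```php
<?php
// Sketch: the throttle check from above, made concurrency-safe with an
// exclusive advisory lock so that simultaneous requests from the bot
// read and update the timestamp file one at a time.

define( 'FB_REQUEST_THROTTLE', 2.0 ); // seconds permitted between hits

function fbThrottleAllowed( string $tmpFile, float $now ): bool
{
    $fh = fopen( $tmpFile, 'c+' ); // create if missing, do not truncate
    if ( $fh === false ) {
        return false; // cannot track state: fail closed, as the original does
    }
    flock( $fh, LOCK_EX );                     // serialize concurrent requests
    $lastTime = (float) fread( $fh, 100 );     // empty file reads as "" -> 0.0
    $allowed = ( $now - $lastTime ) >= FB_REQUEST_THROTTLE;
    if ( $allowed ) {
        ftruncate( $fh, 0 );                   // drop stale trailing digits
        rewind( $fh );
        fwrite( $fh, (string) $now );          // record this access
    }
    flock( $fh, LOCK_UN );
    fclose( $fh );
    return $allowed;
}

$file = sys_get_temp_dir() . '/facebookexternalhit.txt';
@unlink( $file );
var_dump( fbThrottleAllowed( $file, microtime( true ) ) ); // bool(true)
var_dump( fbThrottleAllowed( $file, microtime( true ) ) ); // bool(false), too soon
```

Note that `LOCK_EX` blocks until the lock is free; under a genuine 400 req/s blast you may prefer `LOCK_NB` and simply 503 any request that fails to acquire the lock, so no PHP workers pile up waiting.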

