Excessive traffic from facebookexternalhit bot


Problem description

Does anyone know how to tell the 'facebookexternalhit' bot to spread out its traffic?

Our website gets hammered every 45 to 60 minutes with spikes of approximately 400 requests per second, from 20 to 30 different IP addresses in the Facebook netblocks. Between the spikes the traffic does not disappear, but the load is acceptable. Of course we do not want to block the bot, but these spikes are risky. We would prefer to see the bot spread its load evenly over time and behave like Googlebot & friends.

I've seen related bug reports (First Bug, Second Bug and Third Bug (#385275384858817)), but could not find any suggestions on how to manage the load.

Recommended answer

Per other answers, the semi-official word from Facebook is "suck it". It boggles me that they cannot follow Crawl-delay (yes, I know it's not a "crawler", but GETting 100 pages in a few seconds is a crawl, whatever you want to call it).

Since one cannot appeal to their hubris, and DROPping their IP block is pretty draconian, here is my technical solution.

In PHP, execute the following code as early as possible for every request.

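define( 'FACEBOOK_REQUEST_THROTTLE', 2.0 ); // Minimum number of seconds permitted between hits from facebookexternalhit

if( !empty( $_SERVER['HTTP_USER_AGENT'] ) && strpos( $_SERVER['HTTP_USER_AGENT'], 'facebookexternalhit' ) === 0 ) {
    $fbTmpFile = sys_get_temp_dir().'/facebookexternalhit.txt';
    // 'c+' opens for reading and writing, creates the file if missing, and does not truncate it
    if( $fh = fopen( $fbTmpFile, 'c+' ) ) {
        // take an exclusive lock so concurrent requests cannot all pass the check at once
        flock( $fh, LOCK_EX );
        $lastTime = (float) fread( $fh, 100 ); // an empty file (first hit) casts to 0.0
        $microTime = microtime( TRUE );
        // compare the current time with the time of the last permitted access
        if( $microTime - $lastTime < FACEBOOK_REQUEST_THROTTLE ) {
            // bail with HTTP 503 Service Unavailable if requests are coming too quickly
            fclose( $fh ); // closing the handle also releases the lock
            header( $_SERVER["SERVER_PROTOCOL"].' 503 Service Unavailable' );
            die;
        }
        // record the time of this access, replacing the previous value
        ftruncate( $fh, 0 );
        rewind( $fh );
        fwrite( $fh, (string) $microTime );
        fclose( $fh );
    } else {
        // the throttle file could not be opened, so turn the bot away
        header( $_SERVER["SERVER_PROTOCOL"].' 429 Too Many Requests' );
        die;
    }
}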

You can test this from a command line with something like:

$ rm index.html*; wget -U "facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)" http://www.foobar.com/; less index.html
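
A single request only shows that the page is still served. To see the throttle itself, one option is to fire several requests back to back and look at the status codes. The following is a minimal sketch, not part of the original answer: the function name `burst_status_codes` is hypothetical, it uses curl instead of wget, and the URL in the usage line is the same placeholder as above.

```shell
# Hypothetical helper (not from the original answer): send COUNT back-to-back
# requests with the facebookexternalhit user agent and print one HTTP status
# code per line.
burst_status_codes() {
    _url="$1"
    _count="${2:-5}"   # defaults to 5 requests when no count is given
    _ua="facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)"
    _i=0
    while [ "$_i" -lt "$_count" ]; do
        # -s silences progress output, -o /dev/null discards the body,
        # -w prints only the status code, -A sets the User-Agent header
        curl -s -o /dev/null -w '%{http_code}\n' -A "$_ua" "$_url"
        _i=$((_i + 1))
    done
}

# Usage (placeholder URL):
#   burst_status_codes http://www.foobar.com/ 5
```

With the throttle set to 2.0 seconds, the first request should be served normally and the ones that follow within two seconds should come back as 503, assuming no other facebookexternalhit traffic reached the server just before the test.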

Suggestions for improvement are welcome... I would guess there might be some concurrency issues with a huge blast.
