How can one detect a crawler / spider using PHP?


Question



I'm currently working on a project where I need to keep track of each crawler's visit.
I know that you should use HTTP_USER_AGENT, but I'm not really sure how to write the code for this purpose. I also know that the user agent can be changed very easily, so I would like to know whether it is possible to add further checks to avoid spoofing.

Sample code of what I'm trying to do:

<?php
// The User-Agent header may be missing, so default to an empty string.
$user_agent = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (strpos($user_agent, 'Google') !== false) {
    echo "Googlebot is here";
}
?>

Thank you

Solution

According to Verifying Googlebot:

You can verify that a bot accessing your server really is Googlebot (or another Google user-agent) by using a reverse DNS lookup, verifying that the name is in the googlebot.com domain, and then doing a forward DNS lookup using that googlebot name. This is useful if you're concerned that spammers or other troublemakers are accessing your site while claiming to be Googlebot.

For example:

host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

Google doesn't post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard coded them. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot).
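
Since the question asks about tracking every crawler rather than Googlebot alone, the user-agent check can be generalized to a small lookup table. The following is only a sketch: the function name `detectCrawlerName` is hypothetical, and the token list is an illustrative assumption that is far from exhaustive. A match here says only what the client claims to be, not who it really is.

```php
<?php
// Sketch: map a User-Agent string to a crawler label.
// detectCrawlerName() is a hypothetical helper; the token list below is
// an illustrative assumption and should be extended as needed.
function detectCrawlerName(string $userAgent): ?string
{
    $tokens = [
        'Googlebot'   => 'Google',
        'bingbot'     => 'Bing',
        'DuckDuckBot' => 'DuckDuckGo',
        'YandexBot'   => 'Yandex',
        'Baiduspider' => 'Baidu',
    ];

    foreach ($tokens as $token => $name) {
        // Case-insensitive substring match on the crawler token.
        if (stripos($userAgent, $token) !== false) {
            return $name;
        }
    }

    // No known crawler token found.
    return null;
}

// The header may be absent, so fall back to an empty string.
$crawler = detectCrawlerName($_SERVER['HTTP_USER_AGENT'] ?? '');
```

The returned label (or null) could then be written to a log or database table together with the request time and REMOTE_ADDR.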

You can do a reverse DNS lookup:

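function validateGoogleBotIP($ip) {
    $hostname = gethostbyaddr($ip); // e.g. "crawl-66-249-66-1.googlebot.com"

    // gethostbyaddr() returns false on malformed input and the unchanged IP when the lookup fails.
    if ($hostname === false || $hostname === $ip) {
        return false;
    }

    return preg_match('/\.google(bot)?\.com$/i', $hostname) === 1;
}

// The User-Agent header may be missing, so default to an empty string.
$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';

if (strpos($userAgent, 'Google') !== false) {
    if (validateGoogleBotIP($_SERVER['REMOTE_ADDR'])) {
        echo 'It is ACTUALLY google';
    } else {
        echo 'Someone\'s faking it!';
    }
} else {
    echo 'Nothing to do with Google';
}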
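
The function above performs only the reverse lookup, whereas the procedure quoted from Google also calls for a forward lookup on the resulting name. A minimal sketch of that second step follows. The name `isVerifiedGooglebot` and the two optional resolver parameters are hypothetical, added so the logic can be exercised without real DNS queries; by default it uses PHP's `gethostbyaddr()` and `gethostbynamel()`.

```php
<?php
// Sketch: forward-confirmed reverse DNS check for a claimed Googlebot IP.
// isVerifiedGooglebot() and its resolver parameters are hypothetical names.
function isVerifiedGooglebot(string $ip, ?callable $reverse = null, ?callable $forward = null): bool
{
    // Default to PHP's built-in resolvers when none are injected.
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    // Step 1: reverse lookup. gethostbyaddr() returns false on malformed
    // input and the unchanged IP when the lookup fails.
    $hostname = $reverse($ip);
    if ($hostname === false || $hostname === $ip) {
        return false;
    }

    // Step 2: the name must end in .googlebot.com or .google.com.
    if (preg_match('/\.google(bot)?\.com$/i', $hostname) !== 1) {
        return false;
    }

    // Step 3: forward lookup must map the name back to the same IP.
    // gethostbynamel() returns an array of IPv4 addresses, or false.
    $addresses = $forward($hostname);

    return is_array($addresses) && in_array($ip, $addresses, true);
}
```

Calling `isVerifiedGooglebot($_SERVER['REMOTE_ADDR'])` uses the real resolvers. Note that `gethostbynamel()` returns IPv4 addresses only, so this sketch would reject IPv6 clients; it also performs blocking DNS queries on each call, so caching the result per IP is probably worthwhile.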
