PHP crawler detection


Question

I'm trying to write a sitemap.php which acts differently depending on who is looking.

I want to redirect crawlers to my sitemap.xml, as that will be the most up-to-date page and will contain all the info they need, but I want my regular readers to be shown an HTML sitemap on the PHP page.

This will all be controlled from within the PHP header, and I've found this code on the web which by the looks of it should work, but it doesn't. Can anyone help crack this for me?

function getIsCrawler($userAgent) {
    $crawlers = 'firefox|Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|' .
    'AcioRobot|ASPSeek|CocoCrawler|Dumbot|FAST-WebCrawler|' .
    'GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';
    $isCrawler = (preg_match("/$crawlers/i", $userAgent) > 0);
    return $isCrawler;
}

$iscrawler = getIsCrawler($_SERVER['HTTP_USER_AGENT']);

if ($isCrawler) {
    header('Location: http://www.website.com/sitemap.xml');
    exit;
} else {
    echo "not crawler!";
}

It looks pretty simple, but as you can see I've added firefox to the agent list, and sure enough I'm not being redirected.

Thanks for any help :)

Solution

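You have a mistake in your code:

$iscrawler = getIsCrawler($_SERVER['HTTP_USER_AGENT']);

should be

$isCrawler = getIsCrawler($_SERVER['HTTP_USER_AGENT']);

PHP variable names are case-sensitive, so $iscrawler and $isCrawler are two different variables, and the if ($isCrawler) check reads one that was never assigned.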

If you develop with notices enabled, you'll catch these errors much more easily.
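
As a rough illustration of what "notices on" could look like during development, the sketch below raises the error reporting level and then reproduces the same kind of case mismatch. The variable values are made up for the demonstration, and the display settings are assumed to be for a development machine only, not production.

```php
<?php
// Development only: report every error level, including notices and warnings.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Assigned with a lowercase "c" (hypothetical value for the demo).
$iscrawler = true;

// $isCrawler was never assigned because the name differs by case, so PHP
// reports an undefined variable here: a notice before PHP 8.0, a warning
// from PHP 8.0 onward. The condition evaluates as false either way.
if ($isCrawler) {
    echo "redirect";
}
```

With reporting enabled, the undefined-variable message points at the line of the if statement, which makes the typo easy to trace back.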

Also, you probably want to exit after the header() call, so that the rest of the script does not run once the redirect has been sent.
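
Putting the points above together, a corrected version might look like the following minimal sketch. It is not the original author's code: the function name isCrawler, the shortened token list, and the ?? '' fallback for a missing User-Agent header are assumptions made for illustration, and the arrow function requires PHP 7.4 or later. The firefox entry that the question added for testing is left out.

```php
<?php
/**
 * Returns true when the User-Agent contains one of the listed crawler tokens.
 * The token list is illustrative only (an assumption), not a complete registry.
 */
function isCrawler(string $userAgent): bool
{
    $tokens = ['Googlebot', 'bingbot', 'msnbot', 'Yahoo', 'Gigabot'];

    // Quote each token so any regex metacharacters are matched literally.
    $quoted = array_map(fn (string $t): string => preg_quote($t, '/'), $tokens);
    $pattern = '/' . implode('|', $quoted) . '/i';

    return preg_match($pattern, $userAgent) === 1;
}

// The header may be absent (for example on the command line), so default to ''.
// Note the consistent spelling: the same $isCrawler is assigned and tested.
$isCrawler = isCrawler($_SERVER['HTTP_USER_AGENT'] ?? '');

if ($isCrawler) {
    header('Location: http://www.website.com/sitemap.xml');
    exit; // Stop here so nothing else is sent after the redirect.
} else {
    echo "not crawler!";
}
```

User-Agent strings are supplied by the client and are easy to spoof, so a check like this is best treated as a hint rather than proof that the visitor is a crawler.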

Warning: Cloaking can get you in trouble with search providers. This article explains why.
