How to identify web crawlers of Google/Yahoo/MSN in PHP?

Question

AFAIK,

$_SERVER['REMOTE_HOST'] should end with "google.com" or "yahoo.com".

But is that the most reliable method?

Is there any other way?

Recommended answer

You identify search engines by user agent and IP address. More info can be found in "How to identify search engine spiders and webbots". It's also worth noting this list. However, you shouldn't treat user agents (or even remote hosts) as definitive. A user agent is really nothing more than what the other end tells you it is, and it is of course free to tell you anything. It's trivial to write code that pretends to be Googlebot.
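Since neither the user agent nor a bare hostname is definitive, one common way to tighten the IP side is a forward-confirmed reverse DNS check: resolve the IP to a hostname, check the hostname's suffix, then resolve that hostname back and confirm it returns the same IP. Below is a minimal sketch of that idea, not a definitive implementation. The function names and the suffix list are my own assumptions for illustration; verify the actual crawler domains against each search engine's current documentation.

```php
<?php
// Assumed hostname suffixes, for illustration only; confirm them
// against each search engine's own documentation before relying on them.
const CRAWLER_HOST_SUFFIXES = [
    '.googlebot.com',
    '.google.com',
    '.crawl.yahoo.net',
    '.search.msn.com',
];

// True if $host is a subdomain of one of the given suffixes.
// The leading dot in each suffix rejects hosts like "evilgooglebot.com".
function hostHasCrawlerSuffix(string $host, array $suffixes = CRAWLER_HOST_SUFFIXES): bool
{
    $host = strtolower(rtrim($host, '.'));
    foreach ($suffixes as $suffix) {
        if (strlen($host) > strlen($suffix)
            && substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

// Forward-confirmed reverse DNS: IP -> hostname -> IPs must contain the IP.
function isForwardConfirmedCrawler(string $ip, array $suffixes = CRAWLER_HOST_SUFFIXES): bool
{
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return false;
    }
    $host = gethostbyaddr($ip);            // reverse lookup
    if ($host === false || $host === $ip) {
        return false;                      // malformed input or no PTR record
    }
    if (!hostHasCrawlerSuffix($host, $suffixes)) {
        return false;
    }
    // gethostbynamel() returns IPv4 addresses only, so IPv6 clients
    // will never be confirmed by this sketch.
    $ips = gethostbynamel($host);          // forward lookup
    return $ips !== false && in_array($ip, $ips, true);
}
```

You would call something like isForwardConfirmedCrawler($_SERVER['REMOTE_ADDR']). DNS lookups are slow, so you would probably want to cache the result per IP rather than run this on every request.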

In PHP, this means looking at $_SERVER['HTTP_USER_AGENT'] and $_SERVER['REMOTE_HOST'].
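Here is a rough sketch of reading those two values. Note that $_SERVER['REMOTE_HOST'] only exists if the web server is configured to do reverse DNS lookups (in Apache, HostnameLookups On), so the sketch falls back to gethostbyaddr() on REMOTE_ADDR. The function names and the user-agent tokens are assumptions for illustration, and the result is only what the client claims to be.

```php
<?php
// Returns which engine the client CLAIMS to be, based on the user agent.
// The token list is an illustrative assumption, not an exhaustive registry.
function claimedCrawler(array $server): ?string
{
    $ua = $server['HTTP_USER_AGENT'] ?? '';
    $tokens = [
        'Googlebot' => 'google',
        'Slurp'     => 'yahoo',
        'msnbot'    => 'msn',
        'bingbot'   => 'msn',
    ];
    foreach ($tokens as $token => $engine) {
        if (stripos($ua, $token) !== false) {
            return $engine;
        }
    }
    return null;
}

// Returns the remote hostname, or null if it cannot be determined.
function remoteHost(array $server): ?string
{
    // Set only when the web server performs reverse lookups itself.
    if (!empty($server['REMOTE_HOST'])) {
        return $server['REMOTE_HOST'];
    }
    $ip = $server['REMOTE_ADDR'] ?? '';
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return null;
    }
    $host = gethostbyaddr($ip);
    // gethostbyaddr() returns the IP unchanged when the lookup fails.
    return ($host === false || $host === $ip) ? null : $host;
}
```

Typical use would be $engine = claimedCrawler($_SERVER); and $host = remoteHost($_SERVER);, then comparing the two before trusting either.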

There are a lot of search engines, but generally speaking it's only the big few you really care about. Google and Yahoo together have almost all of the market. But of course it depends on what you're trying to achieve.

Note: be very careful of treating search engines differently to normal users (like the "evil hyphen site" as Joel put it) when it comes to content. In particularly egregious cases, this could get your site removed from that search engine. Even if that doesn't happen you will probably put some users off who go to a site expecting something. If they're then presented with a "Please register to see this article" box instead, well, gratz on your high bounce rate.
