How to identify web-crawler?

Problem Description

How can I filter out hits from web crawlers, bots, etc., so that I keep only the hits that come from humans?

I use maxmind.com to look up the city for each IP address. This is not exactly cheap if I have to pay for ALL hits, including those from web crawlers, robots, etc.

Recommended Answer

There are two general ways to detect robots and I would call them "Polite/Passive" and "Aggressive". Basically, you have to give your web site a psychological disorder.

Polite/Passive

These are ways to politely tell crawlers that they shouldn't crawl your site and to limit how often you are crawled. Politeness is ensured through a robots.txt file, in which you specify which bots, if any, should be allowed to crawl your website and how often it may be crawled. This assumes that the robot you're dealing with is polite.
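
For illustration, a minimal robots.txt (served from the root of the site) might look like the following; the /private/ path is just a placeholder, and note that Crawl-delay is honored by some crawlers (e.g. Bing and Yandex) but ignored by others (e.g. Google):

    # Allow Google's crawler, but keep it out of /private/
    User-agent: Googlebot
    Disallow: /private/

    # All other bots: please wait 10 seconds between requests
    User-agent: *
    Crawl-delay: 10
    Disallow: /private/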

Another way to keep bots off your site is to get aggressive.

User Agent

Some aggressive behavior includes (as previously mentioned by other users) the filtering of user-agent strings. This is probably the simplest, but also the least reliable, way to detect whether a visitor is a bot. A lot of bots tend to spoof user agents, some for legitimate reasons (e.g. they only want to crawl mobile content), while others simply don't want to be identified as bots. Even worse, some bots spoof legitimate/polite bot agents, such as the user agents of Google, Microsoft, Lycos, and other crawlers which are generally considered polite. Relying on the user agent can be helpful, but not by itself.
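
As a minimal sketch of such a filter in PHP (the asker's language), the token list below is illustrative rather than exhaustive, and spoofing bots will slip straight through it:

    <?php
    // Rough user-agent check: returns true if the UA string contains
    // a token commonly found in crawler user agents. Spoofed agents
    // are NOT caught, so treat this as a first-pass filter only.
    function looks_like_bot(string $userAgent): bool
    {
        $botTokens = array('bot', 'crawl', 'spider', 'slurp', 'curl', 'wget');
        $ua = strtolower($userAgent);
        foreach ($botTokens as $token) {
            if (strpos($ua, $token) !== false) {
                return true;
            }
        }
        return false;
    }

    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (!looks_like_bot($ua)) {
        // Only hits that pass the filter reach the paid GeoIP lookup;
        // lookup_city_from_ip() is a placeholder for your maxmind call.
        // $city = lookup_city_from_ip($_SERVER['REMOTE_ADDR']);
    }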

There are more aggressive ways to deal with robots that spoof user agents AND don't abide by your robots.txt file:

Bot Traps

I like to think of this as a "Venus Fly Trap," and it basically punishes any bot that wants to play tricks with you.

A bot trap is probably the most effective way to find bots that don't adhere to your robots.txt file without actually impairing the usability of your website. Creating a bot trap ensures that only bots are captured, not real users. The basic way to do it is to set up a directory which you specifically mark as off limits in your robots.txt file, so any robot that is polite will not fall into the trap. The second thing you do is place a "hidden" link from your website to the bot trap directory (this ensures that real users will never go there, since real users never click on invisible links). Finally, you ban any IP address that visits the bot trap directory.

Here are some instructions on how to achieve this: Create a bot trap (or in your case: a PHP bot trap).
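
As a bare-bones sketch of the idea (the /bot-trap/ path and banned_ips.txt file are made-up names for this example): first, hide a link that only crawlers will follow, and list the same path as off limits in robots.txt; then record every IP that requests it.

    <!-- In your page templates: a link no human ever sees or clicks.
         /bot-trap/ is also Disallow'ed in robots.txt, so polite bots
         stay away and only rule-breakers end up following it. -->
    <a href="/bot-trap/" style="display:none">&nbsp;</a>

    <?php
    // /bot-trap/index.php: anything requesting this page ignored
    // robots.txt, so append its IP to a simple blacklist file.
    // A database table would work just as well as a flat file.
    $ip = $_SERVER['REMOTE_ADDR'];
    file_put_contents('/var/data/banned_ips.txt', $ip . PHP_EOL,
                      FILE_APPEND | LOCK_EX);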

Note: of course, some bots are smart enough to read your robots.txt file, see all the directories which you've marked as "off limits" and STILL ignore your politeness settings (such as crawl rate and allowed bots). Those bots will probably not fall into your bot trap despite the fact that they are not polite.

Violent

I think this is actually too aggressive for the general audience (and general use), so if there are any kids under the age of 18, then please take them to another room!

You can make the bot trap "violent" by simply not specifying a robots.txt file. In this situation ANY BOT that crawls the hidden links will probably end up in the bot trap and you can ban all bots, period!

The reason this is not recommended is that you may actually want some bots to crawl your website (such as Google, Microsoft or other bots for site indexing). Allowing your website to be politely crawled by the bots from Google, Microsoft, Lycos, etc. will ensure that your site gets indexed and it shows up when people search for it on their favorite search engine.

Self Destructive

Yet another way to limit what bots can crawl on your website is to serve CAPTCHAs or other challenges which a bot cannot solve. This comes at the expense of your users, and I would argue that anything which makes your website less usable (such as a CAPTCHA) is "self destructive." This, of course, will not actually block a bot from repeatedly trying to crawl your website; it will simply make your website very uninteresting to it. There are ways to "get around" CAPTCHAs, but they're difficult to implement, so I'm not going to delve into this too much.
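
A real CAPTCHA service is the usual way to do this; purely to show the shape of such a gate, here is a toy arithmetic challenge in PHP (trivial for a determined bot to solve, so treat it strictly as a sketch):

    <?php
    session_start();

    if ($_SERVER['REQUEST_METHOD'] === 'POST') {
        // Compare the submitted answer with the one stored server-side.
        $given = isset($_POST['answer']) ? (int)$_POST['answer'] : -1;
        if (isset($_SESSION['answer']) && $given === $_SESSION['answer']) {
            $_SESSION['human'] = true; // challenge passed
        }
    }

    if (empty($_SESSION['human'])) {
        // Show a challenge instead of the page content.
        $a = random_int(1, 9);
        $b = random_int(1, 9);
        $_SESSION['answer'] = $a + $b;
        echo "<form method='post'>What is $a + $b? "
           . "<input name='answer'><button>Go</button></form>";
        exit;
    }

    // ...the protected page content continues here...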

For your purposes, probably the best way to deal with bots is to employ a combination of the above-mentioned strategies:

  1. Filter the user agents.
  2. Set up a bot trap (the violent one).

Catch all the bots that go into the violent bot trap and simply black-list their IPs (but don't block them). This way you will still get the "benefits" of being crawled by bots, but you will not have to pay to check the IP addresses that are black-listed due to going to your bot trap.
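
Tying this back to the original cost problem, a sketch of the final check might look like this; is_banned() reads the blacklist file written by the hypothetical bot trap above, and lookup_city_from_ip() again stands in for your maxmind call:

    <?php
    // True if the visitor's IP was caught by the bot trap earlier.
    function is_banned(string $ip): bool
    {
        $lines = @file('/var/data/banned_ips.txt',
                       FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        return $lines !== false && in_array($ip, $lines, true);
    }

    $ip = $_SERVER['REMOTE_ADDR'];
    if (!is_banned($ip)) {
        // Only non-blacklisted visitors trigger the paid lookup.
        // $city = lookup_city_from_ip($ip);
    } else {
        // Still serve the page, just without the geo lookup.
    }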
