Detecting 'stealth' web-crawlers

Question

What options are there to detect web-crawlers that do not want to be detected?

(I know that listing detection techniques will allow the smart stealth-crawler programmer to make a better spider, but I do not think that we will ever be able to block smart stealth-crawlers anyway, only the ones that make mistakes.)

I'm not talking about the nice crawlers such as googlebot and Yahoo! Slurp. I consider a bot nice if it:

  1. identifies itself as a bot in the user-agent string
  2. reads robots.txt (and obeys it)

I'm talking about the bad crawlers, hiding behind common user agents, using my bandwidth and never giving me anything in return.

There are some trapdoors that can be constructed (updated list; thanks Chris, gs); a code sketch follows the list:

  1. Adding a directory that is only listed (marked as disallow) in the robots.txt,
  2. Adding invisible links (possibly marked as rel="nofollow"?),
    • style="display: none;" on the link or its parent container
    • placed underneath another element with a higher z-index
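
As a concrete illustration, here is a minimal sketch of such a trapdoor, assuming a Python/Flask front end; the /secret-trap/ path, the in-memory ban set, and the page markup are all hypothetical:

    # Hypothetical Flask trap: the path below must also appear as a
    # Disallow rule in robots.txt, so only rule-ignoring bots reach it.
    from flask import Flask, request

    app = Flask(__name__)
    suspected_bots = set()  # IPs that followed a link no human should see

    @app.route("/secret-trap/")
    def trap():
        # Reachable only via an invisible, robots.txt-disallowed link,
        # so any visitor here is almost certainly a misbehaving crawler.
        suspected_bots.add(request.remote_addr)
        return "", 404

    @app.route("/")
    def index():
        # The hidden bait link (display:none, rel="nofollow").
        return ('<a href="/secret-trap/" rel="nofollow" '
                'style="display: none;">trap</a> ...rest of the page...')

The matching robots.txt entry would simply be a "Disallow: /secret-trap/" line under "User-agent: *".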

好"和坏"机器人都会触发一些陷阱.你可以将它们与白名单结合起来:

Some traps would be triggered by both 'good' and 'bad' bots. You could combine those with a whitelist (a sketch of this check follows the list):

  1. it triggered a trap
  2. it requested robots.txt?
  3. it doesn't trigger another trap, because it obeyed robots.txt
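
A rough sketch of that whitelist decision, assuming per-IP event histories are available from the access log (the event format and the verdict labels here are invented for illustration):

    # Hypothetical classifier: "events" is one IP's ordered history of
    # ("trap", trap_name) and ("robots", None) entries.
    def classify(events):
        hit_trap = any(kind == "trap" for kind, _ in events)
        read_robots = any(kind == "robots" for kind, _ in events)
        saw_robots = False
        disobeyed_after_robots = False
        for kind, _ in events:
            if kind == "robots":
                saw_robots = True
            elif kind == "trap" and saw_robots:
                # It fetched robots.txt, then still walked into a
                # disallowed trap: it is not obeying the rules it read.
                disobeyed_after_robots = True
        if hit_trap and read_robots and not disobeyed_after_robots:
            return "whitelist"   # behaved politely once it knew the rules
        if hit_trap:
            return "blacklist"   # kept hitting traps / never read robots.txt
        return "unknown"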

One other important thing here is:
Please consider blind people using screen readers: give people a way to contact you, or to solve a (non-image) Captcha to continue browsing.

What methods are there to automatically detect web-crawlers that try to disguise themselves as normal human visitors?

Update
The question is not: how do I catch every crawler. The question is: how can I maximize the chance of detecting a crawler.

Some spiders are really good, and actually parse and understand html, xhtml, css, javascript, VB script etc...
I have no illusions: I won't be able to beat them.

You would however be surprised how stupid some crawlers are, with the best example of stupidity (in my opinion) being: casting all URLs to lower case before requesting them.
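
One way to catch exactly that mistake is to serve intentionally mixed-case URLs and flag requests that only match a real path after case-folding; a toy sketch (the path set is invented):

    # Toy check for the "lowercases every URL" blunder. KNOWN_PATHS is a
    # hypothetical set of the site's real, intentionally mixed-case paths.
    KNOWN_PATHS = {"/Articles/Stealth-Crawlers", "/About"}
    LOWERED = {p.lower(): p for p in KNOWN_PATHS}

    def looks_like_lowercasing_bot(path):
        # Unknown as requested, but a known path once case-folded:
        # a clumsy crawler probably normalised the URL before fetching.
        return path not in KNOWN_PATHS and path in LOWERED

    print(looks_like_lowercasing_bot("/articles/stealth-crawlers"))  # True
    print(looks_like_lowercasing_bot("/About"))                      # False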

And then there is a whole bunch of crawlers that are just 'not good enough' to avoid the various trapdoors.

Answer

A while back, I worked with a smallish hosting company to help them implement a solution to this. The system I developed examined web server logs for excessive activity from any given IP address and issued firewall rules to block offenders. It included whitelists of IP addresses/ranges based on http://www.iplists.com/, which were updated automatically as needed by checking claimed user-agent strings; if a client claimed to be a legitimate spider but was not on the whitelist, the system performed DNS/reverse-DNS lookups to verify that the source IP address corresponded to the claimed owner of the bot. As a failsafe, these actions were reported to the admin by email, along with links to blacklist/whitelist the address in case of an incorrect assessment.
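
The DNS/reverse-DNS verification step can be done with the standard library alone; here is a sketch, under the assumption that each claimed bot has a small table of hostname suffixes its owner controls (the table below is illustrative, not authoritative):

    import socket

    # Illustrative suffix table; consult each search engine's own
    # documentation for the authoritative list.
    CLAIMED_BOT_DOMAINS = {
        "Googlebot": (".googlebot.com", ".google.com"),
        "Slurp": (".crawl.yahoo.net",),
    }

    def verify_claimed_spider(ip, bot_name):
        suffixes = CLAIMED_BOT_DOMAINS.get(bot_name)
        if not suffixes:
            return False
        try:
            host, _, _ = socket.gethostbyaddr(ip)      # reverse DNS (PTR)
        except socket.herror:
            return False                               # no PTR: unverifiable
        if not host.endswith(suffixes):
            return False          # PTR is outside the claimed owner's domain
        try:
            forward_ips = socket.gethostbyname_ex(host)[2]  # forward-confirm
        except socket.gaierror:
            return False
        return ip in forward_ips  # must map back to the original source IP

The forward-confirmation step matters: anyone can create a PTR record claiming to be googlebot.com, but only the real owner can make the forward lookup of that hostname resolve back to the same IP.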

I haven't talked to that client in 6 months or so, but, last I heard, the system was performing quite effectively.

Side point: If you're thinking about doing a similar detection system based on hit-rate-limiting, be sure to use at least one-minute (and preferably at least five-minute) totals. I see a lot of people talking about these kinds of schemes who want to block anyone who tops 5-10 hits in a second, which may generate false positives on image-heavy pages (unless images are excluded from the tally) and will generate false positives when someone like me finds an interesting site that he wants to read all of, so he opens up all the links in tabs to load in the background while he reads the first one.
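
For reference, a minimal sketch of such rate counting over a five-minute window (the threshold below is invented; tune it to your own traffic):

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 300       # five-minute totals, as suggested above
    MAX_HITS_PER_WINDOW = 600  # hypothetical threshold; tune per site

    hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record_hit(ip, now=None):
        # Returns True when this IP exceeds the limit within the window.
        now = time.time() if now is None else now
        window = hits[ip]
        window.append(now)
        while window and window[0] < now - WINDOW_SECONDS:
            window.popleft()   # evict hits older than the window
        return len(window) > MAX_HITS_PER_WINDOW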
