Is this Anti-Scraping technique viable with Robots.txt Crawl-Delay?


Question

I want to prevent web scrapers from aggressively scraping 1,000,000 pages on my website. I'd like to do this by returning a "503 Service Unavailable" HTTP error code to bots that access an abnormal number of pages per minute. I'm not having trouble with form spammers, just with scrapers.
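A minimal sketch of that idea, assuming a single-process Flask app; the framework choice, the 60-second window and the 120-page threshold are all placeholders picked for illustration:

    import time
    from collections import defaultdict, deque

    from flask import Flask, abort, request

    app = Flask(__name__)

    WINDOW_SECONDS = 60
    MAX_PAGES_PER_WINDOW = 120   # placeholder threshold, tune for your traffic
    hits = defaultdict(deque)    # ip -> timestamps of recent requests

    @app.before_request
    def throttle():
        ip = request.remote_addr
        now = time.time()
        recent = hits[ip]
        recent.append(now)
        # Discard timestamps that have fallen out of the window.
        while recent and now - recent[0] > WINDOW_SECONDS:
            recent.popleft()
        if len(recent) > MAX_PAGES_PER_WINDOW:
            abort(503)           # "Service Unavailable" for abnormal request rates

    @app.route("/")
    def index():
        return "ok"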

I don't want search engine spiders to ever receive the error. My inclination is to set a robots.txt crawl-delay, which should keep the number of pages spiders access per minute under my 503 threshold.
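For reference, the directive itself is just two lines in robots.txt; the 10-second delay is only an example value, and crawlers that do not recognise Crawl-delay simply ignore it:

    User-agent: *
    Crawl-delay: 10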

Is this an acceptable solution? Do all major search engines support the crawl-delay directive? Could it negatively affect SEO? Are there any other solutions or recommendations?

Answer

I have built a few scrapers, and the part that always takes the longest is figuring out the site layout and what to scrape and what not. What I can tell you is that changing divs and internal layout will be devastating for all scrapers, as ConfusedMind already pointed out.
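One hedged way to make that concrete is to keep the internal layout unstable on purpose: derive the class names used in your templates from a value you rotate now and then, so any selectors a scraper has hard-coded stop matching. The salt value and naming scheme below are made up for illustration:

    import hashlib

    LAYOUT_SALT = "2024-w23"   # made-up value; rotate it per release or per week

    def css_class(logical_name: str) -> str:
        """Map a stable internal name (e.g. 'price') to an opaque class name
        that changes whenever LAYOUT_SALT changes."""
        digest = hashlib.sha1((LAYOUT_SALT + logical_name).encode()).hexdigest()
        return "c" + digest[:8]

    # Templates would emit css_class("price"), css_class("title"), etc.,
    # so a scraper's hard-coded selectors break on every rotation.
    print(css_class("price"))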

Here is some text for you:

Rate limiting
To rate limit an IP means that you only allow it a certain number of searches in a fixed time frame before blocking it. This may seem like a sure way to prevent the worst offenders, but in reality it is not. The problem is that a large proportion of your users are likely to come through proxy servers or large corporate gateways, which they often share with thousands of other users. If you rate limit a proxy's IP, that limit can easily be triggered when different users behind the proxy use your site. Benevolent bots may also run at higher rates than normal, triggering your limits.

One solution is of course to use a white list, but the problem with that is that you continually need to compile and maintain these lists manually, since IP addresses change over time. Needless to say, data scrapers will simply lower their rates or distribute their searches over more IPs once they realise that you are rate limiting certain addresses.
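If you do combine rate limiting with a white list, the check itself is the easy part; keeping the list current is the maintenance burden described above. A sketch, using placeholder networks from the documentation ranges:

    import ipaddress

    # Placeholder entries; a real list has to be compiled and revisited by hand.
    WHITELIST = [
        ipaddress.ip_network("192.0.2.0/24"),
        ipaddress.ip_network("198.51.100.0/24"),
    ]

    def is_whitelisted(ip: str) -> bool:
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in WHITELIST)

    # In a request hook like the earlier 503 sketch, skip the throttle check
    # when is_whitelisted(request.remote_addr) returns True.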

In order for rate limiting to be effective, and not prohibitive for big users of the site, we usually recommend investigating everyone who exceeds the rate limit before blocking them.

Captcha tests
Captcha tests are a common way of trying to block scraping on web sites. The idea is to display a picture containing some text and numbers that a machine can't read but humans can. This method has two obvious drawbacks. Firstly, captcha tests may be annoying for users if they have to fill out more than one. Secondly, web scrapers can easily do the test manually and then let their script run. Apart from that, a couple of big users of captcha tests have had their implementations compromised.
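As a sketch of how a captcha step can slot into the rate-limit flow instead of a hard 503 block: the /captcha route, over_rate_limit and verify_captcha_answer below are hypothetical placeholders, not a real captcha service.

    from flask import Flask, redirect, request, session

    app = Flask(__name__)
    app.secret_key = "change-me"   # required for session cookies

    def over_rate_limit(ip: str) -> bool:
        # Placeholder: reuse the per-IP counting from the earlier 503 sketch.
        return False

    def verify_captcha_answer(form) -> bool:
        # Placeholder: check the submitted answer against the issued challenge.
        return False

    @app.before_request
    def maybe_challenge():
        if request.path == "/captcha":
            return                      # never block the challenge page itself
        if over_rate_limit(request.remote_addr) and not session.get("human"):
            return redirect("/captcha")

    @app.route("/captcha", methods=["GET", "POST"])
    def captcha():
        if request.method == "POST" and verify_captcha_answer(request.form):
            session["human"] = True     # remember that this client solved it
            return redirect("/")
        return "captcha form goes here"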

Obfuscating source code
Some solutions try to obfuscate the HTTP source code to make it harder for machines to read. The problem with this method is that if a web browser can understand the obfuscated code, so can any other program. Obfuscating source code may also interfere with how search engines see and treat your website. If you decide to implement this, you should do it with great care.

Blacklists
Blacklists consisting of IPs known to scrape the site are not really a method in themselves, since you still need to detect a scraper first in order to blacklist them. Even then it is a blunt weapon, since IPs tend to change over time. In the end you will end up blocking legitimate users with this method. If you still decide to implement blacklists, you should have a procedure to review them at least monthly.
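A sketch of the blacklist idea with that monthly-review advice built in as an expiry date; the file name and line format are made up for illustration:

    import datetime as dt

    MAX_AGE_DAYS = 30   # mirrors the "review at least monthly" advice

    def load_blacklist(path: str = "blacklist.txt") -> set:
        """Each line: '<ip> <YYYY-MM-DD when it was added>'.
        Stale entries are skipped so nobody stays blocked forever by default."""
        today = dt.date.today()
        blocked = set()
        with open(path) as f:
            for line in f:
                ip, added = line.split()
                if (today - dt.date.fromisoformat(added)).days <= MAX_AGE_DAYS:
                    blocked.add(ip)
        return blocked

    # In a request hook, reject the request (e.g. abort(403)) when
    # request.remote_addr is in the loaded blacklist.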
