How to prevent unauthorized spidering


Question

I want to prevent automated HTML scraping from one of our sites while not affecting legitimate spidering (Googlebot, etc.). Is there something that already exists to accomplish this? Am I even using the correct terminology?

Edit: I'm mainly looking to prevent people that would be doing this maliciously, i.e. they aren't going to abide by robots.txt.

Edit 2: What about preventing use by "rate of use", i.e. requiring a CAPTCHA to continue browsing if automation is detected and the traffic isn't from a legitimate (Google, Yahoo, MSN, etc.) IP?
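One way to implement that idea is to rate-limit per IP and only exempt traffic that can be verified as a real search-engine crawler (the major engines document reverse-DNS verification for exactly this purpose). The sketch below is illustrative only; the thresholds, in-memory storage, and crawler domain list are assumptions, not something from the original question.

```python
import socket
import time
from collections import defaultdict, deque

# Assumed thresholds; tune for your own traffic. In production the counters
# would live in shared storage (e.g. Redis), not per-process memory.
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120
_requests = defaultdict(deque)

def is_verified_crawler(ip):
    """Check whether an IP really belongs to a major search engine.

    Google, Bing, and Yahoo all document this reverse-DNS check: the PTR
    record must end in one of their domains, and the forward lookup of
    that hostname must resolve back to the same IP.
    """
    try:
        host, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com",
                          ".search.msn.com", ".crawl.yahoo.net")):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False

def needs_captcha(ip):
    """Return True when an unverified client exceeds the request-rate threshold."""
    now = time.time()
    timestamps = _requests[ip]
    timestamps.append(now)
    # Drop timestamps that have fallen out of the sliding window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    if len(timestamps) <= MAX_REQUESTS_PER_WINDOW:
        return False
    return not is_verified_crawler(ip)
```

A request handler would call `needs_captcha()` on each hit and redirect to a CAPTCHA page when it returns True; verified crawlers are never challenged.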

Recommended Answer

This is difficult, if not impossible, to accomplish. Many "rogue" spiders/crawlers do not identify themselves via the user-agent string, so it is hard to pick them out. You can try to block them by IP address, but it is difficult to keep up with adding new addresses to your block list. Blocking by IP can also shut out legitimate users, since proxies make many different clients appear as a single IP address.
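For what it's worth, the blocking described above usually boils down to a check like the following. The user-agent substrings and IP range here are made-up examples; maintaining real lists is exactly the chore this answer warns about.

```python
from ipaddress import ip_address, ip_network

# Illustrative placeholders only -- real blocklists need constant upkeep.
BLOCKED_AGENT_SUBSTRINGS = ("python-requests", "scrapy", "curl")
BLOCKED_NETWORKS = [ip_network("198.51.100.0/24")]   # example range

def is_blocked(remote_ip, user_agent):
    """Return True if the request matches the user-agent or IP blocklist."""
    ua = (user_agent or "").lower()
    if any(s in ua for s in BLOCKED_AGENT_SUBSTRINGS):
        return True
    addr = ip_address(remote_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)
```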

The problem with using robots.txt in this situation is that the spider can simply choose to ignore it.
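That's because robots.txt is enforced entirely on the client side. A polite crawler checks it before fetching, roughly as sketched below (the URL and user-agent name are placeholders); a malicious scraper simply skips this step.

```python
from urllib import robotparser

# A well-behaved crawler consults robots.txt before requesting a page.
rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()
allowed = rp.can_fetch("MyCrawler/1.0", "https://example.com/private/")
print(allowed)   # only polite crawlers honor this result
```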

Edit: Rate limiting is a possibility, but it suffers from the same problems of identifying (and keeping track of) "good" and "bad" user agents/IPs. In a system we wrote to do some internal page-view/session counting, we eliminate sessions based on page-view rate, but we don't worry about eliminating "good" spiders, since we don't want them counted in the data either. We don't do anything about preventing any client from actually viewing the pages.
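A rough sketch of that page-view-rate idea might look like this; the thresholds and session handling are assumptions, not the actual system described in the answer.

```python
import time

RATE_THRESHOLD = 2.0   # assumed: sustained > 2 pages/second looks automated
MIN_VIEWS = 20         # don't judge a session on its first few hits

class SessionStats:
    """Tracks page views for one session and flags automated-looking rates."""

    def __init__(self):
        self.first_seen = time.time()
        self.views = 0

    def record_view(self):
        self.views += 1

    def looks_automated(self):
        elapsed = max(time.time() - self.first_seen, 1e-6)
        return self.views >= MIN_VIEWS and self.views / elapsed > RATE_THRESHOLD

sessions = {}

def record(session_id):
    """Record a page view; the caller decides what to do with flagged sessions
    (drop them from the stats, challenge with a CAPTCHA, etc.)."""
    stats = sessions.setdefault(session_id, SessionStats())
    stats.record_view()
    return stats.looks_automated()
```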
