Protection from screen scraping

Question

Following on from my question on the legalities of screen scraping: even if it's illegal, people will still try, so:

What technical mechanisms can be employed to prevent or at least disincentivise screen scraping?

Oh, and just for grins and to make life difficult, it would be nice to retain access for search engines. I may well be playing devil's advocate here, but there is a serious underlying point.

Recommended answer

So, one approach would be to obfuscate the page content (rot13, or something), and then have some JavaScript in the page that does something like document.write(unobfuscate(obfuscated_page)). But this totally blows away search engines (probably!).
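As a minimal sketch of the server-side half of this idea, assuming the page is prepared in Python: the standard library's rot_13 codec produces the obfuscated string, and applying it a second time restores the original. The `obfuscate` helper and the sample markup are hypothetical; the in-page `unobfuscate` function would be its JavaScript counterpart.

```python
import codecs


def obfuscate(html: str) -> str:
    """Rotate ASCII letters by 13 places; digits and punctuation are unchanged.

    Hypothetical helper name, not part of any framework.
    """
    return codecs.encode(html, "rot_13")


page = "<p>Price: 42 USD</p>"      # made-up sample content
obfuscated = obfuscate(page)       # what would be embedded in the served page
restored = obfuscate(obfuscated)   # rot13 is its own inverse

print(obfuscated)
print(restored == page)
```

Because rot13 is trivially reversible, this only raises the effort needed by a scraper that does not execute JavaScript; it is not a security measure.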

Of course, this doesn't actually stop someone who wants to steal your data either, but it does make it harder.

Once the client has the data it is pretty much game over, so you need to look at something on the server side.

Given that search engines are basically screen scrapers, things are difficult. You need to look at what the differences between the good screen scrapers and the bad screen scrapers are. And of course, you have the normal human users as well. So this comes down to a problem of how you can, on the server, effectively classify a request as coming from a human, a good screen scraper, or a bad screen scraper.

So, the place to start would be looking at your log files and seeing if there is some pattern that allows you to effectively classify requests. Then, having determined the pattern, see if there is some way that a bad screen scraper, upon knowing this classification, could cloak itself to appear like a human or a good screen scraper.
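A first pass over the logs could be as simple as counting requests per client address. The sketch below assumes access-log lines whose first space-separated field is the client IP (as in the Common Log Format); the sample lines, the addresses, and the `requests_per_ip` name are all made up for illustration.

```python
from collections import Counter

# Made-up sample lines; the addresses come from documentation-only ranges.
SAMPLE_LOG = """\
198.51.100.7 - - [10/Oct/2023:13:55:36 +0000] "GET /item/1 HTTP/1.1" 200 512
198.51.100.7 - - [10/Oct/2023:13:55:36 +0000] "GET /item/2 HTTP/1.1" 200 498
203.0.113.9 - - [10/Oct/2023:13:56:01 +0000] "GET / HTTP/1.1" 200 1024
198.51.100.7 - - [10/Oct/2023:13:55:37 +0000] "GET /item/3 HTTP/1.1" 200 505
"""


def requests_per_ip(log_text: str) -> Counter:
    """Count log lines per client IP (the first field of each line)."""
    return Counter(
        line.split(" ", 1)[0]
        for line in log_text.splitlines()
        if line.strip()
    )


counts = requests_per_ip(SAMPLE_LOG)
for ip, hits in counts.most_common():
    print(ip, hits)
```

Request counts alone are a crude signal; in practice you would probably also look at timing, user agents, and which pages are requested in what order.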

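Some ideas:

  • You may be able to identify good screen scrapers by IP address.
  • You may be able to distinguish scrapers from humans by the number of concurrent connections, the total number of connections per time period, the access pattern, etc.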

Obviously these aren't ideal or foolproof. Another tactic is to determine what measures you can take that are unobtrusive to humans but (possibly) annoying for scrapers. An example might be slowing down responses as the number of requests grows. (How much this hurts depends on the time criticality of the requests: if they are scraping in real time, it would affect their end users.)
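One way to express such a slowdown is a delay that stays at zero for ordinary usage and grows with the number of recent requests. This is a sketch only: `throttle_delay` and all of its default values are assumptions chosen for illustration.

```python
def throttle_delay(
    requests_in_window: int,
    free_requests: int = 30,     # assumed: requests served at full speed
    step_seconds: float = 0.5,   # assumed: extra delay per excess request
    max_delay: float = 10.0,     # assumed: cap so connections are not held forever
) -> float:
    """Seconds to wait before answering, given the client's recent request count."""
    excess = max(0, requests_in_window - free_requests)
    return min(max_delay, excess * step_seconds)


# A handler would sleep for this long (for example with time.sleep or
# asyncio.sleep) before sending the response.
for count in (10, 34, 1000):
    print(count, throttle_delay(count))
```

Holding connections open costs server resources too, so the cap matters as much as the delay itself.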

The other aspect is to look at serving these users better. Clearly they are scraping because they want the data. If you provide them with an easy way to obtain the data directly in a useful format, that will be easier for them than screen scraping. And once there is an easy way, access to the data can be regulated. E.g., give requesters a unique key, and then limit the number of requests per key to avoid overloading the server, or charge per 1000 requests, etc.
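The key-plus-quota idea might look roughly like this. It is an in-memory sketch: the `KeyQuota` class, its method names, and the limit are hypothetical, and a real service would persist the counters and reset them on a schedule.

```python
import secrets


class KeyQuota:
    """Issue API keys and enforce a per-key request limit for each period."""

    def __init__(self, limit_per_period: int):
        self.limit = limit_per_period
        self.used = {}  # key -> requests used in the current period

    def issue_key(self) -> str:
        key = secrets.token_hex(16)  # 32 hex characters, hard to guess
        self.used[key] = 0
        return key

    def allow(self, key: str) -> bool:
        """Record one request; return False for unknown or exhausted keys."""
        if key not in self.used or self.used[key] >= self.limit:
            return False
        self.used[key] += 1
        return True

    def reset_period(self) -> None:
        """Start a new period (for example, called once per day)."""
        for key in self.used:
            self.used[key] = 0


quota = KeyQuota(limit_per_period=2)
key = quota.issue_key()
print([quota.allow(key) for _ in range(3)])
```

The per-key counters are also what a per-1000-requests charge would be based on, if you chose to count usage instead of capping it.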

Of course, there are still people who will want to rip you off, and there are probably other ways to disincentivise them, but those start being non-technical and require legal avenues to be pursued.
