I'm being scraped, how can I prevent this?


Problem description

Running IIS 7, a couple of times a week I see a huge number of hits on Google Analytics from one geographical location. The sequence of URLs they are viewing is clearly generated by some algorithm, so I know I'm being scraped for content. Is there any way to prevent this? It's so frustrating that Google doesn't just give me an IP.

Recommended answer

There are plenty of techniques in the anti-scraping world; I'll categorize the main ones below. If you find something missing in my answer, please comment.

Blocking suspicious IPs works, but today most scraping is done through IP proxies, so in the long run it isn't effective. In your case you get requests from the same IP geolocation, so if you ban that IP, the scrapers will surely switch to IP proxies, staying IP-independent and undetected.
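The IP-blocking idea could be sketched as a sliding-window request budget with a ban list. This is an illustrative sketch only — the thresholds and function names are assumptions, not anything IIS provides out of the box:

```python
import time
from collections import defaultdict, deque

# Illustrative thresholds -- tune for real traffic.
WINDOW_SECONDS = 60
MAX_REQUESTS = 100

_hits = defaultdict(deque)   # ip -> timestamps of recent requests
_banned = set()

def allow_request(ip, now=None):
    """Return False once an IP exceeds its request budget in the window."""
    now = time.time() if now is None else now
    if ip in _banned:
        return False
    q = _hits[ip]
    q.append(now)
    while q and now - q[0] > WINDOW_SECONDS:   # drop stale timestamps
        q.popleft()
    if len(q) > MAX_REQUESTS:
        _banned.add(ip)                        # permanent ban in this sketch
        return False
    return True
```

As the answer notes, a scraper behind rotating proxies sidesteps this entirely, since each request arrives from a fresh IP.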

Using a DNS firewall is another anti-scraping measure. In short, you place your web service behind a private domain name server (DNS) network that filters and blocks bad requests before they reach your server. Some companies offer this sophisticated measure for complex website protection; you might look deeper into an example of such a service.

As you've mentioned, you've detected the algorithm by which a scraper crawls your URLs. You could have a custom script that tracks request URLs and, based on that, turns on protection measures. For this you have to run a [shell] script in IIS. A side effect is that system response time will increase, slowing down your service. Note also that the algorithm you've detected might simply be changed, defeating this measure.
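The URL-tracking script could be sketched as a simple heuristic that flags clients enumerating sequential ids. This is an assumed pattern for illustration, not the asker's actual detection logic; the history length and regex are arbitrary choices:

```python
import re
from collections import deque

HISTORY = 5   # how many recent URLs to inspect per client (illustrative)

def make_tracker():
    """One tracker per client (e.g. keyed by session or IP)."""
    return deque(maxlen=HISTORY)

def looks_like_enumeration(history, url):
    """Flag clients whose recent URLs carry strictly increasing numeric ids."""
    history.append(url)
    if len(history) < HISTORY:
        return False
    ids = []
    for u in history:
        m = re.search(r"(\d+)", u)
        if not m:                      # a non-numeric URL breaks the pattern
            return False
        ids.append(int(m.group(1)))
    return all(b > a for a, b in zip(ids, ids[1:]))
```

A human clicking around mixes article pages with the home page and search results, so the strict monotone-id pattern rarely fires on real browsing.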

You might limit the frequency of requests or the amount of downloadable data. The restrictions must be applied with the usability for a normal user in mind. Compared to a scraper's insistent requests, you might set your web service rules to drop or delay the unwanted activity. Yet if the scraper is reconfigured to imitate common user behaviour (through some nowadays well-known tools: Selenium, Mechanize, iMacros), this measure will fail.
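The request-frequency limit could be sketched as a token bucket, which lets a normal user's small bursts through while throttling sustained scraping. The rate and capacity values here are illustrative assumptions:

```python
import time

class TokenBucket:
    """Drop (or delay) requests once a client's burst allowance is spent."""

    def __init__(self, rate=2.0, capacity=10.0):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # burst allowance
        self.tokens = capacity
        self.last = None          # timestamp of the previous request

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        if self.last is not None:
            elapsed = now - self.last
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False   # drop here, or schedule a delayed response instead
```

One bucket per client keeps a casual visitor unaffected, while a scraper hammering the site burns through its burst allowance and then gets only `rate` requests per second.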

Cutting off session time is a good measure, but modern scrapers usually perform session authentication, so it is not that effective.

Captchas are an old technique that for the most part does solve the scraping problem. Yet if your scraping opponent leverages one of the anti-captcha services, this protection will most likely be defeated.

With a JavaScript challenge, JavaScript code arrives at the client (the user's browser or the scraping server) before or along with the requested HTML content. This code computes a certain value and returns it to the target server. Based on this test, the HTML might be deliberately malformed or not sent to the requester at all, shutting malicious scrapers out. The logic might be placed in one or more loadable JavaScript files. It can be applied not only to the whole content but also to just certain parts of the site's content (e.g. prices). To bypass this measure, scrapers might need to turn to even more complex scraping logic (usually JavaScript-capable) that is highly customizable and thus costly.
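The server side of such a test could be sketched as follows. The nonce-plus-digest scheme, the cookie name, and the function names are all assumptions for illustration — real challenge systems use far more elaborate fingerprinting:

```python
import hashlib
import secrets

def issue_challenge():
    """Embed a nonce in the page; the served JS must compute its digest."""
    nonce = secrets.token_hex(8)
    expected = hashlib.sha256(nonce.encode()).hexdigest()[:16]
    # The page would inline something like (hypothetical):
    #   <script>document.cookie = "js_token=" + sha256Hex("<nonce>").slice(0, 16)</script>
    return nonce, expected

def verify(nonce, token):
    """Serve real content only if the client echoed the correct digest."""
    return token == hashlib.sha256(nonce.encode()).hexdigest()[:16]
```

A scraper that fetches raw HTML without executing the script never sets the token, so its next request fails `verify` and can be served degraded or empty content.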

Serving content as images is a method of content protection that is widely used today. It does prevent scrapers from collecting data. Its side effect is that data obfuscated as images is hidden from search-engine indexing, downgrading the site's SEO. If scrapers leverage an OCR system, this kind of protection can again be bypassed.
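As a rough illustration of serving a sensitive value (e.g. a price) as an image rather than text: an inline SVG data URI is just one assumed encoding — production sites more often pre-render raster images — but it shows how the digits disappear from the HTML a naive scraper sees:

```python
import base64

def price_as_svg_data_uri(price):
    """Return the price wrapped in a base64 SVG data URI instead of plain text."""
    svg = ('<svg xmlns="http://www.w3.org/2000/svg" width="80" height="20">'
           f'<text x="0" y="15" font-size="14">{price}</text></svg>')
    payload = base64.b64encode(svg.encode()).decode()
    return f"data:image/svg+xml;base64,{payload}"
```

The markup then contains only `<img src="data:image/svg+xml;base64,...">`, so a text-oriented scraper extracts nothing, while an OCR-equipped one (or one that simply base64-decodes the URI) still can.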

Frequently changing the HTML structure is a fairly effective way to protect against scraping. It works by changing not just element ids and classes but the entire hierarchy; the latter involves restructuring the styling, imposing additional costs. The scraper side, of course, must adapt to the new structure if it wants to keep scraping the content. There are few side effects if your service can afford it.
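The id/class rotation part could be sketched by deriving selectors from a per-release salt at render time, so every deployment breaks any scraper that relies on fixed CSS selectors. The names and the salting scheme are assumptions for illustration:

```python
import hashlib

def obfuscated_class(logical_name, salt):
    """Map a stable logical name to a class name that changes with the salt."""
    digest = hashlib.sha256(f"{salt}:{logical_name}".encode()).hexdigest()
    return "c" + digest[:8]   # e.g. "c3fa91b20"-style opaque class names

def render_price_row(price, salt):
    """Templates reference logical names; emitted markup gets opaque ones."""
    cls = obfuscated_class("price", salt)
    return f'<span class="{cls}">{price}</span>'
```

Rotating the salt on each build (or each response) renames every selector at once, while your own stylesheets, generated from the same mapping, stay consistent.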
