I'm being scraped, how can I prevent this?


Problem Description

Running IIS 7, a couple of times a week I see a huge number of hits in Google Analytics from one geographical location. The sequence of URLs they are viewing is clearly being generated by some algorithm, so I know I'm being scraped for content. Is there any way to prevent this? It's so frustrating that Google doesn't just give me an IP.

Recommended Answer

There are plenty of techniques in the anti-scraping world; I'll just categorize them here. If you find something missing in my answer, please comment.

Blocking suspicious IPs works well, but today most scraping is done through IP proxies, so in the long run it won't be effective. In your case the requests come from the same IP geolocation, so if you ban that IP the scrapers will surely switch to IP proxying, staying IP-independent and undetected.
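
Before banning anything, it helps to see which IPs actually behave like bots. Here is a minimal Python sketch that mines a server log for heavy hitters; the log file name, the position of the client-IP field, and the threshold are illustrative assumptions, not something from the answer above.

```python
# Minimal sketch: flag suspicious IPs from an IIS W3C log.
# Assumptions (illustrative): the client IP (c-ip) is the 9th
# space-separated field in your #Fields definition, and any IP making
# more than THRESHOLD requests in one log file is considered suspicious.
from collections import Counter

THRESHOLD = 1000          # requests per log file; tune for your traffic
C_IP_FIELD = 8            # index of c-ip in your W3C #Fields definition

def suspicious_ips(log_path: str) -> list[str]:
    hits = Counter()
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            if line.startswith("#"):      # skip W3C header lines
                continue
            fields = line.split()
            if len(fields) > C_IP_FIELD:
                hits[fields[C_IP_FIELD]] += 1
    return [ip for ip, count in hits.items() if count > THRESHOLD]

if __name__ == "__main__":
    for ip in suspicious_ips("u_ex230101.log"):
        print(ip)  # candidates for a deny rule in IIS IP restrictions
```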

Using a DNS firewall is another anti-scraping measure. In short, this means putting your web service behind a private domain name server (DNS) network that filters and blocks bad requests before they reach your server. Some companies offer this sophisticated measure for complex website protection, and you might dig deeper by looking at an example of such a service.

As you've mentioned, you've detected the algorithm a scraper uses to crawl your URLs. You can run a custom script that tracks the requested URLs and, based on that, turns protection measures on; for this you have to activate a [shell] script in IIS. A side effect might be increased system response time, slowing down your service. Also keep in mind that the algorithm you've detected might change, leaving this measure ineffective.
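
For illustration, here is a minimal sketch of such a tracking script in Python, assuming a hypothetical case where the scraper walks sequential numeric item IDs; the URL scheme, names, and run length are invented for the example.

```python
# Minimal sketch of URL-pattern tracking (assumed scenario: the scraper
# walks numeric IDs such as /item/1001, /item/1002, ...). The regex and
# thresholds are illustrative, not part of the original answer.
import re
from collections import defaultdict

ID_PATTERN = re.compile(r"/item/(\d+)")   # hypothetical URL scheme
RUN_LENGTH = 20                            # consecutive IDs that look robotic

last_id = defaultdict(lambda: None)
run = defaultdict(int)

def looks_algorithmic(client_ip: str, url: str) -> bool:
    """Return True once an IP has requested RUN_LENGTH consecutive item IDs."""
    match = ID_PATTERN.search(url)
    if not match:
        return False
    current = int(match.group(1))
    prev = last_id[client_ip]
    run[client_ip] = run[client_ip] + 1 if prev == current - 1 else 1
    last_id[client_ip] = current
    return run[client_ip] >= RUN_LENGTH
```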

You might set a limit on the frequency of requests or on the amount of downloadable data. The restrictions must be applied with a normal user's experience in mind. Compared to a scraper's insistent requests, you might set web service rules to drop or delay the unwanted activity. Yet if the scraper is reconfigured to imitate common user behaviour (through today's well-known tools: Selenium, Mechanize, iMacros), this measure will fail.
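
A sliding-window limiter is one common way to express such rules. The sketch below is a minimal in-process version; the window size and request budget are illustrative, and a real deployment would usually enforce this in the web server or a reverse proxy rather than in application code.

```python
# Minimal sliding-window rate limiter sketch (parameters are assumptions).
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 120        # generous enough for a normal user

requests = defaultdict(deque)   # client IP -> timestamps of recent requests

def allow(client_ip: str) -> bool:
    now = time.monotonic()
    window = requests[client_ip]
    # Drop timestamps that have fallen out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False          # drop or delay this request
    window.append(now)
    return True
```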

Shortening session lifetimes is a good measure in itself, but modern scrapers usually perform session authentication, so cutting off session time is not that effective.
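
For what it's worth, a session time-to-live check can be as small as the following sketch; the TTL value is an arbitrary assumption.

```python
# Minimal sketch of a session time-to-live check (TTL is an assumption).
import time

SESSION_TTL_SECONDS = 15 * 60

def session_valid(created_at: float) -> bool:
    """A session older than the TTL forces re-authentication."""
    return time.time() - created_at < SESSION_TTL_SECONDS
```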

Captchas are an old-time technique that for the most part does solve the scraping problem. Yet if your scraping opponent leverages one of the anti-captcha services, this protection will most likely be defeated.
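
As a toy illustration only, a challenge-and-verify flow might look like the sketch below; real deployments use a captcha service rather than a home-grown arithmetic question like this.

```python
# Toy captcha sketch (purely illustrative; not a production technique).
import random

def make_challenge() -> tuple[str, int]:
    a, b = random.randint(1, 9), random.randint(1, 9)
    return f"What is {a} + {b}?", a + b

def verify(answer: str, expected: int) -> bool:
    return answer.strip() == str(expected)
```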

JavaScript code should arrive at the client (the user's browser or the scraping server) prior to, or along with, the requested HTML content. This code computes and returns a certain value to the target server. Based on this test, the HTML might be served malformed, or might not be sent to the requester at all, thus shutting malicious scrapers out. The logic might be placed in one or more JavaScript-loadable files, and it might be applied not just to the whole content but only to certain parts of the site's content (e.g. prices). To bypass this measure, scrapers might need to turn to even more complex, JavaScript-capable scraping logic that is highly customizable and thus costly.
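
A minimal sketch of the server side of such a test follows; the token scheme, the cookie name, and the `sha256Hex` helper assumed on the client are all inventions for illustration, not a concrete design from the answer above.

```python
# Minimal sketch of a server-side JavaScript challenge. The server sends a
# nonce plus a tiny script; a real browser executes it and echoes the
# expected token back (here via a cookie) before the full HTML is served.
import hashlib
import secrets

def issue_challenge() -> tuple[str, str, str]:
    nonce = secrets.token_hex(8)
    expected = hashlib.sha256(nonce.encode()).hexdigest()[:16]
    # Hypothetical client-side snippet the server would embed in the page
    # (assumes a sha256Hex helper shipped in the page's JavaScript):
    script = (
        "<script>document.cookie = 'js_token=' + "
        f"sha256Hex('{nonce}').slice(0, 16);</script>"
    )
    return nonce, expected, script

def verify_token(token_from_cookie: str, expected: str) -> bool:
    """Serve the real HTML only when the browser computed the right token."""
    return secrets.compare_digest(token_from_cookie, expected)
```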

Obfuscating content as images is a method of protection widely used today, and it does prevent scrapers from collecting the data. Its side effect is that data hidden inside images is invisible to search engine indexing, which downgrades the site's SEO. And if the scraper leverages an OCR system, this kind of protection might again be bypassed.
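
For example, rendering a price as an image can be done with Pillow; the dimensions, font, and price value below are illustrative assumptions.

```python
# Minimal sketch of rendering a price as an image with Pillow
# (pip install pillow). Size and value are illustrative.
from PIL import Image, ImageDraw

def price_as_image(price: str, path: str) -> None:
    img = Image.new("RGB", (120, 32), "white")
    draw = ImageDraw.Draw(img)
    draw.text((8, 8), price, fill="black")   # default bitmap font
    img.save(path)

price_as_image("$19.99", "price.png")
# The page then embeds <img src="price.png"> instead of the text itself.
```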

Regularly changing the HTML structure is a very effective way to protect against scraping. It works best when you change not just element ids and classes but the entire hierarchy, though the latter involves restructuring your styling and thus imposes additional costs. The scraper side must then adapt to the new structure if it wants to keep scraping your content. There are not many side effects if your service can afford it.
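
One way to approximate this is to randomize class names on every deployment, so scrapers can't rely on stable selectors. The sketch below assumes a hypothetical template marker scheme; the marker syntax and logical names are invented for the example.

```python
# Minimal sketch of per-deployment class-name randomization.
import secrets

LOGICAL_CLASSES = ["price", "title", "product-row"]

def build_class_map() -> dict[str, str]:
    """Map stable logical names to fresh random names on every deploy."""
    return {name: f"c{secrets.token_hex(4)}" for name in LOGICAL_CLASSES}

def render(template: str, class_map: dict[str, str]) -> str:
    # Templates reference logical names like {{class:price}}; we substitute
    # the randomized names at render time.
    for logical, randomized in class_map.items():
        template = template.replace("{{class:" + logical + "}}", randomized)
    return template

classes = build_class_map()
html = render('<span class="{{class:price}}">$19.99</span>', classes)
print(html)
```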

