The way to detect web scraping

Question

I need to detect scraping of information on my website. I tried detection based on behavior patterns, and it seems promising, although it is relatively computation-heavy.

The basic idea is to collect the request timestamps of a given client and compare its behavior pattern with a common pattern or a precomputed pattern.

To be more precise, I collect the time intervals between requests into an array, indexed by a function of the interval:

i = (integer) ln(interval + 1) / ln(N + 1) * N + 1
Y[i]++
X[i]++ for current client

where N is the time (bucket count) limit; intervals greater than N are dropped. Initially X and Y are filled with ones.
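
For illustration, here is a minimal Python sketch of this bucketing step as I read the description above; the names record_request and last_seen, the concrete value of N, and the per-client storage are my own assumptions, not part of the original post:

import math
import time
from collections import defaultdict

N = 30  # assumed bucket/time limit; the post does not fix a concrete value

# Y is the common histogram, X[client_id] the per-client histogram.
# Both start filled with ones, as described above.
Y = [1] * (N + 2)
X = defaultdict(lambda: [1] * (N + 2))
last_seen = {}  # client_id -> timestamp of the previous request

def record_request(client_id, now=None):
    # Update the interval histograms for one incoming request.
    now = time.time() if now is None else now
    prev = last_seen.get(client_id)
    last_seen[client_id] = now
    if prev is None:
        return  # first request from this client: no interval yet
    interval = now - prev
    if interval > N:
        return  # intervals greater than N are dropped
    # i = (integer) ln(interval + 1) / ln(N + 1) * N + 1  (one reading of where the cast applies)
    i = int(math.log(interval + 1) / math.log(N + 1) * N) + 1
    Y[i] += 1
    X[client_id][i] += 1

record_request(client_id) would be called once for every incoming request from that client.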

Then, after I have collected enough samples in X and Y, it is time to make a decision. The criterion is the parameter C:

C = sqrt(sum((X[i]/norm(X) - Y[i]/norm(Y))^2) / k)

where X is the data of a particular client, Y is the common data, norm() is a calibration function, and k is a normalization coefficient that depends on the type of norm(). There are 3 types:

  1. norm(X) = sum(X)/count(X), k = 2
  2. norm(X) = sqrt(sum(X[i]^2)), k = 2
  3. norm(X) = max(X[i]), k is the square root of the number of non-empty elements of X

C is in the range (0..1); 0 means no behavioral deviation and 1 is the maximum deviation.

Calibration of type 1 is best for repeating requests, type 2 for repeating requests with a few distinct intervals, and type 3 for non-constant request intervals.
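
Continuing the sketch above, here is one possible reading of the decision step in Python; the function name deviation, the norm_type switch, and the interpretation of "non-empty elements" as buckets above their initial value of 1 are assumptions:

def deviation(client_id, norm_type=1):
    # Compare the client's histogram X[client_id] against the common histogram Y
    # using the criterion C described above.
    x, y = X[client_id], Y
    n_buckets = len(x)

    def norm(v):
        if norm_type == 1:
            return sum(v) / len(v)                   # type 1: mean
        if norm_type == 2:
            return math.sqrt(sum(e * e for e in v))  # type 2: Euclidean norm
        return max(v)                                # type 3: maximum

    if norm_type == 3:
        # "non-empty" read here as buckets above their initial value of 1 (assumption)
        nonempty = sum(1 for e in x if e > 1)
        k = math.sqrt(nonempty) if nonempty else 1.0
    else:
        k = 2.0

    nx, ny = norm(x), norm(y)
    return math.sqrt(sum((x[i] / nx - y[i] / ny) ** 2 for i in range(n_buckets)) / k)

A client could then be flagged, for example, once deviation(client_id) exceeds a chosen threshold, since 0 means no deviation from the common pattern.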

What do you think? I would appreciate it if you would try this on your services.

Answer

To be honest, your approach is completely worthless because it is trivial to bypass. An attacker doesn't even have to write a line of code to get around it: proxy servers are free, and you can boot up a new machine with a new IP address on Amazon EC2 for 2 cents an hour.

A better approach is Roboo, which uses cookie techniques to foil robots. The vast majority of robots can't run JavaScript or Flash, and this can be used to your advantage.
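
To make the idea concrete (this is a generic illustration of a JavaScript cookie challenge, not how Roboo, an nginx module, actually implements it; the route, cookie name, and token value are made up): the server withholds content until the client proves it can execute JavaScript by setting a cookie.

from flask import Flask, request, make_response

app = Flask(__name__)

# Hypothetical cookie name and expected value; a real challenge would use
# a signed, per-session token computed by client-side JavaScript.
COOKIE_NAME = "js_challenge"
EXPECTED = "42"

CHALLENGE_PAGE = """
<script>
  /* Only a JavaScript-capable client will set the cookie and reload. */
  document.cookie = "js_challenge=42; path=/";
  location.reload();
</script>
"""

@app.route("/data")
def data():
    if request.cookies.get(COOKIE_NAME) != EXPECTED:
        # No valid cookie yet: serve the JS challenge instead of the content.
        return make_response(CHALLENGE_PAGE, 200)
    return "protected content"

Any client that actually executes the script, such as a headless browser, defeats this immediately, which is the answer's point below about security through obscurity.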

However, all of this is "(in)security through obscurity", and the ONLY REASON it might work is that your data isn't worth a programmer spending 5 minutes on it (Roboo included).
