How do websites detect bots?

Question

I am learning Python and am currently scraping Reddit. Somehow Reddit has figured out that I am a bot (which my software actually is), but how does it know that? And how can we trick it into thinking that we are normal users?

I have found a practical solution for that, but I am asking for a more in-depth theoretical understanding.

Recommended answer

Websites and online services use a large array of techniques to detect and combat bots and scrapers. At the core of all of them are heuristics and statistical models that can identify non-human-like behavior. Things such as:

  • Total number of requests from a certain IP per specific time frame. For example, anything more than 50 requests per second, 500 per minute, or 5000 per day may seem suspicious or even malicious. Counting the number of requests per IP per unit of time is a very common, and arguably effective, technique.

  • Regularity of the incoming request rate. For example, a sustained flow of 10 requests per second may look like a robot programmed to make a request, wait a little, make the next request, and so on.

  • HTTP headers. Browsers send a predictable User-Agent header with each request that helps the server identify their vendor, version, and other information. In combination with other headers, a server might be able to figure out that requests are coming from an unknown or otherwise exploitative source.

  • A stateful combination of authentication tokens, cookies, encryption keys, and other ephemeral pieces of information that require subsequent requests to be formed and submitted in a special manner. For example, the server may send down a certain key (via cookies, headers, the response body, etc.) and expect your browser to include or otherwise use that key in the subsequent requests it makes to the server. If too many requests fail to satisfy that condition, it is a telltale sign that they might be coming from a bot.

  • Mouse and keyboard tracking techniques. If the site's developers know that a certain API should only be called when the user clicks a certain button, they can write front-end code to ensure that the proper mouse activity is detected (i.e. the user actually clicked the button) before the API request is made.
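As an illustration of the first item in the list, per-IP request counting can be sketched with a sliding time window. This is only a minimal sketch, not how Reddit or any particular site does it: the class name `SlidingWindowCounter`, the IP addresses, and the limit of 50 requests per second (borrowed from the example figures above) are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts requests per IP within a trailing time window (hypothetical helper)."""

    def __init__(self, window_seconds: float, max_requests: int):
        self.window_seconds = window_seconds
        self.max_requests = max_requests
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def is_suspicious(self, ip: str, now: float) -> bool:
        """Record one request at time `now`; return True if the IP exceeds the limit."""
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the trailing window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


# Assumed threshold: more than 50 requests within one second looks suspicious.
counter = SlidingWindowCounter(window_seconds=1.0, max_requests=50)

# Simulate one client sending a request every 10 ms (timestamps in seconds).
flags = [counter.is_suspicious("203.0.113.7", now=i * 0.01) for i in range(60)]

# A different client making a single request is not affected.
other = counter.is_suspicious("198.51.100.1", now=0.6)
```

In this simulation the first 50 requests pass and every request after that is flagged, while the second IP stays unflagged. A real deployment would likely keep these counters in a shared store rather than in process memory, and would combine this signal with others.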

And many, many more techniques. Imagine you are the person trying to detect and block bot activity. What approaches would you take to ensure that requests are coming from human users? How would you define human behavior as opposed to bot behavior, and what metrics could you use to discern the two?
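One candidate metric for telling the two apart is how regular the gaps between requests are. The sketch below computes the coefficient of variation (standard deviation divided by mean) of the inter-arrival times; a value near zero suggests a fixed-delay loop. The function names, the threshold of 0.1, and the sample timestamps are assumptions chosen for illustration, not values taken from any real detection system.

```python
from statistics import mean, pstdev


def interarrival_cv(timestamps):
    """Coefficient of variation of the gaps between consecutive request times."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three timestamps")
    avg = mean(gaps)
    if min(gaps) < 0 or avg == 0:
        raise ValueError("timestamps must be non-decreasing and not all identical")
    return pstdev(gaps) / avg


def looks_scripted(timestamps, cv_threshold=0.1):
    """Flag traffic whose request spacing is almost perfectly regular.

    The 0.1 threshold is an assumed value for this sketch.
    """
    return interarrival_cv(timestamps) < cv_threshold


# A client that sends a request every 100 ms (made-up data).
bot_like = [i * 0.1 for i in range(50)]

# A client with irregular pauses, as a person browsing might produce (made-up data).
human_like = [0.0, 1.2, 1.9, 7.5, 8.1, 20.4, 21.0, 45.3]

bot_flag = looks_scripted(bot_like)
human_flag = looks_scripted(human_like)
```

Here the evenly spaced client is flagged and the irregular one is not. A scraper that adds random delays would defeat this particular check, which is one reason detection systems tend to combine several signals.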

There is a question of practicality as well: some approaches are more costly and more difficult to implement. The question then becomes: to what extent (how reliably) do you need to detect and block bot activity? Are you combating bots that are trying to hack into user accounts? Or do you simply need to prevent them (perhaps in a best-effort manner) from scraping some data from otherwise publicly visible web pages? What would you do in the case of false-negative and false-positive detections? These questions inform the complexity and ingenuity of the approach you might take to identify and block bot activity.
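The false-positive versus false-negative trade-off can be made concrete with a small sketch. Assume each client has already been given a "bot score" between 0 and 1 by some detector, and that requests scoring at or above a threshold are blocked. The scores, labels, and thresholds below are made up for illustration only.

```python
def confusion_counts(samples, threshold):
    """Return (false_positives, false_negatives) for a blocking threshold.

    samples: iterable of (score, is_bot) pairs, where a score at or above
    the threshold means the client would be blocked.
    """
    samples = list(samples)
    # Humans that would be blocked.
    fp = sum(1 for score, is_bot in samples if score >= threshold and not is_bot)
    # Bots that would be let through.
    fn = sum(1 for score, is_bot in samples if score < threshold and is_bot)
    return fp, fn


# Hypothetical labeled traffic: (bot score, whether the client really is a bot).
labeled = [
    (0.05, False), (0.20, False), (0.35, False), (0.60, False),
    (0.40, True), (0.70, True), (0.85, True), (0.95, True),
]

# A strict threshold blocks every bot but also blocks some humans.
strict = confusion_counts(labeled, threshold=0.3)

# A lenient threshold blocks no humans but lets some bots through.
lenient = confusion_counts(labeled, threshold=0.8)
```

With this data the strict threshold yields two blocked humans and no missed bots, and the lenient one yields no blocked humans and two missed bots. Protecting account logins might justify the stricter setting, while best-effort protection of public pages might favor the lenient one.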
