Protection from Web Scraping


Question

I am currently part of a team developing an application that includes a front-end client.

Through this client we send data to the user. Each user has a user ID, and the client talks to our server through a RESTful API, asking the server for data.

For example, let's say we have a database of books, and the user can get the last 3 books an author wrote. We value our users' time, and we would like users to be able to start using the product without explicit registration.

We value our database. We use our own proprietary software to populate it and would like to protect it as much as we can.

So basically the question is:

What can we do to protect ourselves from web scraping?

I would very much like to learn about techniques for protecting our data. In particular, we would like to prevent users from typing every single author name into the author search panel and fetching the top three books each author wrote.

Any suggested reading would be appreciated.

I'd just like to mention that we're aware of captchas and would like to avoid them as much as possible.

Recommended answer

The main strategies for preventing this are:

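  • Require registration, so you can limit the requests per user.
  • Captchas for both registered and non-registered users.
  • Rate limiting per IP.
  • Require JavaScript. Writing a scraper that can execute JS is harder.
  • Robot blocking and bot detection (e.g. request rates, hidden link traps).
  • Data poisoning. Put in books and links that nobody would want, which stall the download for bots that blindly collect everything.
  • Mutation. Change your templates frequently, so that scrapers may fail to find the content they are looking for.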
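
As an illustration of the per-IP (or per-user) rate limiting mentioned in the list above, here is a minimal in-memory sketch. The class name `SlidingWindowLimiter`, its parameters, and the choice of a sliding window are assumptions made for this example, not something the answer prescribes. A real deployment would more likely keep the counters in a shared store so that all server instances see the same state.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical helper: allow at most `limit` requests per `window` seconds per key."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits = defaultdict(deque)  # key (IP or user ID) -> timestamps of recent requests

    def allow(self, key):
        now = self.clock()
        recent = self.hits[key]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False  # over the limit; the server could answer with HTTP 429
        recent.append(now)
        return True
```

The key can be an IP address for anonymous clients or a user ID once registration is required, so the same structure covers both of those strategies.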

Note that you can use captchas very flexibly.

For example: the first book for each IP every day is not captcha-protected. But in order to access a second book, a captcha needs to be solved.
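
The "first book per IP per day is free" rule could be sketched roughly as follows. `DailyCaptchaGate`, `request_book`, and the `captcha_solved` flag are hypothetical names for this example. Actual captcha verification is out of scope here and is represented only by that boolean.

```python
from datetime import date


class DailyCaptchaGate:
    """Hypothetical helper: serve `free_per_day` books per IP per day without a captcha."""

    def __init__(self, free_per_day=1):
        self.free_per_day = free_per_day
        self.counts = {}  # (ip, day) -> number of books served

    def request_book(self, ip, captcha_solved=False, today=None):
        """Return True if the book may be served, False if a captcha is required first."""
        day = today or date.today()
        served = self.counts.get((ip, day), 0)
        if served >= self.free_per_day and not captcha_solved:
            return False  # caller should present a captcha and retry
        self.counts[(ip, day)] = served + 1
        return True
```

This sketch never prunes entries for past days, so a long-running service would need to expire old keys. Raising `free_per_day` is one way to show captchas less often while still slowing down bulk extraction.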
