How do I prevent site scraping?


Question

I have a fairly large music website with a large artist database. I've been noticing other music sites scraping our site's data (I enter dummy artist names here and there and then do Google searches for them).

How can I prevent screen scraping? Is it even possible?

Answer

I will presume that you have set up robots.txt.
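
For reference, a minimal robots.txt of the kind presumed here might look like the two lines below; the disallowed path anticipates the honeypot page described later and is only a placeholder.

    User-agent: *
    Disallow: /jail.html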

As others have mentioned, scrapers can fake nearly every aspect of their activities, and it is probably very difficult to identify the requests that are coming from the bad guys.

I would consider the following:


  1. Set up a page, /jail.html.
  2. Disallow access to the page in robots.txt (so the respectful spiders will never visit).
  3. Place a link on one of your pages, hiding it with CSS (display: none).
  4. Record the IP addresses of visitors to /jail.html (a minimal sketch of these steps follows this list).
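
A minimal sketch of these four steps, assuming a Python/Flask application (the original answer names no framework; the log file name and the markup are illustrative placeholders):

    # A minimal honeypot sketch; Flask, the log file name, and the returned
    # markup are assumptions for illustration only.
    from flask import Flask, request

    app = Flask(__name__)

    # Step 2 -- robots.txt disallows the honeypot, so polite crawlers skip it:
    #   User-agent: *
    #   Disallow: /jail.html
    #
    # Step 3 -- a normal page carries a link no human will ever see or click:
    #   <a href="/jail.html" style="display: none">discography</a>

    @app.route("/jail.html")
    def jail():
        # Step 4 -- anyone landing here ignored robots.txt; record the IP.
        # X-Forwarded-For is only relevant when the app sits behind a proxy.
        ip = request.headers.get("X-Forwarded-For", request.remote_addr)
        with open("jail_visitors.log", "a") as log:
            log.write(ip + "\n")
        # Serve an ordinary-looking page so the scraper suspects nothing.
        return "<html><body><h1>Artists</h1><p>Loading...</p></body></html>"

    if __name__ == "__main__":
        app.run()

Checking the recorded addresses against your normal access logs then shows exactly which clients to worry about.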

This might help you to quickly identify requests from scrapers that are flagrantly disregarding your robots.txt.

You might also want to make /jail.html an entire site with exactly the same markup as your normal pages, but filled with fake data (/jail/album/63ajdka, /jail/track/3aads8, etc.). That way, the bad scrapers won't be alerted by unusual input before you have a chance to block them entirely.
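
One way the decoy could be wired up, continuing the hypothetical Flask sketch above (the "album.html" template, the ID format, and the log file are assumptions, not anything from the original answer):

    # A sketch of the decoy idea, again assuming Flask. The /jail/... routes
    # reuse a hypothetical "album.html" template with invented data, and IPs
    # already caught by /jail.html are refused outright.
    import random
    import string

    from flask import Flask, abort, render_template, request

    app = Flask(__name__)

    def fake_id(length=7):
        # A plausible-looking but meaningless identifier such as "63ajdka".
        return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))

    def banned_ips():
        # IPs previously recorded by the /jail.html honeypot.
        try:
            with open("jail_visitors.log") as log:
                return {line.strip() for line in log}
        except FileNotFoundError:
            return set()

    @app.before_request
    def block_known_scrapers():
        ip = request.headers.get("X-Forwarded-For", request.remote_addr)
        if ip in banned_ips():
            abort(403)

    @app.route("/jail/album/<album_id>")
    def fake_album(album_id):
        # Same template as a real album page, but every field is invented, so
        # the stolen copy is worthless and easy to recognise on other sites.
        return render_template(
            "album.html",
            title="Album " + album_id,
            tracks=["/jail/track/" + fake_id() for _ in range(10)],
        )

    if __name__ == "__main__":
        app.run()

Re-reading the log file on every request is only sketch-level convenience; in practice the banned set would be cached, or the blocking pushed down into the web server or a firewall.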
