Block Website Scraping by Google Docs


Problem Description


I run a website that provides various pieces of data in chart/tabular format for people to read. Recently I've noticed an increase in the requests to the website that originate from Google Docs. Looking at the IPs and User Agent, it does appear to be originating from Google servers - example IP lookup here.


The number of hits is in the region of 2,500 to 10,000 requests per day.


I assume that someone has created one or more Google Sheets that scrape data from my website (possibly using the IMPORTHTML function or similar). I would prefer that this did not happen (as I cannot know if the data is being attributed properly).
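For context, pulling a table from a page into a Sheet only takes a one-line formula of this kind (the URL and table index here are purely illustrative):

```
=IMPORTHTML("https://example.com/stats", "table", 1)
```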

Is there a preferred way of blocking this traffic that Google supports/approves of?


I would rather not block based on IP addresses, as blocking Google servers feels wrong, may lead to future problems, and the IPs could change anyway. At the moment I am blocking (returning a 403 status) based on the User-Agent containing GoogleDocs or docs.google.com.


Traffic is mostly coming from 66.249.89.221 and 66.249.89.223 at present, always with the user agent Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; http://docs.google.com)
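For what it's worth, a minimal sketch of that kind of User-Agent block, assuming a Flask-style Python app (the route name and marker list are illustrative and would need adapting to the actual stack):

```python
# Minimal sketch: return 403 when the User-Agent looks like a Google Docs /
# Sheets fetch. Assumes Flask; adapt the markers and routes to your own stack.
from flask import Flask, request, abort

app = Flask(__name__)

# Substrings seen in the logs for these requests (illustrative list).
BLOCKED_UA_MARKERS = ("GoogleDocs", "docs.google.com")

@app.before_request
def block_google_docs():
    ua = request.headers.get("User-Agent", "")
    if any(marker in ua for marker in BLOCKED_UA_MARKERS):
        abort(403)  # refuse suspected IMPORTHTML-style fetches

@app.route("/data")
def data():
    # Placeholder for the real chart/table endpoint.
    return "chart/table data"
```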


As a secondary question: Is there a way to trace the document or its account owner? I have access to the URLs that they are accessing, but little else to go on as the requests appear to proxy through the Google Docs servers (no Referer, Cookies or other such data in the HTTP logs).

Thanks.

Answer


Blocking on User-Agent is a great solution because there doesn't appear to be a way to set a different User-Agent and still use the IMPORTHTML function -- and since you're happy to ban 'all' usage from Docs/Sheets, that's perfect.


Additional thoughts, in case a full-on ban seems unpleasant:


  1. Rate limit it: as you say, you can recognize that it's mostly coming from two IPs and always with the same user agent, so just slow down your response. As long as the requests are serial, you can still provide the data, but at a pace that may be sufficient to discourage scraping. Delay your response (to suspected scrapers) by 20 or 30 seconds; see the sketch after this list.


Redirect to "You're blocked" screen, or screen with "default" data (i.e., scrapable, but not with current data). Better than basic 403 because it will tell the human it's not for scraping and then you can direct them to purchasing access (or at least requesting a key from you.)
