Block Website Scraping by Google Docs


Problem Description


I run a website that provides various pieces of data in chart/tabular format for people to read. Recently I've noticed an increase in the requests to the website that originate from Google Docs. Looking at the IPs and User Agent, it does appear to be originating from Google servers - example IP lookup here.


The number of hits is in the region of 2,500 to 10,000 requests per day.


I assume that someone has created one or more Google Sheets that scrape data from my website (possibly using the IMPORTHTML function or similar). I would prefer that this did not happen (as I cannot know if the data is being attributed properly).
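For context, pulling a table from a page into a Sheet only takes a one-line formula of this kind (the URL and table index here are purely illustrative):

```
=IMPORTHTML("https://example.com/stats", "table", 1)
```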

Is there a preferred way of blocking this traffic that Google supports/approves of?


I would rather not block based on IP addresses, as blocking Google servers feels wrong, may lead to future problems, and the IPs could change anyway. At the moment I am blocking (returning a 403 status) based on the User-Agent containing GoogleDocs or docs.google.com.


Traffic is mostly coming from 66.249.89.221 and 66.249.89.223 at present, always with the user agent Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; http://docs.google.com)
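For what it's worth, a minimal sketch of that kind of User-Agent block, assuming a Flask-style Python app (the route name and marker list are illustrative and would need adapting to the actual stack):

```python
# Minimal sketch: return 403 when the User-Agent looks like a Google Docs /
# Sheets fetch. Assumes Flask; adapt the markers and routes to your own stack.
from flask import Flask, request, abort

app = Flask(__name__)

# Substrings seen in the logs for these requests (illustrative list).
BLOCKED_UA_MARKERS = ("GoogleDocs", "docs.google.com")

@app.before_request
def block_google_docs():
    ua = request.headers.get("User-Agent", "")
    if any(marker in ua for marker in BLOCKED_UA_MARKERS):
        abort(403)  # refuse suspected IMPORTHTML-style fetches

@app.route("/data")
def data():
    # Placeholder for the real chart/table endpoint.
    return "chart/table data"
```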


As a secondary question: Is there a way to trace the document or its account owner? I have access to the URLs that they are accessing, but little else to go on as the requests appear to proxy through the Google Docs servers (no Referer, Cookies or other such data in the HTTP logs).

Thanks.

Answer


Blocking on User-Agent is a great solution because there doesn't appear to be a way to set a different User-Agent and still use the IMPORTHTML function -- and since you're happy to ban 'all' usage from Docs/Sheets, that's perfect.


Additional thoughts, in case a full-on ban seems unpleasant:


  1. Rate limit it: as you say, you can recognize that it's mostly coming from two IPs and always with the same user agent, so just slow down your response. As long as the requests are serial, you can still provide the data, but at a pace that may be sufficient to discourage scraping. Delay your response (to suspected scrapers) by 20 or 30 seconds; see the sketch after this list.


Redirect to "You're blocked" screen, or screen with "default" data (i.e., scrapable, but not with current data). Better than basic 403 because it will tell the human it's not for scraping and then you can direct them to purchasing access (or at least requesting a key from you.)
