Web Scraper: Limit to Requests Per Minute/Hour on Single Domain?


Question

I'm working with a librarian to re-structure his organization's digital photography archive.

I've built a Python robot with Mechanize and BeautifulSoup to pull about 7000 poorly structured and mildly incorrect/incomplete documents from a collection. The data will be formatted for a spreadsheet he can use to correct it. Right now I'm guesstimating 7500 HTTP requests total to build the search dictionary and then harvest the data, not counting mistakes and do-overs in my code, and then many more as the project progresses.
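
For reference, here's roughly what the fetch-and-parse step looks like (simplified sketch with a placeholder URL, since I'm not publishing the real domain; the actual code and markup differ):

```python
import mechanize
from bs4 import BeautifulSoup

# Placeholder URL -- not the real domain being scraped.
COLLECTION_URL = "http://example.org/collection?page=1"

br = mechanize.Browser()
br.set_handle_robots(True)  # honour the site's robots.txt
br.addheaders = [("User-Agent", "archive-cleanup-bot (contact: you@example.org)")]

response = br.open(COLLECTION_URL)
soup = BeautifulSoup(response.read(), "html.parser")

# Collect candidate record links for the later harvesting pass.
record_links = [a["href"] for a in soup.find_all("a", href=True)]
print(len(record_links), "links found")
```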

I assume there's some sort of built-in limit to how quickly I can make these requests, and even if there's not I'll give my robot delays to behave politely with the over-burdened web server(s). My question (admittedly impossible to answer with complete accuracy) is about how quickly I can make HTTP requests before encountering a built-in rate limit.
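
For the politeness delays I mention, I'm planning something like this simple throttle (sketch only; the 2-second interval is a guess I'll tune):

```python
import time

class Throttle:
    """Enforce a minimum gap between successive requests."""
    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        remaining = self._last + self.min_interval - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()

throttle = Throttle(min_interval=2.0)  # roughly 30 requests per minute
for url in record_links:               # record_links from the sketch above
    throttle.wait()
    page = br.open(url)
    # ... parse the record page and pull out the fields to correct ...
```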

I would prefer not to publish the URL for the domain we're scraping, but if it's relevant I'll ask my friend if it's okay to share.

Note: I realize this is not the best way to solve our problem (re-structuring/organizing the database) but we're building a proof-of-concept to convince the higher-ups to trust my friend with a copy of the database, from which he'll navigate the bureaucracy necessary to allow me to work directly with the data.

They've also given us the API for an ATOM feed, but it requires a keyword to search and seems useless for the task of stepping through every photograph in a particular collection.

Answer

There's no built-in rate limit for HTTP. Most common web servers are not configured out of the box to rate limit. If rate limiting is in place, it will almost certainly have been put there by the administrators of the website and you'd have to ask them what they've configured.
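
If the administrators have configured rate limiting, the most common visible symptom is an HTTP 429 (Too Many Requests) response, sometimes with a Retry-After header. A rough sketch of backing off when you see one (assuming the server signals it with 429 at all; it may simply block or drop connections instead):

```python
import time
import urllib.error
import urllib.request

def fetch_with_backoff(url, max_retries=3):
    """Retry after a pause if the server signals rate limiting with 429."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise
            # Retry-After may be delta-seconds or an HTTP date; this only
            # handles the delta-seconds form and falls back to 60 seconds.
            retry_after = err.headers.get("Retry-After", "60")
            time.sleep(int(retry_after) if retry_after.isdigit() else 60)
    raise RuntimeError("still rate-limited after %d attempts" % max_retries)
```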

Some search engines respect a non-standard extension to robots.txt that suggests a rate limit, so check for Crawl-delay in robots.txt.
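
If you're on Python 3.6 or later, the standard library can read that directive for you (otherwise just fetch robots.txt and look for the line yourself):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.org/robots.txt")  # substitute the real domain
rp.read()

delay = rp.crawl_delay("*")  # None if robots.txt has no Crawl-delay entry
if delay is not None:
    print("Site requests", delay, "seconds between fetches")
```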

HTTP does have a limit of two concurrent connections, but browsers have already started ignoring it and efforts are underway to revise that part of the standard, as it is quite outdated.
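
If you ever parallelise the harvesting, staying within that traditional two-connection limit is a conservative default; for example, a thread pool capped at two workers (placeholder URLs, plain urllib shown only for illustration):

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    with urllib.request.urlopen(url) as resp:
        return url, resp.read()

urls = ["http://example.org/photo/%d" % i for i in range(1, 11)]  # placeholders

# max_workers=2 keeps us within the traditional per-host connection limit.
with ThreadPoolExecutor(max_workers=2) as pool:
    for url, body in pool.map(fetch, urls):
        print(url, len(body), "bytes")
```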
