Web Scraper: Limit to Requests Per Minute/Hour on Single Domain?


Question

I'm working with a librarian to re-structure his organization's digital photography archive.

I've built a Python robot with Mechanize and BeautifulSoup to pull about 7000 poorly structured and mildly incorrect/incomplete documents from a collection. The data will be formatted for a spreadsheet he can use to correct it. Right now I'm guesstimating 7500 HTTP requests total to build the search dictionary and then harvest the data, not counting mistakes and do-overs in my code, and then many more as the project progresses.
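
For reference, here's roughly what the fetch-and-parse step looks like (simplified sketch with a placeholder URL, since I'm not publishing the real domain; the actual code and markup differ):

```python
import mechanize
from bs4 import BeautifulSoup

# Placeholder URL -- not the real domain being scraped.
COLLECTION_URL = "http://example.org/collection?page=1"

br = mechanize.Browser()
br.set_handle_robots(True)  # honour the site's robots.txt
br.addheaders = [("User-Agent", "archive-cleanup-bot (contact: you@example.org)")]

response = br.open(COLLECTION_URL)
soup = BeautifulSoup(response.read(), "html.parser")

# Collect candidate record links for the later harvesting pass.
record_links = [a["href"] for a in soup.find_all("a", href=True)]
print(len(record_links), "links found")
```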

I assume there's some sort of built-in limit to how quickly I can make these requests, and even if there's not I'll give my robot delays to behave politely with the over-burdened web server(s). My question (admittedly impossible to answer with complete accuracy) is about how quickly I can make HTTP requests before encountering a built-in rate limit.
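
For the politeness delays I mention, I'm planning something like this simple throttle (sketch only; the 2-second interval is a guess I'll tune):

```python
import time

class Throttle:
    """Enforce a minimum gap between successive requests."""
    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        remaining = self._last + self.min_interval - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()

throttle = Throttle(min_interval=2.0)  # roughly 30 requests per minute
for url in record_links:               # record_links from the sketch above
    throttle.wait()
    page = br.open(url)
    # ... parse the record page and pull out the fields to correct ...
```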

I would prefer not to publish the URL for the domain we're scraping, but if it's relevant I'll ask my friend if it's okay to share.

Note: I realize this is not the best way to solve our problem (re-structuring/organizing the database) but we're building a proof-of-concept to convince the higher-ups to trust my friend with a copy of the database, from which he'll navigate the bureaucracy necessary to allow me to work directly with the data.

They've also given us the API for an ATOM feed, but it requires a keyword to search and seems useless for the task of stepping through every photograph in a particular collection.

Answer

There's no built-in rate limit for HTTP. Most common web servers are not configured out of the box to rate limit. If rate limiting is in place, it will almost certainly have been put there by the administrators of the website and you'd have to ask them what they've configured.
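
If the administrators have configured rate limiting, the most common visible symptom is an HTTP 429 (Too Many Requests) response, sometimes with a Retry-After header. A rough sketch of backing off when you see one (assuming the server signals it with 429 at all; it may simply block or drop connections instead):

```python
import time
import urllib.error
import urllib.request

def fetch_with_backoff(url, max_retries=3):
    """Retry after a pause if the server signals rate limiting with 429."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise
            # Retry-After may be delta-seconds or an HTTP date; this only
            # handles the delta-seconds form and falls back to 60 seconds.
            retry_after = err.headers.get("Retry-After", "60")
            time.sleep(int(retry_after) if retry_after.isdigit() else 60)
    raise RuntimeError("still rate-limited after %d attempts" % max_retries)
```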

Some search engines respect a non-standard extension to robots.txt that suggests a rate limit, so check for Crawl-delay in robots.txt.
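
If you're on Python 3.6 or later, the standard library can read that directive for you (otherwise just fetch robots.txt and look for the line yourself):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.org/robots.txt")  # substitute the real domain
rp.read()

delay = rp.crawl_delay("*")  # None if robots.txt has no Crawl-delay entry
if delay is not None:
    print("Site requests", delay, "seconds between fetches")
```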

HTTP does have a limit of two concurrent connections, but browsers have already started ignoring it and efforts are underway to revise that part of the standard, as it is quite outdated.
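
If you ever parallelise the harvesting, staying within that traditional two-connection limit is a conservative default; for example, a thread pool capped at two workers (placeholder URLs, plain urllib shown only for illustration):

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    with urllib.request.urlopen(url) as resp:
        return url, resp.read()

urls = ["http://example.org/photo/%d" % i for i in range(1, 11)]  # placeholders

# max_workers=2 keeps us within the traditional per-host connection limit.
with ThreadPoolExecutor(max_workers=2) as pool:
    for url, body in pool.map(fetch, urls):
        print(url, len(body), "bytes")
```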
