Throttle concurrent HTTP requests from Spark executors
Problem description
I want to do some HTTP requests from inside a Spark job to a rate-limited API. In order to keep track of the number of concurrent requests in a non-distributed system (in Scala), the following works:
- a throttling actor which maintains a semaphore (counter) that increments when a request starts and decrements when the request completes. Although Akka is distributed, there are issues with (de)serializing the actorSystem in a distributed Spark context.
- using parallel streams with fs2: https://fs2.io/concurrency-primitives.html => cannot be distributed.
- I suppose I could also just collect the dataframes to the Spark driver and handle throttling there with one of the above options, but I would like to keep this distributed.
How are such things typically handled?
Answer
You shouldn't try to synchronise requests across Spark executors/partitions. That goes completely against the Spark concurrency model.
Instead, for example, divide the global rate limit R by executors * cores and use mapPartitions to send requests from each partition within its R/(e*c) rate limit.
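The per-partition share of the rate limit can be enforced with a simple local throttle. Here is a minimal sketch in plain Scala; the `Throttle` class, `callApi`, `numExecutors`, `numCoresPerExecutor`, and the global limit `R` are all assumptions for illustration, not part of any Spark API:

```scala
import java.util.concurrent.TimeUnit

// A minimal per-partition throttle: spaces out calls so that at most
// `ratePerSecond` requests per second are issued by the caller.
class Throttle(ratePerSecond: Double) {
  private val minIntervalNanos = (1e9 / ratePerSecond).toLong
  private var lastCall = 0L

  // Blocks until enough time has elapsed since the previous call.
  def acquire(): Unit = synchronized {
    val now = System.nanoTime()
    val waitNanos = lastCall + minIntervalNanos - now
    if (waitNanos > 0) TimeUnit.NANOSECONDS.sleep(waitNanos)
    lastCall = System.nanoTime()
  }
}

// Sketch of how it could be wired into a Spark job: one Throttle is
// created per partition inside mapPartitions, so no cross-executor
// coordination or serialization of shared state is needed.
//
// val perPartitionRate = R / (numExecutors * numCoresPerExecutor).toDouble
// df.rdd.mapPartitions { rows =>
//   val throttle = new Throttle(perPartitionRate)
//   rows.map { row =>
//     throttle.acquire() // stay within this partition's share of R
//     callApi(row)
//   }
// }
```

Because each partition throttles itself independently, the aggregate request rate stays at or below R without any shared counter, at the cost of unused headroom when some partitions finish early.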