Throttle concurrent HTTP requests from Spark executors


Question

I want to make some HTTP requests from inside a Spark job to a rate-limited API. In a non-distributed system (in Scala), the following ways of keeping track of the number of concurrent requests work:


  • a throttling actor which maintains a semaphore (counter) that increments when a request starts and decrements when it completes. Although Akka is distributed, there are issues (de)serializing the actorSystem in a distributed Spark context.
  • using parallel streams with fs2: https://fs2.io/concurrency-primitives.html => cannot be distributed.
  • I suppose I could also just collect the dataframes to the Spark driver and handle throttling there with one of the above options, but I would like to keep this distributed.

How are such things typically handled?
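For reference, the single-JVM semaphore approach described in the first bullet could be sketched like this (a minimal, non-distributed sketch; `LocalThrottle`, `maxConcurrent`, and `withThrottle` are illustrative names, not from the question):

```scala
import java.util.concurrent.Semaphore

object LocalThrottle {
  // Cap on concurrent in-flight requests (illustrative value).
  val maxConcurrent = 4
  private val permits = new Semaphore(maxConcurrent)

  // Runs `request`, blocking first if maxConcurrent requests are already in flight.
  def withThrottle[T](request: => T): T = {
    permits.acquire()
    try request
    finally permits.release()
  }
}
```

As the question notes, this state lives in one JVM, which is exactly why it does not transfer to Spark executors.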

Answer

You shouldn't try to synchronise requests across Spark executors/partitions. That goes completely against Spark's concurrency model.

Instead, for example, divide the global rate limit R by executors * cores and use mapPartitions to send requests from each partition within its R/(e*c) rate limit.
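That per-partition budget could be sketched as a simple fixed-interval throttle used inside mapPartitions (a sketch only; `PartitionThrottle`, `throttledMap`, and `callApi` are hypothetical names, not from the answer):

```scala
object PartitionThrottle {
  // Spaces successive calls so the consumer of the iterator stays at or
  // below maxPerSecond calls per second (fixed-interval throttling).
  def throttledMap[A, B](items: Iterator[A], maxPerSecond: Double)(f: A => B): Iterator[B] = {
    val intervalMs = (1000.0 / maxPerSecond).toLong
    items.map { a =>
      val result = f(a)          // e.g. the HTTP call
      Thread.sleep(intervalMs)   // wait before the next element is processed
      result
    }
  }
}

// Inside the Spark job, with a global limit R, e executors and c cores:
//   val perPartitionRate = R / (e * c).toDouble
//   rdd.mapPartitions(it => PartitionThrottle.throttledMap(it, perPartitionRate)(callApi))
```

Since iterators are lazy, the sleeps happen as Spark consumes each partition, so no shared state is needed across executors.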

