How do you perform blocking IO in an Apache Spark job?


Question

What if, while traversing an RDD, I need to compute values in the dataset by calling an external (blocking) service? How could that be achieved?

val values: Future[RDD[Double]] = Future sequence tasks

I've tried to create a list of Futures, but since an RDD is not Traversable, Future.sequence is not suitable.

I wonder whether anyone has had this problem, and how you solved it. What I'm trying to achieve is parallelism on a single worker node, so that I can call the external service 3000 times per second.

There is probably another solution that is more suitable for Spark, such as running multiple worker nodes on a single host.

I'd be interested to know how you cope with this kind of challenge. Thanks.

Recommended answer

Here is the answer to my own question:

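import org.apache.spark.rdd.RDD
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global

// Read the input into 100 partitions.
val buckets = sc.textFile(logFile, 100)
// Wrap each blocking call in a Future (created lazily, when the partition is iterated).
val tasks: RDD[Future[Object]] = buckets map { item =>
  Future {
    // call native code
  }
}

// On each partition, start all of its Futures and block until they complete.
// JOB_TIMEOUT must be a scala.concurrent.duration.Duration; it is not defined in this snippet.
val values = tasks.mapPartitions[Object] { f: Iterator[Future[Object]] =>
  val searchFuture: Future[Iterator[Object]] = Future sequence f
  Await result (searchFuture, JOB_TIMEOUT)
}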

The idea is that we get a collection of partitions, where each partition is sent to a specific worker and is the smallest unit of work. Each unit of work contains data that can be processed by calling the native code and passing it that data.
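The per-partition step can be sketched on its own, without Spark, so that it runs standalone. This is only an illustration under stated assumptions: `callService`, `processPartition`, `poolSize` and the 50 ms sleep are hypothetical names and values that do not appear in the answer. It also uses a dedicated fixed thread pool rather than the global execution context, because the global context sizes itself to the number of available processors by default, which would cap the number of blocking calls in flight.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BlockingPartitionSketch {
  // Hypothetical stand-in for the external blocking service.
  def callService(item: String): Int = {
    Thread.sleep(50) // simulate blocking IO
    item.length
  }

  // Processes one partition: starts one Future per item on a dedicated
  // pool, then blocks until all of them finish or the timeout expires.
  def processPartition(items: Iterator[String], poolSize: Int, timeout: Duration): Iterator[Int] = {
    val pool = Executors.newFixedThreadPool(poolSize)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // toList forces every Future to be created before we start waiting.
      val futures = items.map(item => Future(callService(item))).toList
      Await.result(Future.sequence(futures), timeout).iterator
    } finally {
      pool.shutdown()
    }
  }

  def main(args: Array[String]): Unit = {
    val partition = Iterator("a", "bb", "ccc", "dddd")
    val results = processPartition(partition, poolSize = 4, timeout = 5.seconds).toList
    println(results) // prints List(1, 2, 3, 4)
  }
}
```

In a Spark job, a function of this shape could be passed to `mapPartitions`, for example `rdd.mapPartitions(processPartition(_, 64, 10.minutes))`. Note that this sketch holds all results of a partition in memory at once, so it suits partitions of moderate size.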

The 'values' collection contains the data returned from the native code, and the work is done across the cluster.
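To relate this to the target of 3000 calls per second from the question: by Little's law, the number of calls that must be in flight at once is roughly the call rate multiplied by the latency of one call. The sketch below assumes a hypothetical latency of 100 ms; the real latency of the external service is not given, and `requiredConcurrency` is an illustrative name.

```scala
object ConcurrencySizing {
  // Little's law: calls in flight = arrival rate (calls/s) * latency (s).
  def requiredConcurrency(callsPerSecond: Double, latencyMillis: Double): Int =
    math.ceil(callsPerSecond * latencyMillis / 1000.0).toInt

  def main(args: Array[String]): Unit = {
    // 3000 calls/s is the target from the question; 100 ms is an assumed latency.
    println(requiredConcurrency(3000, 100)) // prints 300
  }
}
```

Under that assumption, about 300 calls would need to be outstanding at any moment on the worker, which gives a rough lower bound for the number of concurrent Futures or pool threads per node.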
