Using Futures within Spark


Question

A Spark job makes a remote web service call for every element in an RDD. A simple implementation might look something like this:

def webServiceCall(url: String): String = scala.io.Source.fromURL(url).mkString
val rdd2 = rdd1.map(x => webServiceCall(x.field1))

(The above example has been kept simple and does not handle timeouts).
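For completeness: scala.io.Source.fromURL offers no timeout control, so a hedged sketch of one way to bound the call is shown below, using java.net.HttpURLConnection instead; the timeout values are illustrative.

import java.net.{HttpURLConnection, URL}
import scala.io.Source

def webServiceCall(url: String): String = {
  val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
  conn.setConnectTimeout(5000)  // connect timeout in ms, illustrative
  conn.setReadTimeout(10000)    // read timeout in ms, illustrative
  Source.fromInputStream(conn.getInputStream).mkString
}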

There is no interdependency between any of the results for different elements of the RDD.

Would the above be improved by using Futures to optimize performance by making parallel calls to the web service for each element of the RDD? Or does Spark itself have that level of optimization built in, so that it will run the operations on each element in the RDD in parallel?

If the above can be optimized by using Futures, does anyone have some code examples showing the correct way to use Futures within a function passed to a Spark RDD?

Thanks

Answer

Or does Spark itself have that level of optimization built in, so that it will run the operations on each element in the RDD in parallel?

It doesn't. Spark parallelizes tasks at the partition level, but by default every partition is processed sequentially in a single thread.
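The built-in lever for more parallelism is therefore the number of partitions, not per-element threading. A minimal sketch, assuming the rdd1 and webServiceCall from the question (the partition count of 100 is illustrative):

// More partitions means more concurrent tasks, bounded by the total
// executor cores; each element is still processed serially within a task.
val rdd2 = rdd1.repartition(100).map(x => webServiceCall(x.field1))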

Would the above be improved by using Futures

It could be an improvement, but it is quite hard to do right. In particular:

  • Every Future has to be completed in the same stage, before any reshuffle takes place.
  • Given the lazy nature of the Iterators used to expose partition data, you cannot do it with high-level primitives like map (see, for example, Spark job with Async HTTP call).
  • You can build custom logic using mapPartitions, but then you have to deal with all the consequences of non-lazy partition evaluation; a sketch follows this list.
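A minimal sketch of the mapPartitions approach, assuming the rdd1 and webServiceCall from the question; the pool size and timeout are illustrative, and Await.result deliberately forces the whole partition (the non-lazy evaluation mentioned above):

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

val rdd2 = rdd1.mapPartitions { iter =>
  val pool = Executors.newFixedThreadPool(8)  // per-task pool, illustrative size
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  // .toList forces the lazy iterator, so all Futures start immediately.
  val futures = iter.map(x => Future(webServiceCall(x.field1))).toList
  // Block until every call in this partition completes (or the timeout fires).
  val results = Await.result(Future.sequence(futures), 10.minutes)
  pool.shutdown()
  results.iterator
}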
