Is using parallel collections encouraged in Spark?


Problem description

Does it make sense to use parallel collections on Spark?

All the Spark examples I have seen so far use RDDs of very simple data types (single classes and tuples). But in fact collections, and parallel collections in particular, may be used as the elements of an RDD.

A worker may have several cores available for execution, and if a regular collection is used as the RDD element, those extra cores stay idle.

Here is a test I ran with the local master:

import org.apache.spark.{SparkConf, SparkContext}

val conf: SparkConf = new SparkConf().setAppName("myApp").setMaster("local[2]")
val sc = new SparkContext(conf)

// Pair each number with the list 1..n, then turn the inner list into a parallel array
val l = List(1, 2, 3, 4, 5, 6, 7, 8)
val l1 = l.map(item => (item, (1 to item).toList))
val l2 = l1.map(item => (item._1, item._2.toParArray))
val l3 = sc.parallelize(l2)
// Print the thread name handling each inner element to count the active threads
l3.sortBy(_._1).foreach(t => t._2.map(x => { println(t._1 + " " + Thread.currentThread.getName); x / 2 }))

In this case, when I use the parArray I see 16 threads working, and when I use a plain Array only 2 threads work. This can be read as 2 workers each having 8 threads available.

On the other hand, any logic over the parallel collections can be rewritten as RDD transformations of simple types.
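For instance, a minimal sketch (reusing sc and l1 from the test above) of flattening the nested parArray logic into plain RDD transformations:

// Flatten the nested lists into one RDD of simple (key, value) pairs,
// so that Spark itself schedules the per-element work across cores
val flat = sc.parallelize(l1)
  .flatMap { case (k, vs) => vs.map(v => (k, v)) }
  .map { case (k, v) => (k, v / 2) }  // same per-element work as the parArray version

Each pair is now an independent RDD element, so Spark's scheduler, rather than a nested thread pool, decides how the available cores are used.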

Is using those parallel collections encouraged and considered good practice?

Recommended answer

Is using those parallel collections encouraged and considered good practice?

Unlikely. Consider the following facts:

  • Any parallel execution inside a task is completely opaque to the resource manager, and as a result it cannot automatically allocate the required resources.
  • You can use spark.task.cpus to explicitly ask for a specific number of threads within a task, but it is a global setting that cannot be adjusted depending on the context, so you effectively block resources whether you use them or not.
  • If thread underutilization is a valid concern, you can always increase the number of partitions instead; see the sketch after this list.
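As a minimal sketch of those two options (reusing sc and l1 from the question; the numbers are illustrative):

import org.apache.spark.SparkConf

// Global setting: every task in the application reserves 4 cores
val conf = new SparkConf()
  .setAppName("myApp")
  .setMaster("local[8]")
  .set("spark.task.cpus", "4")

// Usually simpler: create more partitions so each core gets its own task
val rdd = sc.parallelize(l1, numSlices = 8)  // or existingRdd.repartition(8)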

Finally, parallel collections are fairly complicated and difficult to manage (implicit thread pools). They are good for basic thread management, but Spark itself has much more sophisticated parallelization built in.
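To see why the implicit thread pools are hard to manage: by default every parallel collection shares one JVM-wide ForkJoinPool, and overriding it has to be done per collection. A sketch (Scala 2.12 API; the pool size is illustrative):

import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

val nums = (1 to 100).par
// The pool must be set on each collection individually; otherwise the
// JVM-wide default pool is used silently
nums.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))
nums.map(_ * 2)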
