Is using parallel collections encouraged in Spark?


Problem description

Does it make sense to use parallel collections on Spark?

All the Spark examples I have seen so far use RDDs of very simple data types (single classes and tuples). But in fact collections, and parallel collections in particular, may be used as the elements of an RDD.

A worker may have several cores available for execution, and if a regular collection is used as the RDD element, those extra cores stay idle.

Here is a test I ran with the local master:

import org.apache.spark.{SparkConf, SparkContext}

val conf: SparkConf = new SparkConf().setAppName("myApp").setMaster("local[2]")
val sc = new SparkContext(conf)

// Pair each number with the list 1..n, then turn the inner list into a parallel array
val l = List(1, 2, 3, 4, 5, 6, 7, 8)
val l1 = l.map(item => (item, (1 to item).toList))
val l2 = l1.map(item => (item._1, item._2.toParArray))
val l3 = sc.parallelize(l2)
// Print the thread name handling each inner element to count the active threads
l3.sortBy(_._1).foreach(t => t._2.map(x => { println(t._1 + " " + Thread.currentThread.getName); x / 2 }))

In this case, when I use the parArray I see 16 threads working, and when I use a plain Array only 2 threads work. This can be read as 2 workers each having 8 threads available.

On the other hand, any logic over the parallel collections can be rewritten as RDD transformations of simple types.
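For instance, a minimal sketch (reusing sc and l1 from the test above) of flattening the nested parArray logic into plain RDD transformations:

// Flatten the nested lists into one RDD of simple (key, value) pairs,
// so that Spark itself schedules the per-element work across cores
val flat = sc.parallelize(l1)
  .flatMap { case (k, vs) => vs.map(v => (k, v)) }
  .map { case (k, v) => (k, v / 2) }  // same per-element work as the parArray version

Each pair is now an independent RDD element, so Spark's scheduler, rather than a nested thread pool, decides how the available cores are used.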

Is using those parallel collections encouraged and considered good practice?

Recommended answer

Is using those parallel collections encouraged and considered good practice?

Unlikely. Consider the following facts:

  • Any parallel execution inside a task is completely opaque to the resource manager, and as a result it cannot automatically allocate the required resources.
  • You can use spark.task.cpus to explicitly ask for a specific number of threads within a task, but it is a global setting that cannot be adjusted depending on the context, so you effectively block resources whether you use them or not.
  • If thread underutilization is a valid concern, you can always increase the number of partitions instead; see the sketch after this list.
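As a minimal sketch of those two options (reusing sc and l1 from the question; the numbers are illustrative):

import org.apache.spark.SparkConf

// Global setting: every task in the application reserves 4 cores
val conf = new SparkConf()
  .setAppName("myApp")
  .setMaster("local[8]")
  .set("spark.task.cpus", "4")

// Usually simpler: create more partitions so each core gets its own task
val rdd = sc.parallelize(l1, numSlices = 8)  // or existingRdd.repartition(8)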

Finally, parallel collections are fairly complicated and difficult to manage (implicit thread pools). They are good for basic thread management, but Spark itself has much more sophisticated parallelization built in.
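To see why the implicit thread pools are hard to manage: by default every parallel collection shares one JVM-wide ForkJoinPool, and overriding it has to be done per collection. A sketch (Scala 2.12 API; the pool size is illustrative):

import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

val nums = (1 to 100).par
// The pool must be set on each collection individually; otherwise the
// JVM-wide default pool is used silently
nums.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))
nums.map(_ * 2)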
