Spark CollectAsMap


Question

I would like to know how collectAsMap works in Spark. More specifically, I would like to know where the aggregation of the data from all partitions takes place. The aggregation happens either on the master or on the workers. In the first case, each worker sends its data to the master, and once the master has collected the data from every worker, it aggregates the results. In the second case, the workers are responsible for aggregating the results (after exchanging data among themselves), and the result is then sent to the master.

It is critical for me to find a way for the master to collect the data from each partition separately, without the workers exchanging data.

Answer

You can see how collectAsMap is implemented in the Spark source (referenced below). Since the RDD type is a tuple, it looks like they just use the normal RDD collect and then translate the tuples into a map of key/value pairs. But they do mention in a comment that a multi-map isn't supported, so you need a 1-to-1 key/value mapping across your data.

collectAsMap function
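To make that concrete, here is a simplified paraphrase of the idea (not the verbatim Spark source, whose details vary by version; the function name collectAsMapSketch is my own): collect every (K, V) tuple to the driver, then fold the tuples into a mutable HashMap.

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD

// Sketch of what PairRDDFunctions.collectAsMap boils down to:
// a plain collect() followed by building a HashMap on the driver.
def collectAsMapSketch[K, V](rdd: RDD[(K, V)]): scala.collection.Map[K, V] = {
  val data = rdd.collect()                      // one array of tuples on the driver
  val map = new mutable.HashMap[K, V]
  map.sizeHint(data.length)
  data.foreach { case (k, v) => map.put(k, v) } // duplicate keys: later pairs overwrite earlier ones
  map
}
```

The HashMap build is why a multi-map isn't supported: each put for an already-seen key replaces the previous value.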

What collect does is execute a Spark job, get back the results for each partition from the workers, and aggregate them with a reduce/concat phase on the driver.

collect function
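In the same spirit, a sketch of what collect itself does (again paraphrased, not the exact source): the job materializes each partition into an array on its worker, and those per-partition arrays are concatenated on the driver.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Sketch of RDD.collect: runJob turns each partition into an array on its
// worker and ships it back; the concatenation -- the "reduce/concat phase"
// mentioned above -- then happens entirely on the driver.
def collectSketch[T: ClassTag](rdd: RDD[T]): Array[T] = {
  val perPartition: Array[Array[T]] =
    rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => iter.toArray)
  Array.concat(perPartition: _*)
}
```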

So given that, it should be the case that the driver collects the data from each partition separately, without the workers exchanging data, to perform collectAsMap.

Note that if you are doing transformations on your RDD prior to using collectAsMap that cause a shuffle to occur, there may be an intermediate step in which workers exchange data among themselves. Check your cluster master's application UI for more information about how Spark is executing your application.
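A quick way to see both behaviors is the self-contained example below (the object name CollectAsMapDemo is my own): it prints the collected map, where duplicate keys collapse, and then the lineage of a shuffling transformation, where a ShuffledRDD stage shows up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CollectAsMapDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("collectAsMap-demo"))

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // No shuffle here: each partition's tuples go straight to the driver.
    // A Map keeps one value per key, so only one of the "a" entries survives.
    println(pairs.collectAsMap())  // e.g. Map(b -> 2, a -> 3); ordering may vary

    // By contrast, reduceByKey introduces a shuffle before any collect:
    // the lineage printed below includes a ShuffledRDD.
    println(pairs.reduceByKey(_ + _).toDebugString)

    sc.stop()
  }
}
```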

