How does the collectAsMap() function work for Spark API

Question

I am trying to understand what happens when we run the collectAsMap() function in Spark. As per the PySpark docs, it says:

collectAsMap(self) Return the key-value pairs in this RDD to the master as a dictionary.

For core Spark, it says:

def collectAsMap(): Map[K, V] Return the key-value pairs in this RDD to the master as a Map.
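
For reference, a minimal spark-shell sketch of that call with made-up data (`sc` is the SparkContext that spark-shell predefines; the pairs below are purely illustrative):

```scala
// All keys are distinct here, so collectAsMap brings every pair back to the driver.
val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
pairs.collectAsMap()
// => a scala.collection.Map[Int,String], e.g. Map(2 -> b, 1 -> a, 3 -> c)
```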

When I try to run some sample code in PySpark for a list, I get this result:

and for Scala I get this result:

I am a little confused as to why it is not returning all the elements in the list. Can somebody help me understand what is happening in this scenario and why I am getting only some of the results?

Thanks.

Answer

The semantics of collectAsMap are identical between the Scala and Python APIs, so I'll look at the former WLOG. The documentation for PairRDDFunctions.collectAsMap explicitly states:

Warning: this doesn't return a multimap (so if you have multiple values to the same key, only one value per key is preserved in the map returned)

In particular, the current implementation inserts the key-value pairs into the resultant map in order and thus only the last two pairs survive in each of your two examples.
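
To see this concretely, here is a minimal spark-shell sketch, assuming an input whose keys repeat (the data below is invented for illustration, not the questioner's original list):

```scala
// Keys 1 and 2 each appear twice; collectAsMap keeps only one value per key.
val rdd = sc.parallelize(Seq((1, 10), (1, 11), (2, 20), (2, 21)))
rdd.collectAsMap()
// => e.g. Map(2 -> 21, 1 -> 11): the later pair for each key wins,
//    matching the "only the last two pairs survive" observation above.
```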

If you use collect instead, it will return Array[(Int,Int)] without losing any of your pairs.
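
For comparison, calling collect on the same hypothetical RDD keeps every pair:

```scala
rdd.collect()
// => Array((1,10), (1,11), (2,20), (2,21)) -- nothing is collapsed by key.
```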
