What is glom? How is it different from mapPartitions?


Question

I've come across the glom() method on RDD. According to the documentation:

Return an RDD created by coalescing all elements within each partition into an array

Does glom shuffle the data across partitions, or does it only return each partition's data as an array? In the latter case, I believe the same can be achieved using mapPartitions.

I would also like to know whether there are any use cases that benefit from glom.

Recommended answer

Does glom shuffle the data across partitions?

No, it doesn't.

If this is the second case, I believe that the same can be achieved using mapPartitions.

It can:

rdd.mapPartitions(iter => Iterator(iter.toArray))

but the same thing applies to any non-shuffling transformation, such as map, flatMap, or filter.
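To make the equivalence concrete, here is a minimal sketch that runs both forms on a small RDD and returns the per-partition contents. It assumes Spark is on the classpath and uses a local master; the object name `GlomVsMapPartitions`, the app name, and the sample data are illustrative choices, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

object GlomVsMapPartitions {
  // Returns the per-partition contents produced by glom() and by the
  // equivalent mapPartitions call, so the two can be compared.
  def compare(): (Seq[Seq[Int]], Seq[Seq[Int]]) = {
    // Assumption: a local two-thread master is enough for this sketch.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("glom-sketch")
      .getOrCreate()
    try {
      // Ten integers spread over three partitions (sample data).
      val rdd = spark.sparkContext.parallelize(1 to 10, numSlices = 3)

      // glom: one Array per partition, no shuffle involved.
      val viaGlom = rdd.glom().collect().map(_.toSeq).toSeq

      // Same result by materializing each partition's iterator by hand.
      val viaMapPartitions = rdd
        .mapPartitions(iter => Iterator(iter.toArray))
        .collect()
        .map(_.toSeq)
        .toSeq

      (viaGlom, viaMapPartitions)
    } finally {
      spark.stop()
    }
  }
}
```

Both results should contain one sequence per partition, with the elements staying in the partition they started in.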

Are there any use cases which benefit from glom?

Any situation where you need to access partition data in a form that can be traversed more than once. mapPartitions hands you an Iterator, which can be consumed only once, whereas glom gives you an Array.
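The difference between a single-pass Iterator and a reusable Array can be shown in plain Scala without Spark. The following is a sketch only; `MultiPass`, `centered`, and `sumThenSize` are hypothetical names introduced for illustration.

```scala
object MultiPass {
  // Needs two traversals: one to compute the mean, one to subtract it.
  // This works on an Array because an Array can be traversed repeatedly.
  def centered(values: Array[Double]): Array[Double] =
    if (values.isEmpty) values
    else {
      val mean = values.sum / values.length // first pass
      values.map(_ - mean)                  // second pass
    }

  // The same idea on an Iterator fails silently: the first traversal
  // exhausts it, so the second traversal sees no elements.
  def sumThenSize(it: Iterator[Double]): (Double, Int) = {
    val total = it.sum      // consumes the iterator
    val remaining = it.size // nothing is left to count
    (total, remaining)
  }
}
```

Assuming an `RDD[Double]` named `rdd`, something like `rdd.glom().map(MultiPass.centered)` would apply the two-pass computation to each partition independently. The trade-off is that each partition is fully materialized in memory as an array.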
