How to use the function mapPartitionsWithIndex in Spark?


Question

mapPartitionsWithIndex has a parameter preservesPartitioning, and I don't know how to set it.

I ran a test:

// partitionedRDD's type is RDD[(String, String)]
partitionedRDD.mapPartitionsWithIndex((index, iter) => {
  iter.map(_._1)
}, args(2).toBoolean).saveAsTextFile(args(3))

Whether I set preservesPartitioning to false or true, the RDD's partitions do not change. Why?

If I don't want the partitions to change, what value should I set for preservesPartitioning?

Recommended answer

I think you are confused about what preservesPartitioning means. Setting it to true does not tell Spark 'please preserve the partitions'; it tells Spark 'I have a function that preserves keys, and the RDD is a pair RDD'. The flag only controls whether the resulting RDD keeps the parent's partitioner. mapPartitionsWithIndex processes each partition in place, so the number of partitions is the same either way.

From the Spark documentation:

preservesPartitioning indicates whether the input function preserves the partitioner, which should be false unless this is a pair RDD and the input function doesn't modify the keys.
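
To illustrate why the documentation asks for false when keys are modified, here is a minimal sketch, assuming a local Spark context. The object name, the sample data and the key-rewriting function are hypothetical and not part of the original question. It sets the flag to true for a function that does change the keys, then uses lookup, which relies on the partitioner to decide which partition to search.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical demo object; names and sample data are illustrative only.
object StalePartitionerDemo {
  // Returns (lookup result with flag = true, lookup result with flag = false).
  def run(): (Seq[String], Seq[String]) = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("stale-partitioner-demo"))
    try {
      val pairs = sc.parallelize(Seq("a" -> "1", "b" -> "2", "c" -> "3"))
        .partitionBy(new HashPartitioner(4))

      // This function rewrites every key, so records no longer sit in the
      // partition that the HashPartitioner would assign to their new key.
      val rekey = (_: Int, iter: Iterator[(String, String)]) =>
        iter.map { case (k, v) => (k + "x", v) }

      // Flag = true: Spark trusts the (now stale) partitioner of the parent.
      val wronglyFlagged =
        pairs.mapPartitionsWithIndex(rekey, preservesPartitioning = true)
      // Flag = false: the result has no partitioner, so lookup scans everything.
      val correctlyFlagged =
        pairs.mapPartitionsWithIndex(rekey, preservesPartitioning = false)

      (wronglyFlagged.lookup("ax"), correctlyFlagged.lookup("ax"))
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

With this sample data, the first lookup should come back empty, because Spark searches only the partition where the key "ax" would hash to, while the record still sits where "a" was placed. The second lookup scans all partitions and finds the value.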

In your case, you have a pair RDD and the function does not modify the keys, so the flag should be true.
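
The effect of the flag can be checked by inspecting the partitioner of the result. The following is a minimal sketch, assuming a local Spark context. The object name and sample data are hypothetical, and the function keeps the (key, value) shape rather than dropping the values as the question's code does.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical demo object; names and sample data are illustrative only.
object PreservesPartitioningDemo {
  // Returns (partition count with false, partition count with true,
  //          partitioner defined with false, partitioner defined with true).
  def run(): (Int, Int, Boolean, Boolean) = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("preserves-partitioning-demo"))
    try {
      val pairs = sc.parallelize(Seq("a" -> "x", "b" -> "y", "c" -> "z"))
        .partitionBy(new HashPartitioner(4))

      // Keys are left untouched; only the values change.
      val keepKeys = (_: Int, iter: Iterator[(String, String)]) =>
        iter.map { case (k, v) => (k, v.toUpperCase) }

      val dropped = pairs.mapPartitionsWithIndex(keepKeys, preservesPartitioning = false)
      val kept = pairs.mapPartitionsWithIndex(keepKeys, preservesPartitioning = true)

      (dropped.getNumPartitions, kept.getNumPartitions,
        dropped.partitioner.isDefined, kept.partitioner.isDefined)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Both results have the same number of partitions as the input; only the one built with preservesPartitioning = true still reports a partitioner. Keeping the partitioner matters for later key-based operations such as reduceByKey or join, which can skip a shuffle when the data is already partitioned the way they need.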
