When to use mapPartitions and mapPartitionsWithIndex?

Problem description

The PySpark documentation describes two functions:

mapPartitions(f, preservesPartitioning=False)

   Return a new RDD by applying a function to each partition of this RDD.

   >>> rdd = sc.parallelize([1, 2, 3, 4], 2)
   >>> def f(iterator): yield sum(iterator)
   >>> rdd.mapPartitions(f).collect()
   [3, 7]

and then...

mapPartitionsWithIndex(f, preservesPartitioning=False)

   Return a new RDD by applying a function to each partition of this RDD, 
   while tracking the index of the original partition.

   >>> rdd = sc.parallelize([1, 2, 3, 4], 4)
   >>> def f(splitIndex, iterator): yield splitIndex
   >>> rdd.mapPartitionsWithIndex(f).sum()
   6

What use cases do these functions attempt to solve? I can't see why they would be required.

Recommended answer

To answer this question we need to compare map with mapPartitions/mapPartitionsWithIndex (mapPartitions and mapPartitionsWithIndex pretty much do the same thing except with mapPartitionsWithIndex you can track which partition is being processed).
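As one illustration of what tracking the partition index makes possible, here is a minimal sketch that drops a header row. It assumes the header is the first element of partition 0; `drop_header` and the sample rows are hypothetical, and the partitions are simulated with plain lists so the snippet runs without a Spark cluster.

```python
from itertools import islice


def drop_header(split_index, iterator):
    # Assumption: only partition 0 starts with the header line.
    if split_index == 0:
        return islice(iterator, 1, None)  # skip the first element lazily
    return iterator


# Two simulated partitions; with Spark these would live inside an RDD.
partitions = [["id,name", "1,a"], ["2,b", "3,c"]]

rows = [
    row
    for index, part in enumerate(partitions)
    for row in drop_header(index, iter(part))
]
print(rows)
```

With a real RDD the same function would be passed as `rdd.mapPartitionsWithIndex(drop_header)`, since it has the `(splitIndex, iterator)` signature shown in the documentation excerpt in the question.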

mapPartitions and mapPartitionsWithIndex are mainly used to optimize the performance of your application. For the sake of illustration, say every element in your RDD is an XML fragment and you need a parser to process each one. That means you need an instance of some parser class. You could get one in two ways:

map + foreach: Here an instance of the parser class is created for each element, the element is processed, and the instance is eventually garbage collected without ever being reused for another element. So for an RDD of 12 elements distributed among 4 partitions, the parser is instantiated 12 times. If constructing the parser is expensive, that cost is paid for every element.

mapPartitions/mapPartitionsWithIndex: These two methods address the situation above. They work on partitions rather than on individual elements (don't get me wrong, every element still gets processed). The function you pass is called once per partition, so it can create the parser instance once per partition. With only 4 partitions, the parser is instantiated 4 times (in this example, 8 fewer instantiations than with map). The catch is that the function must take an iterator over the partition's elements instead of a single element, and return or yield the results. So with mapPartitions and mapPartitionsWithIndex the parser instance is created, all elements of the current partition are processed with it, and the instance is later destroyed by GC. When setup is expensive, this can improve the performance of your application significantly.
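The 12-versus-4 count above can be checked with a small sketch. `CountingParser` is a hypothetical stand-in for an expensive parser, and the four partitions are simulated with plain lists so the snippet runs without Spark. The class-level counter is for local demonstration only; it would not be shared across Spark executor processes.

```python
class CountingParser:
    """Hypothetical stand-in for a parser that is expensive to construct."""

    instances = 0  # counts constructions (local demo only)

    def __init__(self):
        CountingParser.instances += 1

    def parse(self, text):
        return text.strip().upper()


def parse_each(element):
    # Per-element style, as you would pass to rdd.map: new parser every call.
    return CountingParser().parse(element)


def parse_partition(iterator):
    # Per-partition style, as you would pass to rdd.mapPartitions:
    # one parser per partition, elements pulled lazily from the iterator.
    parser = CountingParser()
    for element in iterator:
        yield parser.parse(element)


# 12 elements spread over 4 simulated partitions.
partitions = [[f"<e{p}{i}/>" for i in range(3)] for p in range(4)]

CountingParser.instances = 0
per_element = [parse_each(x) for part in partitions for x in part]
map_instances = CountingParser.instances

CountingParser.instances = 0
per_partition = [y for part in partitions for y in parse_partition(iter(part))]
map_partitions_instances = CountingParser.instances

print(map_instances, map_partitions_instances)  # prints "12 4"
```

Both styles produce the same results; only the number of parser constructions differs. On a real RDD the calls would be `rdd.map(parse_each)` and `rdd.mapPartitions(parse_partition)`.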

So the bottom line is: whenever some setup step is common to all elements and can be done once and then reused for all of them, it is better to go for mapPartitions/mapPartitionsWithIndex.

Please find the below two links for explanations with code example: https://bzhangusc.wordpress.com/2014/06/19/optimize-map-performamce-with-mappartitions/ http://apachesparkbook.blogspot.in/2015/11/mappartition-example.html
