Apache Spark: When not to use mapPartition and foreachPartition?


Question


I know that when we want to initialize some resource for a whole partition of an RDD, rather than for each individual element, we should ideally use mapPartition and foreachPartition — for example, when opening one JDBC connection per partition of data. But are there scenarios where we should not use either of them, and should instead use the plain map() and foreach() transformation and action?
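To make the cost difference concrete, here is a minimal plain-Python sketch (not actual Spark code) of the two styles the question contrasts. `FakeConnection`, the counter, and the hand-built "partitions" are all hypothetical stand-ins for an expensive resource such as a JDBC connection.

```python
class FakeConnection:
    """Hypothetical stand-in for an expensive resource, e.g. a JDBC connection."""
    opened = 0  # class-level counter of how many connections were created

    def __init__(self):
        FakeConnection.opened += 1

    def write(self, record):
        pass  # pretend to write the record somewhere


def foreach_style(records):
    """map()/foreach() style: a connection is opened for EVERY element."""
    for r in records:
        conn = FakeConnection()
        conn.write(r)


def foreach_partition_style(partitions):
    """foreachPartition() style: one connection per partition, reused for its elements."""
    for part in partitions:
        conn = FakeConnection()
        for r in part:
            conn.write(r)


data = list(range(100))
partitions = [data[i:i + 25] for i in range(0, 100, 25)]  # 4 "partitions" of 25

FakeConnection.opened = 0
foreach_style(data)
per_element = FakeConnection.opened        # 100 connections

FakeConnection.opened = 0
foreach_partition_style(partitions)
per_partition = FakeConnection.opened      # 4 connections

print(per_element, per_partition)  # 100 4
```

In real Spark the same logic lives inside the function you pass to foreachPartition, which receives an iterator over one partition's records; the point is only that per-partition setup amortizes the resource cost across many elements.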

Accepted answer


When you write Spark jobs that use either mapPartition or foreachPartition, you can only modify the partition data itself, or iterate through it, respectively. The anonymous function you pass in is executed on the executors, so there is no viable way to run code that coordinates work across all nodes — e.g. df.reduceByKey — from within one particular executor. Such code can only run on the driver node. Consequently, only from driver code can you access DataFrames, Datasets, and the SparkSession.
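One way to see why driver-only objects cannot be used inside these functions: Spark serializes the closure (and everything it captures) before shipping it to executors, and handles like a SparkSession wrap live OS resources that cannot be serialized. The following is a plain-Python analogy, not Spark itself — `FakeSparkSession` is hypothetical, with a real socket standing in for the driver's live context.

```python
import pickle
import socket


class FakeSparkSession:
    """Hypothetical stand-in for a driver-only object holding a live OS resource."""
    def __init__(self):
        self._sock = socket.socket()  # a live handle, like a driver's internal context


session = FakeSparkSession()


def partition_func(partition):
    # Anti-pattern: referencing the driver-only session inside the function,
    # which silently captures it in the closure to be shipped to executors.
    return [(session, row) for row in partition]


# Spark pickles the closure and its captured variables before sending them
# to executors; an object holding a live socket cannot be pickled.
try:
    pickle.dumps(session)
    serializable = True
except TypeError:
    serializable = False

print(serializable)  # False
```

This is why the answer says DataFrames, Datasets, and the SparkSession are reachable only from driver code: they never survive the trip into the executor-side closure.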

Please find here a detailed discussion of this issue and possible solutions.

