Spark - foreach vs foreachPartition: when to use which?
Question
I would like to know whether foreachPartition will result in better performance, due to a higher level of parallelism, compared to the foreach method, considering the case in which I'm iterating through an RDD in order to perform some sums into an accumulator variable.
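To make the difference in granularity concrete, here is a plain-Python sketch (a hypothetical analogy, not actual Spark code): the function passed to foreach is invoked once per element, while the function passed to foreachPartition is invoked once per partition and receives an iterator over that partition's elements.

```python
# Plain-Python analogy for foreach vs foreachPartition semantics.
# This simulates the call pattern only; it is not the Spark API.

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # an "RDD" with 3 partitions

total_foreach = 0
def per_element(x):
    # foreach-style: called once per element -> 9 calls here
    global total_foreach
    total_foreach += x

for part in partitions:
    for x in part:
        per_element(x)

total_foreach_partition = 0
partition_calls = 0
def per_partition(records):
    # foreachPartition-style: called once per partition -> 3 calls here,
    # each call receiving an iterator over that partition's elements
    global total_foreach_partition, partition_calls
    partition_calls += 1
    total_foreach_partition += sum(records)

for part in partitions:
    per_partition(iter(part))

print(total_foreach, total_foreach_partition, partition_calls)  # 45 45 3
```

Both variants compute the same sum; what changes is how often your function is invoked and whether any per-call setup cost is paid per element or per partition.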
Answer

foreach automatically runs the loop on many nodes.
However, sometimes you want to do some operation on each node. For example, make a connection to a database. You cannot just make a single connection and pass it into the foreach function: the connection would only exist on one node. So with foreachPartition, you can make a database connection on each node (once per partition) before running the loop over that partition's records.
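The connection-per-partition pattern described above can be sketched as follows. This is a plain-Python stand-in, not runnable Spark code: FakeDBConnection is a hypothetical placeholder for a real database client, and the final loop simulates what rdd.foreachPartition(write_partition) would do across executors.

```python
# Hypothetical sketch of the foreachPartition connection pattern.
# FakeDBConnection stands in for a real database client library.

class FakeDBConnection:
    opened = 0  # counts how many connections were created in total

    def __init__(self):
        FakeDBConnection.opened += 1
        self.rows = []

    def insert(self, row):
        self.rows.append(row)

    def close(self):
        pass  # a real client would release the socket here

written = []

def write_partition(records):
    # Called once per partition: open ONE connection, reuse it
    # for every record in the partition, then close it.
    conn = FakeDBConnection()
    try:
        for r in records:
            conn.insert(r)
    finally:
        conn.close()
    written.extend(conn.rows)

# Simulating rdd.foreachPartition(write_partition) on 2 partitions:
for part in [["a", "b"], ["c"]]:
    write_partition(iter(part))

print(FakeDBConnection.opened, written)  # 2 ['a', 'b', 'c']
```

With foreach, the equivalent code would open one connection per record (three here); with foreachPartition, the setup cost is paid once per partition (two here), which is why this pattern is the usual recommendation for non-serializable, expensive resources such as database connections.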