Getting the number of visible nodes in PySpark
Question
I'm running some operations in PySpark, and recently increased the number of nodes in my configuration (on Amazon EMR). However, even though I tripled the number of nodes (from 4 to 12), performance does not seem to have changed. As such, I'd like to check whether the new nodes are visible to Spark.
I'm calling the following function:
sc.defaultParallelism
>>>> 2
But I think this is telling me the number of tasks distributed to each node, not the total number of nodes that Spark can see.
How do I go about seeing the number of nodes that PySpark is using in my cluster?
Answer
sc.defaultParallelism is just a hint. Depending on the configuration, it may have no relation to the number of nodes. It is the number of partitions used when you call an operation that takes a partition-count argument but don't provide one. For example, sc.parallelize makes a new RDD from a list; you can tell it how many partitions to create in the RDD with the second argument, and the default value for that argument is sc.defaultParallelism.
You can get the number of executors with sc.getExecutorMemoryStatus in the Scala API, but this is not exposed in the Python API.
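Although it isn't exposed in the Python API, the JVM SparkContext can still be reached through PySpark's py4j gateway. A commonly used sketch (note that `_jsc` is an internal attribute, not a stable public API, and the returned map includes the driver's own block manager):

```python
def count_executors(sc):
    """Count the entries in the JVM's getExecutorMemoryStatus map.

    Reaches through PySpark's py4j gateway via the internal `_jsc`
    attribute. The map includes the driver, so subtract 1 to count
    only the worker executors.
    """
    return sc._jsc.sc().getExecutorMemoryStatus().size()
```

On a running cluster, `count_executors(sc) - 1` would then give the number of worker executors.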
In general, the recommendation is to have around 4 times as many partitions in an RDD as you have executors. This is a good rule of thumb because, if there is variance in how long the tasks take, it evens things out: some executors will process 5 faster tasks while others process 3 slower tasks, for example.
You don't need to be very accurate with this; a rough estimate is fine. For example, if you know you have fewer than 200 CPUs, you can say 500 partitions will be fine.
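The rule of thumb above can be sketched as a small helper (the function name and signature are made up for illustration; the factor of 4 is the guideline from this answer):

```python
def suggest_partitions(num_executors, cores_per_executor=1, factor=4):
    """Rough partition count: about `factor` tasks per core,
    following the ~4x rule of thumb. Precision doesn't matter much;
    an estimate in the right ballpark is fine."""
    return max(1, num_executors * cores_per_executor * factor)

# e.g. 12 executors with 4 cores each -> 192 partitions
print(suggest_partitions(12, cores_per_executor=4))
```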
So try to create RDDs with this number of partitions:
rdd = sc.parallelize(data, 500) # If distributing local data.
rdd = sc.textFile('file.csv', 500) # If loading data from a file.
Or, if you don't control the creation of the RDD, repartition it before the computation:
rdd = rdd.repartition(500)
You can check the number of partitions in an RDD with rdd.getNumPartitions().