Getting the number of visible nodes in PySpark


Problem description

I'm running some operations in PySpark, and recently increased the number of nodes in my configuration (which is on Amazon EMR). However, even though I tripled the number of nodes (from 4 to 12), performance seems not to have changed. As such, I'd like to see if the new nodes are visible to Spark.

I'm calling the following:

>>> sc.defaultParallelism
2

But I think this is telling me the total number of tasks distributed to each node, not the total number of nodes that Spark can see.

How do I go about seeing the number of nodes that PySpark is using in my cluster?

Answer

sc.defaultParallelism is just a hint. Depending on the configuration, it may have no relation to the number of nodes. It is the number of partitions used if you call an operation that takes a partition-count argument but don't provide it. For example, sc.parallelize makes a new RDD from a list. You can tell it how many partitions to create in the RDD with its second argument, but the default value for this argument is sc.defaultParallelism.
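
As a quick illustration (a minimal sketch, assuming an active SparkContext named sc, e.g. in the pyspark shell), you can see that default being picked up when you omit the partition count:

data = range(1000)

rdd_default = sc.parallelize(data)       # no partition count given
rdd_explicit = sc.parallelize(data, 20)  # explicit partition count

print(rdd_default.getNumPartitions())    # falls back to sc.defaultParallelism
print(rdd_explicit.getNumPartitions())   # 20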

You can get the number of executors with sc.getExecutorMemoryStatus in the Scala API, but this is not exposed in the Python API.
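
That said, a rough workaround (a sketch, not a public PySpark API) is to reach the underlying Scala SparkContext through the py4j gateway that PySpark keeps in sc._jsc. Because this relies on internal attributes, it may behave differently across Spark versions:

# Assumes an active SparkContext `sc`; sc._jsc is internal and may change.
status = sc._jsc.sc().getExecutorMemoryStatus()

# The returned map has one entry per block manager ("host:port"),
# which usually means one per executor plus one for the driver.
print("block managers:", status.size())
print("executors (approx.):", status.size() - 1)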

In general, the recommendation is to have around 4 times as many partitions in an RDD as you have executors. This is a good tip, because if there is variance in how much time the tasks take, this evens things out. Some executors will process 5 faster tasks while others process 3 slower tasks, for example.
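
A tiny sketch of that rule of thumb (the executor count here is an assumed value for illustration; in practice you could obtain it with the workaround above):

executor_count = 12                      # assumed value for illustration
target_partitions = 4 * executor_count   # roughly 4 tasks per executor per stage

rdd = sc.parallelize(range(100000), target_partitions)
print(rdd.getNumPartitions())            # 48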

You don't need to be very accurate about this. If you have a rough idea, you can go with an estimate. For example, if you know you have fewer than 200 CPUs, you can say 500 partitions will be fine.

So try to create RDDs with this number of partitions:

rdd = sc.parallelize(data, 500)     # If distributing local data.
rdd = sc.textFile('file.csv', 500)  # If loading data from a file.
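
Note that the second argument to sc.textFile is minPartitions, i.e. a lower bound: depending on how the input file is split, Spark may end up creating more partitions than the number you pass.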

Or repartition the RDD before the computation if you don't control the creation of the RDD:

rdd = rdd.repartition(500)

You can check the number of partitions in an RDD with rdd.getNumPartitions().

