Not able to set the number of shuffle partitions in PySpark

Problem description

I know that by default, the number of partitions for tasks is set to 200 in Spark. I can't seem to change this. I'm running Jupyter with Spark 1.6.

I'm loading a fairly small table with about 37K rows from Hive, using the following in my notebook:

from pyspark.sql.functions import *
sqlContext.sql("set spark.sql.shuffle.partitions=10")
test = sqlContext.table('some_table')
print test.rdd.getNumPartitions()
print test.count()

The output confirms 200 tasks. From the activity log, it is spinning up 200 tasks, which is overkill. It seems like line 2 above (the set statement) is ignored. So I tried the following:

test = sqlContext.table('gfcctdmn_work.icgdeskrev_emma_cusip_activity_bw').repartition(5)

and created a new cell:

print test.rdd.getNumPartitions()
print test.count()

The output shows 5 partitions, but the log shows 200 tasks being spun up for the count, with the repartition to 5 only taking place afterwards. However, if I first convert it to an RDD and back to a DataFrame as follows:

 test = sqlContext.table('gfcctdmn_work.icgdeskrev_emma_cusip_activity_bw').repartition(5).rdd

and created a new cell:

print test.getNumPartitions()
print test.toDF().count()

The very first time I ran the new cell, it still ran with 200 tasks. However, the second time I ran it, it ran with 5 tasks.

How can I make the code run with 5 tasks the very first time it is run?

Would you mind explaining why it behaves this way (the number of partitions is specified, yet it still runs with the default setting)? Is it because the default Hive table was created using 200 partitions?

Recommended answer

At the beginning of your notebook, do something like this:

from pyspark import SparkContext
from pyspark.conf import SparkConf

# Stop the SparkContext the notebook created, then rebuild it with the new setting
sc.stop()
conf = SparkConf().setAppName("test")
conf.set("spark.default.parallelism", 10)
sc = SparkContext(conf=conf)

When the notebook starts, a SparkContext has already been created for you, but you can still change the configuration and recreate it.
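
Note that the sqlContext already defined in the notebook is bound to the old, stopped SparkContext, so after recreating sc it will likely need to be rebuilt as well before reading the Hive table again. A minimal sketch, assuming a HiveContext (the question reads a Hive table) and reapplying the shuffle setting the question tried to change:

from pyspark.sql import HiveContext

# Rebuild the SQL entry point on top of the recreated SparkContext
# (assumes sc is the new context created above)
sqlContext = HiveContext(sc)

# Reapply the setting from the question on the new context
sqlContext.setConf("spark.sql.shuffle.partitions", "10")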

As for spark.default.parallelism, I understand it is what you need; take a look here:

Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
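
A quick way to check that the setting took effect, sketched here under the assumption that sc is the reconfigured context from above:

# parallelize() falls back to spark.default.parallelism when numSlices is omitted
rdd = sc.parallelize(range(1000))
print rdd.getNumPartitions()   # expected: 10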
