Spark: Order of column arguments in repartition vs partitionBy
Question
Methods taken into consideration (Spark 2.2.1):

- DataFrame.repartition (the two implementations that take partitionExprs: Column* parameters)
- DataFrameWriter.partitionBy

Note: This question doesn't ask about the difference between these methods.
From the docs of partitionBy:

If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a Dataset by year and then month, the directory layout would look like:

- year=2016/month=01/
- year=2016/month=02/

From this, I infer that the order of column arguments will decide the directory layout; hence it is relevant.
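To make the layout claim concrete, here is a minimal sketch (hypothetical data and output path; assumes an existing SparkSession named spark):

df = spark.createDataFrame(
    [(2016, 1, "a"), (2016, 2, "b")],
    ["year", "month", "value"],
)
# partitionBy("year", "month") nests year directories above month directories;
# reversing the argument order would nest month above year instead.
df.write.partitionBy("year", "month").parquet("/tmp/by_year_month")
# Resulting layout (integer partition values are not zero-padded):
#   /tmp/by_year_month/year=2016/month=1/...
#   /tmp/by_year_month/year=2016/month=2/...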
From the docs of repartition:

Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions. The resulting Dataset is hash partitioned.

As per my current understanding, repartition decides the degree of parallelism in handling the DataFrame. With this definition, the behaviour of repartition(numPartitions: Int) is straightforward, but the same can't be said about the other two implementations of repartition that take partitionExprs: Column* arguments.
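As a quick illustration of the parallelism point (a sketch, assuming an existing SparkSession named spark):

spark.range(100).repartition(6).rdd.getNumPartitions()
# 6 -- repartition(numPartitions: Int) only fixes the number of partitions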
All things said, my doubts are the following:

- Like the partitionBy method, is the order of column inputs relevant in the repartition method too?
- Does each chunk extracted for parallel execution contain the same data as would have been in each group had we run a SQL query with GROUP BY on the same columns?
- If not, what is the behaviour of the repartition(columnExprs: Column*) method?

Recommended answer

The only similarity between these two methods is their names. They are used for different things and have different mechanics, so you shouldn't compare them at all.
That being said, repartition shuffles the data as follows:

- With partitionExprs, it uses a hash partitioner on the columns used in the expression, with spark.sql.shuffle.partitions as the number of partitions.
- With partitionExprs and numPartitions, it does the same as the previous one, but overrides spark.sql.shuffle.partitions.
- With numPartitions alone, it just rearranges the data using RoundRobinPartitioning.
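A sketch of the three call forms just described (assumes an existing SparkSession named spark):

df = spark.range(10).toDF("x")
df.repartition("x")       # hash partitioning on x, spark.sql.shuffle.partitions partitions
df.repartition(8, "x")    # hash partitioning on x, exactly 8 partitions
df.repartition(8)         # round-robin redistribution into exactly 8 partitions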
"Is the order of column inputs relevant in the repartition method too?"

It is. hash((x, y)) is in general not the same as hash((y, x)):

df = (spark.range(5, numPartitions=4).toDF("x")
    .selectExpr("cast(x as string)")
    .crossJoin(spark.range(5, numPartitions=4).toDF("y")))

# rows per partition when hashing on (y, x):
df.repartition(4, "y", "x").rdd.glom().map(len).collect()
# [8, 6, 9, 2]

# rows per partition when hashing on (x, y):
df.repartition(4, "x", "y").rdd.glom().map(len).collect()
# [6, 4, 3, 12]
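The asymmetry can also be observed directly with Spark's hash function (an aside: spark_hash below is just an alias for pyspark.sql.functions.hash, the Murmur3-based hash that hash partitioning is built on):

from pyspark.sql.functions import hash as spark_hash

# For the same row, hash(x, y) and hash(y, x) generally differ,
# so column order changes which partition the row lands in.
df.select(spark_hash("x", "y").alias("hash_xy"),
          spark_hash("y", "x").alias("hash_yx")).show(3)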
"Does each chunk extracted for parallel execution contain the same data as would have been in each group had we run a SQL query with GROUP BY on the same columns?"

Depends on what the exact question is:

- GROUP BY with the same set of columns will result in the same logical distribution of keys over partitions.
- GROUP BY, however, "sees" only the actual groups, not the individual rows.