Spark: Order of column arguments in repartition vs partitionBy
Question
Methods taken into consideration (Spark 2.2.1):

- DataFrame.repartition (the two implementations that take partitionExprs: Column* parameters)
- DataFrameWriter.partitionBy

Note: This question doesn't ask about the difference between these methods.
From the docs of partitionBy:

If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a Dataset by year and then month, the directory layout would look like:

- year=2016/month=01/
- year=2016/month=02/

From this, I infer that the order of column arguments will decide the directory layout; hence it is relevant.
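To make the layout claim concrete, here is a minimal sketch (hypothetical data and output path; assumes an existing SparkSession named spark):

df = spark.createDataFrame(
    [(2016, 1, "a"), (2016, 2, "b")],
    ["year", "month", "value"],
)
# partitionBy("year", "month") nests year directories above month directories;
# reversing the argument order would nest month above year instead.
df.write.partitionBy("year", "month").parquet("/tmp/by_year_month")
# Resulting layout (integer partition values are not zero-padded):
#   /tmp/by_year_month/year=2016/month=1/...
#   /tmp/by_year_month/year=2016/month=2/...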
From the docs of repartition:

Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions. The resulting Dataset is hash partitioned.

As per my current understanding, repartition decides the degree of parallelism in handling the DataFrame. With this definition, the behaviour of repartition(numPartitions: Int) is straightforward, but the same can't be said about the other two implementations of repartition that take partitionExprs: Column* arguments.
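As a quick illustration of the parallelism point (a sketch, assuming an existing SparkSession named spark):

spark.range(100).repartition(6).rdd.getNumPartitions()
# 6 -- repartition(numPartitions: Int) only fixes the number of partitions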
All things said, my doubts are the following:

- Like the partitionBy method, is the order of column inputs relevant in the repartition method too?
- Does each chunk extracted for parallel execution contain the same data as would have been in each group had we run a SQL query with GROUP BY on the same columns?
- If not, what is the behaviour of the repartition(columnExprs: Column*) method?

Recommended answer

The only similarity between these two methods is their names. They are used for different things and have different mechanics, so you shouldn't compare them at all.
That being said, repartition shuffles the data as follows:

- With partitionExprs, it uses a hash partitioner on the columns used in the expression, with spark.sql.shuffle.partitions as the number of partitions.
- With partitionExprs and numPartitions, it does the same as the previous one, but overrides spark.sql.shuffle.partitions.
- With numPartitions alone, it just rearranges the data using RoundRobinPartitioning.
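A sketch of the three call forms just described (assumes an existing SparkSession named spark):

df = spark.range(10).toDF("x")
df.repartition("x")       # hash partitioning on x, spark.sql.shuffle.partitions partitions
df.repartition(8, "x")    # hash partitioning on x, exactly 8 partitions
df.repartition(8)         # round-robin redistribution into exactly 8 partitions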
"Is the order of column inputs relevant in the repartition method too?"

It is. hash((x, y)) is in general not the same as hash((y, x)):

df = (spark.range(5, numPartitions=4).toDF("x")
    .selectExpr("cast(x as string)")
    .crossJoin(spark.range(5, numPartitions=4).toDF("y")))

# rows per partition when hashing on (y, x):
df.repartition(4, "y", "x").rdd.glom().map(len).collect()
# [8, 6, 9, 2]

# rows per partition when hashing on (x, y):
df.repartition(4, "x", "y").rdd.glom().map(len).collect()
# [6, 4, 3, 12]
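The asymmetry can also be observed directly with Spark's hash function (an aside: spark_hash below is just an alias for pyspark.sql.functions.hash, the Murmur3-based hash that hash partitioning is built on):

from pyspark.sql.functions import hash as spark_hash

# For the same row, hash(x, y) and hash(y, x) generally differ,
# so column order changes which partition the row lands in.
df.select(spark_hash("x", "y").alias("hash_xy"),
          spark_hash("y", "x").alias("hash_yx")).show(3)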
"Does each chunk extracted for parallel execution contain the same data as would have been in each group had we run a SQL query with GROUP BY on the same columns?"

Depends on what the exact question is:

- GROUP BY with the same set of columns will result in the same logical distribution of keys over partitions.
- GROUP BY, however, "sees" only the actual groups, not the individual rows.