Spark: Order of column arguments in repartition vs partitionBy


Problem description

Methods under consideration (Spark 2.2.1):

  1. DataFrame.repartition (the two overloads that take partitionExprs: Column* arguments)
  2. DataFrameWriter.partitionBy

Note: This question doesn't ask about the difference between these methods.

From the docs of partitionBy:

If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a Dataset by year and then month, the directory layout would look like:

  • year=2016/month=01/
  • year=2016/month=02/

From this, I infer that the order of column arguments decides the directory layout; hence it is relevant.
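The point about ordering can be illustrated with a minimal plain-Python sketch (this is not Spark's code; the function name and records are hypothetical): a Hive-style partition path is just the partition columns' "col=value" segments joined in the order the columns were given, so reversing the argument order reverses the directory nesting.

```python
def partition_path(record, partition_cols):
    """Build the Hive-style directory segment for one record,
    nesting columns in the order they appear in partition_cols."""
    return "/".join(f"{col}={record[col]}" for col in partition_cols)

rows = [
    {"year": 2016, "month": "01", "value": 10},
    {"year": 2016, "month": "02", "value": 20},
]

# partitionBy("year", "month") -> year=.../month=.../
print(sorted({partition_path(r, ["year", "month"]) for r in rows}))
# -> ['year=2016/month=01', 'year=2016/month=02']

# Reversing the column order reverses the nesting:
print(sorted({partition_path(r, ["month", "year"]) for r in rows}))
# -> ['month=01/year=2016', 'month=02/year=2016']
```

So for partitionBy the argument order is observable on disk, which matters for partition pruning on reads.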

From the docs of repartition:

Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as the number of partitions. The resulting Dataset is hash partitioned.

As per my current understanding, repartition decides the degree of parallelism in handling the DataFrame. With this definition, the behaviour of repartition(numPartitions: Int) is straightforward, but the same can't be said about the other two overloads of repartition that take partitionExprs: Column* arguments.
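Hash partitioning can be sketched in plain Python (this is not Spark's actual Murmur3-based implementation; the names and the built-in hash are stand-ins): each row is routed to partition hash(key) % numPartitions, so all rows sharing the same values for the partition expressions land in the same partition.

```python
NUM_PARTITIONS = 4  # stands in for spark.sql.shuffle.partitions

def assign_partition(row, partition_cols, num_partitions=NUM_PARTITIONS):
    """Route a row to a partition by hashing its partition-column values.
    Illustrative only: Spark uses a Murmur3 hash of the expressions."""
    key = tuple(row[c] for c in partition_cols)
    return hash(key) % num_partitions

rows = [
    {"year": 2016, "month": "01"},
    {"year": 2016, "month": "01"},
    {"year": 2016, "month": "02"},
]

parts = [assign_partition(r, ["year", "month"]) for r in rows]
# Rows with identical ("year", "month") values always share a partition:
assert parts[0] == parts[1]
```

Unlike partitionBy, nothing here produces a directory layout: the column arguments only feed the hash key, which is why their role (and the relevance of their order) is less obvious from the docs.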

All things said, my doubts are the following:
