How to auto-calculate numRepartition when writing a Spark DataFrame
Question
When I try to write a DataFrame to a Hive Parquet partitioned table:
df.write.partitionBy("key").mode("append").format("hive").saveAsTable("db.table")
it creates a lot of blocks in HDFS, each holding only a small amount of data.
I understand how this happens: each Spark sub-task creates a block and then writes data to it.
I also understand that increasing the number of blocks can improve Hadoop performance, but past a threshold it degrades performance instead.
If I want to set numPartition automatically, does anyone have a good idea?
numPartition = ??? // auto-calculated from the DataFrame size, or similar
df.repartition(numPartition).write // pass the variable, not the string "numPartition"
.partitionBy("key")
.format("hive")
.saveAsTable("db.table")
Answer
First of all, why do you want an extra repartition step when you are already using partitionBy(key)? Your data will be partitioned based on the key.
Generally, you can repartition by a column value; that is a common scenario and helps with operations such as reduceByKey and filtering on a column value. For example:
// toDF and the $"..." column syntax require the session's implicits
import spark.implicits._

val birthYears = List(
  (2000, "name1"),
  (2000, "name2"),
  (2001, "name3"),
  (2000, "name4"),
  (2001, "name5")
)

val df = birthYears.toDF("year", "name")
df.repartition($"year")
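If you do want to derive a partition count automatically, one common approach is to divide an estimated input size by a target file size. The sketch below assumes Spark's optimizer size estimate (`df.queryExecution.optimizedPlan.stats.sizeInBytes`, available on recent Spark versions) is a usable rough proxy for the data volume; the 128 MB target and the helper name `numPartitionsFor` are illustrative choices, not an established API.

```scala
// Pure helper: how many partitions are needed so each output file
// lands near targetBytes (ceiling division, at least 1 partition).
def numPartitionsFor(estimatedBytes: BigInt,
                     targetBytes: Long = 128L * 1024 * 1024): Int =
  math.max(1, ((estimatedBytes + targetBytes - 1) / targetBytes).toInt)

// Hypothetical usage inside a Spark job (sketch, not exact sizing --
// the optimizer statistic is an estimate, not the on-disk size):
// val estimated    = df.queryExecution.optimizedPlan.stats.sizeInBytes
// val numPartition = numPartitionsFor(estimated)
// df.repartition(numPartition, $"key")
//   .write
//   .partitionBy("key")
//   .mode("append")
//   .format("hive")
//   .saveAsTable("db.table")
```

Repartitioning by both the count and the key column keeps rows for the same Hive partition together, so each `key` directory gets fewer, larger files instead of one small file per task.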