How to auto-calculate numRepartition when writing a Spark DataFrame

Question

When I try to write a DataFrame to a Hive Parquet partitioned table:

df.write.partitionBy("key").mode("append").format("hive").saveAsTable("db.table")

It creates a lot of blocks in HDFS, and each block holds only a small amount of data.
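A quick way to see how many files such a write will produce is to check the DataFrame's partition count, since each write task produces its own file(s) under every partition-key value it holds. A minimal Scala sketch, not from the original question:

// Each write task creates its own file(s) for every "key" value it holds,
// so a DataFrame with many partitions yields many small files.
val numTasks = df.rdd.getNumPartitions
println(s"the write will be performed by $numTasks tasks")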

I understand how this happens: each Spark sub-task creates a block and then writes data to it.

I also understand that increasing the number of blocks can improve Hadoop performance, but past a certain threshold it starts to hurt performance instead.

If I want to set numPartition automatically, does anyone have a good idea?

val numPartition = ??? // auto-calculated from the DataFrame size, or similar
df.repartition(numPartition)
  .write
  .partitionBy("key")
  .mode("append")
  .format("hive")
  .saveAsTable("db.table")
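The post leaves numPartition open, but one common heuristic (not part of the original question or answer) is to estimate the DataFrame's size from the optimizer statistics and divide by a target file size. A rough Scala sketch, assuming a recent Spark version where queryExecution.optimizedPlan.stats is available; the 128 MB target is an arbitrary choice matching a typical HDFS block size:

// Catalyst's size estimate for the plan, in bytes (an estimate only; it can be
// a crude upper bound unless table/column statistics have been collected).
val estimatedSizeInBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes

// Aim for roughly one HDFS-block-sized (128 MB) file per shuffle partition.
val targetFileSizeInBytes = 128L * 1024 * 1024
val numPartition = math.max(1, (estimatedSizeInBytes.toLong / targetFileSizeInBytes).toInt)

df.repartition(numPartition)
  .write
  .partitionBy("key")
  .mode("append")
  .format("hive")
  .saveAsTable("db.table")

Because the size is only an estimate, treat the resulting partition count as a starting point rather than an exact file count.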

Answer

First of all, why do you want an extra repartition step when you are already using partitionBy(key)? Your data will already be partitioned based on the key.

Generally, you can repartition by a column value; that is a common scenario and helps with operations such as reduceByKey, filtering on a column value, and so on. For example,

import spark.implicits._ // assumes `spark` is the active SparkSession; needed for toDF and the $ column syntax

val birthYears = List(
  (2000, "name1"),
  (2000, "name2"),
  (2001, "name3"),
  (2000, "name4"),
  (2001, "name5")
)
val df = birthYears.toDF("year", "name")

// hash-partitions the DataFrame by the "year" column
df.repartition($"year")
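If the goal is still to bound the number of output files, repartition also has an overload that takes both a partition count and columns: every row with the same column value still lands in the same partition, but the total number of partitions is capped. A small sketch (the count of 10 is arbitrary, not from the answer):

// Hash-partition by "year" into at most 10 partitions; all rows for a given
// year end up in the same partition, so a later .write.partitionBy("year")
// produces one file per year value.
val byYear = df.repartition(10, $"year")
println(byYear.rdd.getNumPartitions) // at most 10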
