How to auto-calculate numRepartition when writing a Spark DataFrame
Question
When I try to write a DataFrame to a Hive Parquet partitioned table:
df.write.partitionBy("key").mode("append").format("hive").saveAsTable("db.table")
it creates a lot of blocks in HDFS, each holding only a small amount of data.
I understand how this happens: each Spark sub-task creates a block and then writes data to it.
I also understand that increasing the number of blocks can improve Hadoop performance, but past a threshold it degrades performance instead.
If I want to set numPartition automatically, does anyone have a good idea?
numPartition = ??? // auto-calculated from the DataFrame size, or similar
df.repartition(numPartition).write // pass the variable, not the string "numPartition"
.partitionBy("key")
.format("hive")
.saveAsTable("db.table")
Answer
First of all, why do you want an extra repartition step when you are already using partitionBy(key)? Your data will be partitioned based on the key.
Generally, you can repartition by a column value; that is a common scenario and helps with operations such as reduceByKey and filtering on a column value. For example:
// toDF and the $"..." column syntax require the session's implicits
import spark.implicits._

val birthYears = List(
  (2000, "name1"),
  (2000, "name2"),
  (2001, "name3"),
  (2000, "name4"),
  (2001, "name5")
)

val df = birthYears.toDF("year", "name")
df.repartition($"year")
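If you do want to derive a partition count automatically, one common approach is to divide an estimated input size by a target file size. The sketch below assumes Spark's optimizer size estimate (`df.queryExecution.optimizedPlan.stats.sizeInBytes`, available on recent Spark versions) is a usable rough proxy for the data volume; the 128 MB target and the helper name `numPartitionsFor` are illustrative choices, not an established API.

```scala
// Pure helper: how many partitions are needed so each output file
// lands near targetBytes (ceiling division, at least 1 partition).
def numPartitionsFor(estimatedBytes: BigInt,
                     targetBytes: Long = 128L * 1024 * 1024): Int =
  math.max(1, ((estimatedBytes + targetBytes - 1) / targetBytes).toInt)

// Hypothetical usage inside a Spark job (sketch, not exact sizing --
// the optimizer statistic is an estimate, not the on-disk size):
// val estimated    = df.queryExecution.optimizedPlan.stats.sizeInBytes
// val numPartition = numPartitionsFor(estimated)
// df.repartition(numPartition, $"key")
//   .write
//   .partitionBy("key")
//   .mode("append")
//   .format("hive")
//   .saveAsTable("db.table")
```

Repartitioning by both the count and the key column keeps rows for the same Hive partition together, so each `key` directory gets fewer, larger files instead of one small file per task.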