Spark partitionBy much slower than without it

Published: 2020/9/4 6:07:15. Tags: scala, apache-spark, apache-spark-sql, parquet


I tested writing with:

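 import org.apache.spark.sql.SaveMode

 // Partitioned write: one output directory per (id, name) combination
 df.write.partitionBy("id", "name")
    .mode(SaveMode.Append)
    .parquet(filePath)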

However, if I leave out the partitioning:

 // Same write without partitionBy: all files go into filePath directly
 df.write
    .mode(SaveMode.Append)
    .parquet(filePath)

It executes 100x(!) faster.

Is it normal for the same amount of data to take 100x longer to write when partitioning?

The id column has 10 unique values and the name column has 3000. The DataFrame has 10 additional integer columns.
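
For scale, the figures above imply a rough upper bound on the output layout. The sketch below is plain Scala arithmetic and is not part of the original question. It assumes every (id, name) pair actually occurs in the data, and that each write task emits one file into every partition directory it holds rows for. `PartitionEstimate` and the task count of 200 are hypothetical, chosen only for illustration.

```scala
// Back-of-the-envelope estimate of the output layout produced by partitionBy.
// The distinct-value counts (10 and 3000) come from the question;
// the task count (200) is a hypothetical value, not a measured one.
object PartitionEstimate {

  // Upper bound on partition directories: one per distinct (id, name) pair,
  // assuming every combination is present in the data.
  def maxDirectories(distinctIds: Long, distinctNames: Long): Long =
    distinctIds * distinctNames

  // Worst case: every write task holds rows for every directory
  // and writes one file into each of them.
  def maxFiles(directories: Long, numTasks: Long): Long =
    directories * numTasks

  def main(args: Array[String]): Unit = {
    val dirs  = maxDirectories(10L, 3000L)
    val files = maxFiles(dirs, 200L)
    println(s"up to $dirs directories and up to $files files")
  }
}
```

Whether this file count explains the observed slowdown is not established by the question. Counting the distinct (id, name) pairs in the real DataFrame would show how many directories are actually created.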