Spark partitionBy much slower than without it
Problem description
I tested writing with:
df.write.partitionBy("id", "name")
.mode(SaveMode.Append)
.parquet(filePath)
However, if I leave out the partitioning:
df.write
.mode(SaveMode.Append)
.parquet(filePath)
It executes 100x(!) faster.
Is it normal for the same amount of data to take 100x longer to write when partitioning?
There are 10 and 3000 unique id and name column values respectively. The DataFrame has 10 additional integer columns.
Recommended answer
The first code snippet writes one Parquet file per partition to the file system (local or HDFS). This means that with 10 distinct ids and 3000 distinct names, the code creates up to 30,000 files. I suspect the overhead of creating all those files, writing Parquet metadata for each, and so on is quite large (in addition to the shuffle).
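The blow-up can be worse than 30,000 files: without repartitioning first, every write task may open one file per (id, name) combination it encounters, so the worst case is tasks × distinct partition values. Repartitioning on the partition columns (e.g. df.repartition($"id", $"name")) before the write brings this back down to roughly one file per combination. A minimal sketch of that arithmetic, where the task count of 200 is a hypothetical figure (Spark's default shuffle partition count), not something stated in the question:

```scala
object PartitionFileCount {
  // Worst case without a prior repartition: each write task may open
  // one output file per (id, name) combination it sees.
  def worstCaseFiles(tasks: Int, distinctIds: Int, distinctNames: Int): Long =
    tasks.toLong * distinctIds * distinctNames

  // Roughly the best case after repartitioning on the partition columns:
  // one file per (id, name) combination.
  def bestCaseFiles(distinctIds: Int, distinctNames: Int): Long =
    distinctIds.toLong * distinctNames

  def main(args: Array[String]): Unit = {
    println(bestCaseFiles(10, 3000))        // 30000 files even in the best case
    println(worstCaseFiles(200, 10, 3000))  // up to 6000000 with 200 write tasks
  }
}
```

Either way, tens of thousands of small Parquet files dwarf the cost of writing the same rows into a handful of unpartitioned files, which is consistent with the 100x slowdown observed.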
Spark is not the best database engine. If your dataset fits in memory, I suggest using a relational database instead; it will be faster and easier to work with.