Do Spark/Parquet partitions maintain ordering?

Views: 80 | Published: 2021/6/14 19:23:56 | Tags: apache-spark, pyspark, parquet

Question

If I partition a data set, will it be in the correct order when I read it back? For example, consider the following pyspark code:

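from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# read a csv (header=True so that the customer_id column name is available)
df = sql_context.read.csv(input_filename, header=True)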

# add a hash column
hash_udf = udf(lambda customer_id: hash(customer_id) % 4, IntegerType())
df = df.withColumn('hash', hash_udf(df['customer_id']))

# write out to parquet
df.write.parquet(output_path, partitionBy=['hash'])

# read back the file
df2 = sql_context.read.parquet(output_path)

I am partitioning on a customer_id bucket. When I read back the whole data set, are the partitions guaranteed to be merged back together in the original insertion order?

Right now, I'm not so sure, so I'm adding a sequence column:

from pyspark.sql.functions import monotonically_increasing_id

df = df.withColumn('seq', monotonically_increasing_id())

However, I don't know if this is redundant.

Recommended answer

No, it's not guaranteed. Try it with even a tiny data set:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

df = spark.createDataFrame([(1,'a'),(2,'b'),(3,'c'),(4,'d')],['customer_id', 'name'])

# add a hash column
hash_udf = udf(lambda customer_id: hash(customer_id) % 4, IntegerType())
df = df.withColumn('hash', hash_udf(df['customer_id']))

# write out to parquet
df.write.parquet("test", partitionBy=['hash'], mode="overwrite")

# read back the file
df2 = spark.read.parquet("test")

df.show()

+-----------+----+----+
|customer_id|name|hash|
+-----------+----+----+
|          1|   a|   1|
|          2|   b|   2|
|          3|   c|   3|
|          4|   d|   0|
+-----------+----+----+

df2.show()

+-----------+----+----+
|customer_id|name|hash|
+-----------+----+----+
|          2|   b|   2|
|          1|   a|   1|
|          4|   d|   0|
|          3|   c|   3|
+-----------+----+----+
