How to reliably write and restore partitioned data
Question
I am looking for a way to write and restore a partitioned dataset. For the purpose of this question I can accept both a partitioned RDD:
val partitioner: org.apache.spark.Partitioner = ???
rdd.partitionBy(partitioner)
and Dataset[Row] / DataFrame:
df.repartition($"someColumn")
The goal is to avoid a shuffle when the data is restored. For example:
spark.range(n).withColumn("foo", lit(1))
.repartition(m, $"id")
.write
.partitionBy("id")
.parquet(path)
shouldn't require a shuffle for:
spark.read.parquet(path).repartition(m, $"id")
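One way to see whether the round trip actually preserves partitioning is to inspect the physical plan for an Exchange operator after reading back. A minimal sketch (path, counts, and the local master are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("check-shuffle")
  .getOrCreate()
import spark.implicits._

val path = "/tmp/partitioned-parquet"  // hypothetical path

spark.range(10).withColumn("foo", lit(1))
  .repartition(4, $"id")
  .write
  .partitionBy("id")
  .parquet(path)

// If Spark preserved the partitioning across the write/read round trip,
// this plan would contain no Exchange node; in practice the plain Parquet
// source does not carry that information, so an Exchange appears.
spark.read.parquet(path).repartition(4, $"id").explain()
```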
I thought about writing a partitioned Dataset to Parquet, but I believe that Spark doesn't use this information.
I can only work with disk storage, not a database or data grid.
Answer
It can probably be achieved with bucketBy in the DataFrame/Dataset API, but there is a catch: saving directly to Parquet won't work; only saveAsTable does.
Dataset<Row> parquet = ...;
parquet.write()
    .bucketBy(1000, "col1", "col2")
    .partitionBy("col3")
    .saveAsTable("tableName");

sparkSession.read().table("tableName");
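In Scala the same idea looks like the sketch below (table and column names are illustrative). The bucketing metadata lives in the metastore, which is why saveAsTable is required; after reading the table back, an aggregation on the bucketing columns should be able to use that metadata instead of shuffling:

```scala
// Sketch: write a bucketed, partitioned table, then read it back.
df.write
  .bucketBy(1000, "col1", "col2")
  .partitionBy("col3")
  .saveAsTable("tableName")

val restored = spark.read.table("tableName")

// With bucketing enabled (the default), explain() for this aggregation
// should show no Exchange before the aggregate, because the data is
// already hash-distributed by the bucketing columns.
restored.groupBy("col1", "col2").count().explain()
```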
Another approach for Spark Core is to use a custom RDD, e.g. see https://github.com/apache/spark/pull/4449 - i.e. after reading the HDFS RDD you essentially set the partitioner back, but it's a bit hacky and not natively supported (so it needs to be adjusted for every Spark version).
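The trick can be sketched as a hypothetical RDD subclass that wraps the freshly-read RDD and simply re-declares the partitioner it was originally written with. This is an assumption-heavy sketch, not a supported API: it only tells Spark about the layout, so it is correct only if every file really does contain exactly the keys the partitioner maps to that partition index, and it relies on RDD internals that can change between Spark versions:

```scala
import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper: lies to Spark that `prev` is partitioned by `part`.
// Safe only when the on-disk layout genuinely matches the partitioner.
class AssumePartitionedRDD[K, V](prev: RDD[(K, V)], part: Partitioner)
    extends RDD[(K, V)](prev) {

  override val partitioner: Option[Partitioner] = Some(part)

  override protected def getPartitions: Array[Partition] = prev.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
    prev.iterator(split, context)
}

// Usage sketch: after this, joins/aggregations keyed the same way should
// not trigger a shuffle, because Spark sees a matching partitioner.
// val restored = new AssumePartitionedRDD(rawRdd, partitioner)
```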