How to reliably write and restore partitioned data
Question
I am looking for a way to write and restore a partitioned dataset. For the purpose of this question I can accept both a partitioned RDD:
val partitioner: org.apache.spark.Partitioner = ???
rdd.partitionBy(partitioner)
and Dataset[Row] / DataFrame:
df.repartition($"someColumn")
The goal is to avoid a shuffle when the data is restored. For example:
spark.range(n).withColumn("foo", lit(1))
.repartition(m, $"id")
.write
.partitionBy("id")
.parquet(path)
shouldn't require a shuffle for:
spark.read.parquet(path).repartition(m, $"id")
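One way to see whether the round trip actually preserves partitioning is to inspect the physical plan for an Exchange operator after reading back. A minimal sketch (path, counts, and the local master are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("check-shuffle")
  .getOrCreate()
import spark.implicits._

val path = "/tmp/partitioned-parquet"  // hypothetical path

spark.range(10).withColumn("foo", lit(1))
  .repartition(4, $"id")
  .write
  .partitionBy("id")
  .parquet(path)

// If Spark preserved the partitioning across the write/read round trip,
// this plan would contain no Exchange node; in practice the plain Parquet
// source does not carry that information, so an Exchange appears.
spark.read.parquet(path).repartition(4, $"id").explain()
```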
I thought about writing a partitioned Dataset to Parquet, but I believe that Spark doesn't use this information.
I can only work with disk storage, not a database or data grid.
Answer
It can probably be achieved with bucketBy in the DataFrame/Dataset API, but there is a catch: saving directly to Parquet won't work; only saveAsTable does.
Dataset<Row> parquet = ...;
parquet.write()
    .bucketBy(1000, "col1", "col2")
    .partitionBy("col3")
    .saveAsTable("tableName");

sparkSession.read().table("tableName");
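In Scala the same idea looks like the sketch below (table and column names are illustrative). The bucketing metadata lives in the metastore, which is why saveAsTable is required; after reading the table back, an aggregation on the bucketing columns should be able to use that metadata instead of shuffling:

```scala
// Sketch: write a bucketed, partitioned table, then read it back.
df.write
  .bucketBy(1000, "col1", "col2")
  .partitionBy("col3")
  .saveAsTable("tableName")

val restored = spark.read.table("tableName")

// With bucketing enabled (the default), explain() for this aggregation
// should show no Exchange before the aggregate, because the data is
// already hash-distributed by the bucketing columns.
restored.groupBy("col1", "col2").count().explain()
```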
Another approach for Spark Core is to use a custom RDD, e.g. see https://github.com/apache/spark/pull/4449 - i.e. after reading the HDFS RDD you essentially set the partitioner back, but it's a bit hacky and not natively supported (so it needs to be adjusted for every Spark version).
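The trick can be sketched as a hypothetical RDD subclass that wraps the freshly-read RDD and simply re-declares the partitioner it was originally written with. This is an assumption-heavy sketch, not a supported API: it only tells Spark about the layout, so it is correct only if every file really does contain exactly the keys the partitioner maps to that partition index, and it relies on RDD internals that can change between Spark versions:

```scala
import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper: lies to Spark that `prev` is partitioned by `part`.
// Safe only when the on-disk layout genuinely matches the partitioner.
class AssumePartitionedRDD[K, V](prev: RDD[(K, V)], part: Partitioner)
    extends RDD[(K, V)](prev) {

  override val partitioner: Option[Partitioner] = Some(part)

  override protected def getPartitions: Array[Partition] = prev.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
    prev.iterator(split, context)
}

// Usage sketch: after this, joins/aggregations keyed the same way should
// not trigger a shuffle, because Spark sees a matching partitioner.
// val restored = new AssumePartitionedRDD(rawRdd, partitioner)
```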